
Evaluation 1: System-Oriented

Tetsuya Sakai

@tetsuyasakai

Waseda University

August 24, 2015 @ ASSIA 2015, Taipei.

About Tetsuya Sakai

• Professor – Department of Computer Science at Waseda University

• Associate Dean – IT Strategies Division of Waseda University

• Visiting professor – National Institute of Informatics

• Researcher in information retrieval, natural language processing, interaction

• Editor-in-chief (Asia/Australasia) – Information Retrieval Journal (Springer)

• SIGIR 2013 PC co-chair

• SIGIR 2017 general co-chair

• NTCIR general co-chair

• Toshiba → Cambridge U → Toshiba → NewsWatch → Microsoft Research Asia → Waseda

LECTURE OUTLINE

1. Why evaluate?

2. Set retrieval evaluation measures

3. Ranked retrieval evaluation measures

4. More evaluation measures

5. Statistical significance, power, effect sizes

6. Summary

7. References

Why measure?

• IR researchers’ goal: build systems that satisfy the user’s information needs.

• We cannot ask users all the time, so we need measures as surrogates of user satisfaction/performance.

• “If you cannot measure it, you cannot improve it.” http://zapatopi.net/kelvin/quotes/

[Figure: each system is scored by a measure, and the scores drive improvements to the systems. The measure stands in for user satisfaction. Does it correlate with user satisfaction?]

Improvements that don’t add up [Armstrong09]

Armstrong et al. analysed 106 papers from SIGIR ’98-’08 and CIKM ’04-’08 that used TREC data, and reported:

• Researchers often use low baselines

• Researchers claim statistically significant improvements, but the results are often not competitive with the best TREC systems

• IR effectiveness has not really improved over a decade!

What we want vs. what we’ve got?

[Figure: two cartoons. What we want: researchers who have each invented an IR system evaluate it with a shared Test Collection A, leading to the best IR system in the world. What we’ve got: one researcher says “I’ve built Test Collection A to evaluate it. I’ve evaluated my system with A and it’s the best”; another says “I’ve built Test Collection B to evaluate it. I’ve evaluated my system with B and it’s the best.”]

A typical test collection

• Document collection

• Topic set

• Relevance assessments (“qrels”): relevant/nonrelevant documents for each topic

Example topic: “The Sakai Lab home page”

sakailab.com: relevant

www.f.waseda.jp/tetsuya/: relevant

http://tanabe-agency.co.jp/talent/sakai_masato/: nonrelevant

2. Set retrieval evaluation measures

Recall, Precision and E-measure [vanRijsbergen79]

A: Relevant docs, B: Retrieved docs, A ∩ B: relevant docs that were retrieved

• E-measure = (|A∪B|-|A∩B|)/(|A|+|B|)

= 1 – 1/(0.5*(1/Prec) + 0.5*(1/Rec))

where Prec=|A∩B|/|B|, Rec=|A∩B|/|A|.

A generalised form

= 1 – 1/(α*(1/Prec) + (1-α)*(1/Rec))

= 1 – (β² + 1)*Prec*Rec/(β²*Prec+Rec)

where α = 1/(β² + 1).

F-measure

• F-measure = 1 – E-measure

= 1/(α*(1/Prec) + (1-α)*(1/Rec))

= (β² + 1)*Prec*Rec/(β²*Prec+Rec)

where α = 1/(β² + 1).

• F with β=b is often expressed as Fb.

• F1 = 2*Prec*Rec/(Prec+Rec)

i.e. harmonic mean of Prec and Rec

User attaches β times as much importance to Rec as to Prec (∂E/∂Rec = ∂E/∂Prec when Rec/Prec = β) [vanRijsbergen79]

Harmonic vs. arithmetic mean

[Figure: two surface plots over Prec and Rec in [0,1], one for the harmonic mean (F1) and one for the arithmetic mean (Prec+Rec)/2, shaded in bands 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.]

Prec   Rec   F1     (Prec+Rec)/2
0      1     0      0.5
0.5    0.5   0.5    0.5
0.1    0.9   0.18   0.5

Harmonic mean: balance important. Arithmetic mean: balance NOT important.
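The three cases above can be checked with a minimal sketch of the F-measure formula. The function names `f_measure` and `e_measure` are illustrative, and returning 0 when the denominator is 0 is an assumption made here to avoid a division by zero.

```python
def f_measure(prec, rec, beta=1.0):
    """F_beta = (beta^2 + 1) * Prec * Rec / (beta^2 * Prec + Rec)."""
    denom = beta ** 2 * prec + rec
    if denom == 0:
        # Assumption: define F as 0 when both Prec and Rec are 0.
        return 0.0
    return (beta ** 2 + 1) * prec * rec / denom


def e_measure(prec, rec, beta=1.0):
    """E-measure = 1 - F-measure."""
    return 1.0 - f_measure(prec, rec, beta)


for prec, rec in [(0.0, 1.0), (0.5, 0.5), (0.1, 0.9)]:
    harmonic = f_measure(prec, rec)      # F1
    arithmetic = (prec + rec) / 2        # the same in all three cases
    print(prec, rec, round(harmonic, 2), arithmetic)
```

The arithmetic mean stays at 0.5 in every case, while F1 drops as Prec and Rec become unbalanced.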

3. Ranked retrieval evaluation measures

Interpolated precision

Example ranked list (R=5 relevant docs in total):

r   Doc        Rec(r)   Prec(r)
1   relevant   0.2      1
2   nonrel     0.2      0.5
3   nonrel     0.2      0.33
4   relevant   0.4      0.5
5   relevant   0.6      0.6
6   nonrel     0.6      0.5
7   relevant   0.8      0.57
8   nonrel     0.8      0.5

Interpolated precision at recall level i:

IPi = max Prec(r) over all r s.t. Rec(r)>=i

i     IPi
0     1
0.1   1
0.2   1
0.3   0.6
0.4   0.6
0.5   0.6
0.6   0.6
0.7   0.57
0.8   0.57
0.9   0
1     0

“The major issue addressed by interpolation is that it rarely happens that any particular recall point is achieved.” [Buckley05, p.56]
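The interpolation rule can be sketched as follows, using the eight-document example with R=5. The function name and the choice of 11 recall levels (0, 0.1, ..., 1) are illustrative assumptions; recall levels that are never reached get an interpolated precision of 0, as in the example.

```python
def interpolated_precision(rels, R, levels=None):
    """rels: 0/1 relevance flags by rank; R: total number of relevant docs."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    points = []  # (Rec(r), Prec(r)) for every rank r
    hits = 0
    for r, rel in enumerate(rels, start=1):
        hits += rel
        points.append((hits / R, hits / r))
    # IP_i = max Prec(r) over ranks with Rec(r) >= i (0 if there is no such rank).
    # The small tolerance guards against floating-point noise in the comparison.
    return [
        max((prec for rec, prec in points if rec >= i - 1e-12), default=0.0)
        for i in levels
    ]


rels = [1, 0, 0, 1, 1, 0, 1, 0]
ip = interpolated_precision(rels, R=5)
print([round(v, 2) for v in ip])
```

To draw a curve for a topic set, the per-topic lists returned by this function would be averaged level by level.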

Recall-precision graphs

[Figure: interpolated precision at i (y-axis, 0 to 1) plotted against recall level i (x-axis, 0 to 1) for the IPi values of the example above.]

To draw a Rec-Prec curve for a set (T) of topics, plot ΣT IPi / |T| for each i

Average Precision [Buckley05]

• Introduced at TREC-2 (1993), implemented in trec_eval by Buckley

$$ AP = \frac{1}{R} \sum_{r} I(r) \frac{C(r)}{r} $$

R: total number of relevant docs
r: document rank
I(r): flag indicating a relevant doc at rank r
C(r): number of relevant docs within ranks [1,r]

Most widely used binary-relevance IR metric since the 1990s, but it cannot distinguish between Systems A and B.

[Figure: ranked lists of System A and System B, made up of highly relevant and partially relevant documents; AP(A) = AP(B).]

A user model for AP [Robertson08]

• Different users stop scanning the ranked list at different ranks. They only stop at a relevant document.

• The user distribution is uniform across all (R) relevant documents.

• At each stopping point, compute utility (Prec).

• Hence AP is the expected utility for the user population.
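A minimal sketch of this user model, with illustrative names: each of the R relevant documents is the stopping point of an equal share of users, the utility at a stopping point is Prec(r), and users whose stopping document is never retrieved get utility 0.

```python
def average_precision(rels, R):
    """rels: 0/1 relevance flags by rank; R: total number of relevant docs."""
    hits = 0          # C(r): relevant docs seen so far
    total = 0.0
    for r, rel in enumerate(rels, start=1):
        if rel:                   # I(r) = 1: some users stop here
            hits += 1
            total += hits / r     # utility for those users: Prec(r)
    # Uniform user distribution over all R relevant docs.
    return total / R


ap = average_precision([1, 0, 0, 1, 1, 0, 1, 0], R=5)
print(round(ap, 4))
```

With the ranked list from the interpolated precision example, four of the five relevant documents are retrieved, so one fifth of the user population contributes zero utility.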

Normalised Discounted Cumulative Gain [Jarvelin02]

• Introduced at ACM SIGIR 2000/TOIS 2002, a variant of the sliding ratio [Pollack68]

• Popular “Microsoft version” [Burges05]:

$$ nDCG = \frac{\sum_{r=1}^{md} g(r)/\log(r+1)}{\sum_{r=1}^{md} g^*(r)/\log(r+1)} $$

md: document cutoff (e.g. 10)
g(r): gain value at rank r, e.g. 1 if the doc is partially relevant, 3 if the doc is highly relevant
g*(r): gain value at rank r of an ideal ranked list

Original definition [Jarvelin02] not recommended: a system that returns a relevant document at rank 1 and one that returns a relevant document at rank b are treated as equally effective, where b is the logarithm base (patience parameter). The b’s cancel out in the Burges definition.

nDCG: an example
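A minimal sketch of the “Microsoft version”, assuming gain values of 3 (highly relevant), 1 (partially relevant) and 0 (nonrelevant). The function names and the ranked lists are hypothetical. The ideal list is built by sorting the gains of all relevant documents for the topic in decreasing order.

```python
import math


def dcg(gains, md):
    """Sum of g(r) / log(r + 1) over ranks 1..md (the log base cancels in nDCG)."""
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:md], start=1))


def ndcg(run_gains, all_gains, md=10):
    """run_gains: gains of the system's ranked list, by rank.
    all_gains: gains of every relevant doc for the topic (any order)."""
    ideal = sorted(all_gains, reverse=True)
    ideal_dcg = dcg(ideal, md)
    return dcg(run_gains, md) / ideal_dcg if ideal_dcg > 0 else 0.0


all_gains = [3, 1]                      # one highly and one partially relevant doc
print(ndcg([3, 0, 1], all_gains))       # highly relevant doc at rank 1
print(ndcg([1, 0, 3], all_gains))       # same docs, highly relevant doc at rank 3
```

Unlike AP, the two hypothetical lists receive different scores, because the position of the highly relevant document matters.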

Q-measure [Sakai05AIRS,Sakai07IPM]

• A graded relevance version of AP (see also Graded AP [Robertson10]).

• Same user model as AP, but the utility is computed using the blended ratio BR(r) instead of Prec(r).

$$ Q = \frac{1}{R} \sum_{r} I(r)\, BR(r) \quad \text{where} \quad BR(r) = \frac{C(r) + \beta \sum_{k=1}^{r} g(k)}{r + \beta \sum_{k=1}^{r} g^*(k)} $$

β: patience parameter (β=0 ⇒ BR=Prec, hence Q=AP)

BR(r) combines Precision and normalised cumulative gain (nCG) [Jarvelin02]

Value of the first relevant document at rank r according to BR(r) (binary relevance, R=5) [Sakai14PROMISE]

[Figure: BR(r) (y-axis, 0 to 1) against rank r (x-axis, 1 to 20), with one curve each for β=0.1, β=1 and β=10.]

r<=R ⇒ BR(r)=(1+β)/(r+βr)=1/r=P(r)

r>R ⇒ BR(r)=(1+β)/(r+βR)

Large β ⇒ more tolerance to relevant docs at low ranks

Q: An example (with β=1)
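A minimal sketch of Q with the blended ratio, assuming BR(r) = (C(r) + β·cg(r)) / (r + β·cg*(r)), where cg and cg* are the cumulative gains of the system list and of an ideal list. The function name and the gain values are illustrative.

```python
def q_measure(run_gains, all_gains, beta=1.0):
    """run_gains: gains of the ranked list, by rank (0 = nonrelevant).
    all_gains: gains of every relevant doc for the topic (any order)."""
    R = sum(1 for g in all_gains if g > 0)
    ideal = sorted(all_gains, reverse=True)
    cg = cg_ideal = 0.0
    count = 0        # C(r): relevant docs within ranks [1, r]
    total = 0.0
    for r, g in enumerate(run_gains, start=1):
        cg += g
        cg_ideal += ideal[r - 1] if r <= len(ideal) else 0.0
        if g > 0:    # I(r) = 1
            count += 1
            total += (count + beta * cg) / (r + beta * cg_ideal)  # BR(r)
    return total / R if R else 0.0


# Binary relevance, R=5, first relevant doc at rank 7 (> R):
# BR(7) = (1 + beta) / (7 + beta * 5) = 2/12 with beta = 1.
print(q_measure([0, 0, 0, 0, 0, 0, 1], [1] * 5, beta=1.0))
```

Setting `beta=0` reduces BR(r) to Prec(r), so the function then returns AP.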

Normalised Cumulative Utility [Sakai08EVIA]

• Generalises AP and Q

• NCU = Σr Pr(r)*NU(r)

Pr(r): stopping probability at rank r (rank-biased or graded-uniform)

NU(r): normalised utility at rank r (Prec(r) or BR(r))

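Expected Reciprocal Rank [Chapelle09]

• “ERR can be seen as a special case of Normalized Cumulative Utility (NCU)” [Chapelle09, p.625]

• No recall component

$$ ERR = \sum_{r} \Big( P_{sat}(r) \prod_{k=1}^{r-1} \big(1 - P_{sat}(k)\big) \Big) \frac{1}{r} $$

where P_sat(r) is the probability that the document at rank r satisfies the user.

P_sat(r) Π(1 - P_sat(k)): probability that the user is finally satisfied at r

1/r: utility at r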
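A minimal sketch of ERR, assuming integer relevance grades and the mapping from grade to satisfaction probability used in [Chapelle09], (2^g - 1) / 2^g_max. The function name and the example grades (0, 1, 2) are illustrative.

```python
def err(grades, max_grade, depth=None):
    """grades: integer relevance grades of the ranked list, by rank."""
    not_satisfied = 1.0   # probability that the user reaches the current rank
    total = 0.0
    for r, g in enumerate(grades[:depth], start=1):
        p_sat = (2 ** g - 1) / 2 ** max_grade
        total += not_satisfied * p_sat / r   # finally satisfied at r, utility 1/r
        not_satisfied *= 1 - p_sat
    return total


# Diminishing return: the same grade-1 doc at rank 2 is worth less
# when rank 1 already holds a grade-2 doc than when rank 1 is nonrelevant.
after_highly_relevant = err([2, 1], max_grade=2) - err([2], max_grade=2)
after_nonrelevant = err([0, 1], max_grade=2)
print(after_highly_relevant, after_nonrelevant)
```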

ERR’s diminishing return property

“Thus, if for example document two merely restates information already gleaned from document one and hence is of no actual benefit to this user, he may wish to assign it a negative document utility, no matter how ‘relevant’ its content might have been to the original information need.” [Cooper73, p.90]

“This is a diminishing return property which seems highly desirable for most IR tasks: if we have already shown a lot of relevant documents, there should be less added value in showing more relevant documents.” [Chapelle11,p.582]

Rank-biased NCU [Sakai08EVIA] also has this property

Ranked retrieval measures: summary 1 (not exhaustive)

[Table: columns are AP, Q-measure, ERR, nDCG; rows are the criteria below.]

• Handling graded relevance (there are a few graded-relevance versions of AP, but AP almost always means binary-relevance AP)

• Diminishing return (navigational intent)

• Discriminative power [Sakai06SIGIR,07SIGIR]: how many statistically significant system pairs can be obtained (see Section 5)

• Widely used (slide note: used widely at NTCIR)

NCU [Sakai08EVIA] = f( stopping_probability_over_r, utility_at_r )

4. More evaluation measures

Time-Biased Gain (TBG) [Smucker12]

TBG = Σr (gain at rank r) * (discount based on the time to reach r)

• Value of information decays with time

• Time to reach r: the user reads (r-1) snippets, and possibly clicks some docs and reads them

• Snippet reading time: a constant

• Doc reading time: linear in doc length

U-measure [Sakai13SIGIR]

• U can be used not only for traditional IR, but also for various other tasks such as session IR, aggregated search, summarisation, question answering etc.

• While other measures are based on ranks, U abandons the notion of rank. It focusses on the amount of text that the user has read within a search session.

• Instead of ranks, U uses the positions of relevant pieces of information on a trailtext.

Trailtext for U

Just concatenate all the texts that the user has (probably) read. For web search, one simple user model would be to assume that users read all snippets (fixed-length), plus parts of relevant documents.

If the nonrel at rank 2 (snippet) is replaced with a rel (snippet + full text), the value of the rel at rank 4 is always reduced. U therefore satisfies diminishing return.

Position-based discounting for U

Ranked retrieval measures: summary 2 (not exhaustive)

[Table: columns are AP, Q-measure, ERR, nDCG, TBG, U-measure; rows are the criteria below.]

• Handling graded relevance

• Diminishing return (navigational intent)

• Discriminative power [Sakai06SIGIR,07SIGIR]

• Considers document lengths and search engine snippets

• Handles nonlinear traversal [Sakai14PROMISE]: users do NOT always scan from top to bottom!

• Widely used (slide note: TREC Contextual Suggestion)

Diversified search – a new IR task

Since 2003 or so

• Given an ambiguous/underspecified query, produce a single Search Engine Result Page (SERP) that satisfies different user intents!

• Challenge: balancing relevance and diversity

[Figure: a SERP annotated with design questions: highly relevant near the top; give more space to popular intents? give more space to informational intents? cover many intents.]

Diversity test collections have relevance assessments for each intent, rather than for each topic

Diversified search measures summary

[Table: columns are α-nDCG [Clarke08], ERR-IA [Chapelle11], D#-nDCG [Sakai11SIGIR], DIN#-nDCG and P+Q# [Sakai12WWW], U-IA [Sakai13SIGIR]; rows are the criteria below.]

• Handling per-intent graded relevance

• Handling intent probabilities

• Handling both informational and navigational intents

• Per-intent diminishing return

• Discriminative power [Sakai06SIGIR,07SIGIR]

• Concordance test [Sakai12WWW,13IRJ]: agree with simple measures?

• Considers document lengths and search engine snippets

• Widely used

Slide note: M-measure@NTCIR MobileClick

5. Statistical significance, power, effect sizes

So you used a test collection that has n=20 topics to compute nDCG scores for two systems X and Y.

Which system is more effective?

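Scores for X, Y: $(x_1, \ldots, x_n)$, $(y_1, \ldots, y_n)$

Per-topic difference: $d_j = x_j - y_j$

Sample mean of the differences: $\bar{d} = \frac{1}{n} \sum_{j=1}^{n} d_j = 0.0750$

Sample variance: $V = \frac{1}{n-1} \sum_{j=1}^{n} (d_j - \bar{d})^2 = 0.0251$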

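Assumption: random sampling from normal distributions

$x_j \sim N(\mu_X, \sigma_X^2)$ (population distribution of X), $y_j \sim N(\mu_Y, \sigma_Y^2)$ (population distribution of Y)

Under the above assumptions, the sample mean of the differences $\bar{d}$ obeys $N(\mu, \sigma^2/n)$, where $\mu = \mu_X - \mu_Y$ is the population mean of the difference and $\sigma^2$ is the population variance of the difference.

Under the above assumptions, $t = (\bar{d} - \mu)/\sqrt{V/n}$ obeys a t distribution with (n-1) degrees of freedom, where V is the sample variance of the differences and $\mu = \mu_X - \mu_Y$ is the population mean of the difference.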

Which system is more effective? Or, which of these hypotheses is true?

H0: $\mu_X = \mu_Y$ (if you look at the populations, X and Y are equally effective)

H1: $\mu_X \neq \mu_Y$ (if you look at the populations, X and Y are actually different)

All we have is the sample data.

If H0 is true, the t statistic $t_0 = \bar{d}/\sqrt{V/n}$ (with $\bar{d}$ the sample mean and V the sample variance of the per-topic differences) obeys a t distribution with φ=(n-1) degrees of freedom.

$V = S/\phi$, where $S = \sum_{j} (d_j - \bar{d})^2$ is the sum of squares.

Degrees of freedom: number of independent variables in a sum of squares = accuracy of the sum

[Figure: t distributions for φ=4 and φ=99 (x-axis: t from -5.0 to 5.0; y-axis: density from 0 to 0.4), with the observed value t0 computed from the sample marked.]

P-value: area under the curve = probability of observing t0 or something more extreme IF H0 is true.

Significance level α: areas under the curve (α/2 in each tail, leaving 1-α in the middle) = a pre-determined probability (e.g. 5%) of observing something very rare under H0

[Figure: the same t distributions with the two tail regions of area α/2 each, the central region of area (1-α), and the p-value for t0.]

If p-value <= α, then something highly unlikely (e.g. 5% chance) under H0 has happened ⇒ H0 is probably wrong, with 100(1-α)% (e.g. 95%) confidence!

We reject H0, and say that the difference is statistically significant at the significance level of α. The population means are probably different!

If p-value > α, then what we have observed is something we expect under H0. We accept H0, and say that the difference is NOT statistically significant at the significance level of α. This just means that we cannot tell from the data whether H0 is true.

Example: paired t-test using Excel

Mean nDCG over 20 topics: X 0.3450, Y 0.2700

Significance level α = 0.05 (95% confidence)

Sample size n = 20

Degrees of freedom φ = 20-1 = 19

Sample mean of the differences: 0.3450 - 0.2700 = 0.0750

Sample variance of the differences: 0.0251

t statistic: 0.0750/√(0.0251/20) = 2.116

p-value = T.DIST.2T( 2.116, 19 ) = 0.048 < α

X is statistically significantly better than Y at α=0.05.
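The same test can be sketched without Excel. This is a standard-library-only sketch under stated assumptions: it starts from the rounded summary statistics of the example (so t0 comes out near 2.117 rather than 2.116), and it obtains the two-sided p-value by numerically integrating the t density with Simpson's rule instead of calling a statistics library. All function names are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t0, df, steps=10000):
    """p = 1 - 2 * integral of the density over [0, |t0|] (Simpson's rule)."""
    b = abs(t0)
    h = b / steps                      # steps must be even for Simpson's rule
    s = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        s += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 1 - 2 * s * h / 3


n = 20            # number of topics
d_bar = 0.0750    # sample mean of the per-topic differences
V = 0.0251        # sample variance of the per-topic differences
t0 = d_bar / math.sqrt(V / n)
p = two_sided_p_value(t0, n - 1)
print(round(t0, 3), round(p, 3))
```

Given the per-topic score lists, `d_bar` and `V` would be computed from the differences first (for example with `statistics.mean` and `statistics.variance`).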

Limitations of significance testing (1)

• Normality assumptions: computer-based alternatives (bootstrap [Savoy97, Sakai06SIGIR], randomisation test [Smucker07]) that do not rely on the assumptions are available. But the results are similar to those obtained by the t-test.

• Dichotomous decision:

p-value = 0.049 < α ⇒ statistically significant! Publish a paper!

p-value = 0.051 > α ⇒ not statistically significant! Put it in the drawer!

Saying “p-value=0.049” is much more informative than saying “significant at α=0.05”. Report the p-value! [Sakai14forum]
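As an illustration of the computer-based alternatives mentioned above, here is a minimal sketch of a paired randomisation (sign-flipping) test. It is not the implementation of any particular tool; the function name, the number of trials, the fixed seed and the per-topic scores are assumptions made for the example.

```python
import random


def randomisation_test(x, y, trials=10000, seed=0):
    """Two-sided paired randomisation test on per-topic scores x and y.

    Under H0 the labels X and Y are exchangeable within each topic, so the
    sign of each per-topic difference can be flipped at random."""
    rng = random.Random(seed)
    d = [a - b for a, b in zip(x, y)]
    observed = abs(sum(d) / len(d))
    extreme = 0
    for _ in range(trials):
        flipped = [v if rng.random() < 0.5 else -v for v in d]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    return extreme / trials   # estimated p-value


# Hypothetical per-topic scores for two systems over 10 topics.
x = [0.5] * 10
y = [0.1] * 10
print(randomisation_test(x, y))
```

No normality assumption is needed here. The p-value is simply the fraction of random relabellings that produce a mean difference at least as extreme as the observed one.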


We get a statistically significant result whenever the p-value is small ⇔ the t-value is large.

The t-value is √n times the sample effect size, so it is large when (a) the sample size n is large; or (b) the sample effect size (the difference measured in standard deviation units) is large.

If n is large, you can get a statistically significant result with ANYTHING!


Don’t just report the p-value. Report the sample effect size! [Sakai14forum]

The sample effect size (sample mean of the differences divided by their sample standard deviation) is 0.0750/√0.0251 = 0.4734 in the previous example. This reflects how substantial the difference may be.
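The relation between the t-value, the sample size and the sample effect size can be sketched as follows, using the summary statistics of the example (sample mean 0.0750, sample variance 0.0251, n=20). The function names and the second scenario (a tiny effect size of 0.05 with n=10000 topics) are hypothetical.

```python
import math


def sample_effect_size(d_bar, V):
    """Difference measured in standard deviation units."""
    return d_bar / math.sqrt(V)


def t_statistic(effect_size, n):
    """t0 = sqrt(n) * sample effect size."""
    return math.sqrt(n) * effect_size


es = sample_effect_size(0.0750, 0.0251)
print(round(es, 4))                        # effect size of the example
print(round(t_statistic(es, 20), 3))       # t-value of the example
print(round(t_statistic(0.05, 10000), 3))  # tiny effect, huge n: large t-value
```

The last line shows why the p-value alone is not enough: a negligible effect still produces a large t-value once n is large.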

What about the sample size n?

In significance testing, there are four important parameters. If three of them are set, the fourth one is uniquely determined.

• α: probability of Type I error (detecting a nonexistent difference)

• β: probability of Type II error (missing a true difference)

• effect size: magnitude of the difference

• sample size n: number of topics

              H0 is true   H1 is true
H0 accepted   1-α          β
H0 rejected   α            1-β

Statistical power (1-β): ability to detect a true difference

While IR test collections typically have n=50 topics, it is possible to determine the right n by setting α, β, and the minimum effect size that you want to detect [Sakai15IRJ].

Comparing more than two systems

• Conducting a t-test for every system pair is not good (though there are exceptions [Sakai15IRJ]) - the familywise error rate problem.

• Use a proper multiple comparison procedure.

• Recommended: randomised Tukey HSD test [Carterette12,Sakai14PROMISE].

• [Sakai14forum] says do an ANOVA (analysis of variance) test first, followed by a Tukey HSD test. But this also causes a problem similar to the familywise error rate. If you are interested in the difference between every system pair, conduct Tukey without conducting ANOVA.
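The familywise error rate problem can be illustrated with a minimal sketch. It assumes, purely for simplicity, that the pairwise tests are independent (they are not in practice, since system pairs share systems and topics), in which case the probability of at least one Type I error over k tests is 1 - (1 - α)^k. The function name is illustrative.

```python
from math import comb


def familywise_error_rate(num_systems, alpha=0.05):
    """Probability of at least one Type I error when every system pair is
    tested at level alpha, assuming independent tests (a simplification)."""
    num_pairs = comb(num_systems, 2)
    return 1 - (1 - alpha) ** num_pairs


for m in (2, 5, 10):
    print(m, comb(m, 2), round(familywise_error_rate(m), 3))
```

With five systems there are already ten pairs, and under this simplification the chance of at least one false positive is about 40%, far above the nominal 5%.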


Summary

• Ranked retrieval measures serve as surrogates of user satisfaction/performance, each with a different set of assumptions. They may be compared using discriminative power [Sakai06SIGIR,07SIGIR], the concordance test [Sakai12WWW,13IRJ] etc. We want measures that reliably measure what we want to measure!

• Principles and limitations of statistical significance testing, esp. the paired t-test. Report the p-values and effect sizes [Sakai14forum]. Type I errors, Type II errors (1-power), effect sizes and sample sizes. A multiple comparison procedure should be used for more than two systems.

Let’s write good IR papers!

Tools (by Tetsuya Sakai)

• NTCIREVAL (computes various evaluation measures)

http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html

• BOOTS (bootstrap hypothesis test as an alternative to the t-test)

http://research.nii.ac.jp/ntcir/tools/boots-en.html

• Discpower (randomisation test as an alternative to the t-test, and randomised Tukey HSD test for comparing more than two systems)

http://research.nii.ac.jp/ntcir/tools/discpower-en.html

• Topic set size design Excel tools (how many topics do we need?):

http://www.f.waseda.jp/tetsuya/tools.html



References (1)

[Armstrong09] Armstrong, T.G., Moffat, A., Webber, W. and Zobel, J.: Improvements that Don’t Add Up: Ad-hoc Retrieval Results Since 1998, ACM CIKM 2009, pp.601-610, 2009.

[Buckley05] Buckley, C. and Voorhees, E.M.: Retrieval System Evaluation, In TREC: Experiment and Evaluation in Information Retrieval (Voorhees, E.M. and Harman, D.K., eds.), Chapter 3, The MIT Press, 2005.

[Burges05] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to Rank Using Gradient Descent, ICML 2005, pp.89-96, 2005.

[Chapelle09] Chapelle, O., Metzler, D., Zhang, Y., Grinspan, P.: Expected Reciprocal Rank for Graded Relevance, ACM CIKM 2009, pp.621-630, 2009.

[Chapelle11] Chapelle, O., Ji, S., Liao, C., Velipasaoglu, E., Lai, L., Wu, S.L.: Intent-based Diversification of Web Search Results: Metrics and Algorithms, Information Retrieval, 14(6), pp.572-592, 2011.

[Clarke08] Clarke, C.L.A., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Buttcher, S. and MacKinnon, I.: Novelty and Diversity in Information Retrieval Evaluation, ACM SIGIR 2008, pp.659-666, 2008.

[Cooper73] Cooper, W.S.: On Selecting a Measure of Retrieval Effectiveness, JASIS 24(2), pp.87–100, 1973.

References (2)

[Jarvelin02] Jarvelin, K. and Kekalainen, J.: Cumulated Gain-based Evaluation of IR Techniques, ACM TOIS, 20(4), pp.422-446, 2002.

[Pollack68] Pollack, S.M.: Measures for the Comparison of Information Retrieval Systems, American Documentation, 19(4), pp.387-397, 1968.

[Robertson08] Robertson, S.E.: A New Interpretation of Average Precision, ACM SIGIR 2008, pp.689-690, 2008.

[Robertson10] Robertson, S.E., Kanoulas, E., Yilmaz, E.: Extending Average Precision to Graded Relevance Judgments, ACM SIGIR 2010, pp.603-610, 2010.

[Savoy97] Savoy, J.: Statistical Inference in Retrieval Effectiveness Evaluation, Information Processing and Management, 33(4), pp.495-512, 1997.

References (3)

[Sakai05AIRS] Sakai, T.: Ranking the NTCIR Systems based on Multigrade Relevance, AIRS 2004 (LNCS 3411), pp.251-262, 2005.

[Sakai06SIGIR] Sakai, T.: Evaluating Evaluation Metrics based on the Bootstrap, ACM SIGIR 2006, pp.525-532, 2006.

[Sakai07SIGIR] Sakai, T.: Alternatives to Bpref, ACM SIGIR 2007, pp.71-78, 2007.

[Sakai07IPM] Sakai, T.: On the Reliability of Information Retrieval Metrics based on Graded Relevance, Information Processing and Management, 43(2), pp.531-548, 2007.

[Sakai08EVIA] Sakai, T. and Robertson, S.: Modelling A User Population for Designing Information Retrieval Metrics, EVIA 2008, pp.30-41, 2008.

[Sakai11SIGIR] Sakai, T. and Song, R.: Evaluating Diversified Search Results Using Per-Intent Graded Relevance, ACM SIGIR 2011, pp.1043-1052, 2011.

[Sakai12WWW] Sakai, T.: Evaluation with Informational and Navigational Intents, WWW 2012, pp.499-508, 2012.

[Sakai13SIGIR] Sakai, T., Dou, Z.: Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation, ACM SIGIR 2013, pp.473-482, 2013.

[Sakai13IRJ] Sakai, T. and Song, R.: Diversified Search Evaluation: Lessons from the NTCIR-9 INTENT Task, Information Retrieval, 16(4), pp.504-529, Springer, 2013.

[Sakai14PROMISE] Sakai, T.: Metrics, Statistics, Tests, PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), Springer, pp.116-163, 2014.

[Sakai14forum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1), pp.3-12, 2014.

[Sakai15IRJ] Sakai, T.: Topic Set Size Design, Information Retrieval Journal, submitted.

References (4)

[Smucker07] Smucker, M.D., Allan, J. and Carterette, B.: A Comparison of Statistical Significance Tests for Information Retrieval Evaluation, ACM CIKM 2007, pp.623-632, 2007.

[Smucker12] Smucker, M.D. and Clarke, C.L.A.: Time-based Calibration of Effectiveness Measures, ACM SIGIR 2012, pp. 95–104 , 2012.

[vanRijsbergen79] van Rijsbergen, C.J., Information Retrieval, Chapter 7, Butterworths, 1979.
