Page 1

Information Retrieval Evaluation

http://net.pku.edu.cn/~wbia
黄连恩

hle@net.pku.edu.cn
北京大学信息工程学院

10/21/2014

Page 2

IR Evaluation Methodology

Page 3

Measures for a search engine

Speed of index construction: number of documents per hour; document size

Search speed: latency as a function of index size; throughput as a function of index size

Expressiveness of the query language: ability to express complex information needs; speed on complex queries

These criteria are measurable, but the more critical measure is user happiness. How can it be quantified?

Page 4

Measuring user happiness

Issue: who is the user? Web search engine: the user finds what they want and returns to the engine; we can measure the rate of returning users.

eCommerce site: the user finds what they want and makes a purchase.

Is it the end-user, or the eCommerce site, whose happiness we measure?

Measure time to purchase, or fraction of searchers who become buyers?

Enterprise (company/govt/academic): Care about “user productivity”

How much time do my users save when looking for information?

Many other criteria having to do with breadth of access, secure access, etc.

Page 5

Happiness: elusive to measure

Commonest proxy: relevance of search results

But how do you measure relevance? Methodology: a test collection (corpus):
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-document pair
Some work uses more-than-binary judgments, but binary is the standard.

Page 6

Evaluating an IR system

Note: the information need is translated into a query. Relevance is assessed relative to the information need, not the query.

E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

Query: wine red white heart attack effective

You evaluate whether the doc addresses the information need, not whether it has those words.

Page 7

Evaluation Corpus
• TREC - the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
• Test collections consist of documents, queries, and relevance judgments, e.g.,
– CACM: titles and abstracts from the Communications of the ACM from 1958-1979. Queries and relevance judgments generated by computer scientists.
– AP: Associated Press newswire documents from 1988-1990 (from TREC disks 1-3). Queries are the title fields from TREC topics 51-150. Topics and relevance judgments generated by government information analysts.
– GOV2: web pages crawled from websites in the .gov domain during early 2004. Queries are the title fields from TREC topics 701-850. Topics and relevance judgments generated by government analysts.

Page 8

Test Collections

Page 9

TREC Topic Example

Page 10

Relevance Judgments

• Obtaining relevance judgments is an expensive, time-consuming process
– who does it?
– what are the instructions?
– what is the level of agreement?

• TREC judgments
– depend on the task being evaluated
– generally binary
– agreement is good because of the “narrative”

Page 11

Pooling
• Exhaustive judgment of all documents in a collection is not practical
• The pooling technique is used in TREC
– top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines (or retrieval algorithms) are merged into a pool
– duplicates are removed
– documents are presented in some random order to the relevance judges
• Produces a large number of relevance judgments for each query, although still incomplete
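The pooling procedure is simple enough to sketch in a few lines. A minimal illustration in Python, assuming each run is just a ranked list of document ids; the function name, the k value, and the ids are illustrative, not TREC tooling.

```python
import random

def build_pool(runs, k=100, seed=0):
    """Merge the top-k documents from each run, drop duplicates,
    and shuffle so judges see the pool in a random order.

    runs: list of rankings, each a list of doc ids ordered by score.
    k: pool depth (TREC has used values between 50 and 200).
    """
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])           # top-k from each run
    pool = list(pool)                      # duplicates already removed by the set
    random.Random(seed).shuffle(pool)      # random presentation order for judges
    return pool

# Illustrative usage with made-up run ids
run_a = ["d3", "d1", "d7", "d2"]
run_b = ["d1", "d9", "d3", "d5"]
print(build_pool([run_a, run_b], k=3))
```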

Page 12

IR Evaluation Metrics

Page 13

Unranked retrieval evaluation: Precision and Recall

Precision: the fraction of retrieved documents that are relevant = P(relevant | retrieved)

Recall: the fraction of relevant documents that are retrieved = P(retrieved | relevant)

Precision P = tp/(tp + fp); Recall R = tp/(tp + fn)

                 Relevant    Not Relevant
Retrieved           tp            fp
Not Retrieved       fn            tn
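As a minimal sketch of these set-based definitions (illustrative code, not from the course materials), precision and recall can be computed directly from the retrieved and relevant document sets:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall.

    retrieved: set of doc ids returned by the system
    relevant:  set of doc ids judged relevant for the query
    """
    tp = len(retrieved & relevant)          # relevant and retrieved
    fp = len(retrieved - relevant)          # retrieved but not relevant
    fn = len(relevant - retrieved)          # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Illustrative example
print(precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d7"}))  # (0.5, 0.666...)
```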

Page 14

Accuracy

Given a query, the search engine classifies each document as “Relevant” or “Irrelevant”.

Accuracy of an engine: the fraction of these classifications that are correct. Accuracy = (tp + tn)/(tp + fp + tn + fn). Is this a very useful evaluation measure in IR?

                 Relevant    Not Relevant
Retrieved           tp            fp
Not Retrieved       fn            tn
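A quick worked example (numbers chosen here for illustration): if only 8 of 10,000 documents are relevant to a query, an engine that retrieves nothing at all has tp = 0, fp = 0, fn = 8, tn = 9,992, so its accuracy is (0 + 9,992)/10,000 = 99.92% even though it is useless to the searcher. Because relevant documents are almost always a tiny fraction of the collection, accuracy tells us very little in IR.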

Page 15

Why not just use accuracy?

How to build a 99.9999% accurate search engine on a low budget….

People doing information retrieval want to find something and have a certain tolerance for junk.

Search for:

0 matching results found.

Page 16

Precision and recall when ranked

Extend the set-based definitions to a ranked list: at each document in the ranked list, compute a P/R point. Which of the values computed this way are useful?

Consider a P/R point for each relevant document.
Consider values only at fixed rank cutoffs, e.g., precision at rank 20.
Consider values only at fixed recall points, e.g., precision at 20% recall; there may be more than one precision value at a recall point.

Page 17

Precision and Recall example

Page 18

Average precision of a query

Often we want a single-number effectiveness measure.

Average precision is widely used in IR: calculate it by averaging the precision values obtained each time recall increases (i.e., at each relevant document in the ranking).
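A minimal sketch of (non-interpolated) average precision for one ranked list, assuming binary relevance; the function name and example data are illustrative.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the ranks where relevant docs appear.

    ranking:  list of doc ids in ranked order
    relevant: set of all relevant doc ids for the query (including unretrieved ones)
    """
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this (recall-increasing) rank
    # Relevant documents never retrieved contribute precision 0
    return sum(precisions) / len(relevant) if relevant else 0.0

# Illustrative example: relevant docs d1, d3, d9; d9 is never retrieved
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}))  # (1 + 2/3) / 3 ≈ 0.556
```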

Page 19

Recall/precision graphs

Average precision vs. the P/R graph

AP hides information; the recall/precision graph has an odd saw-tooth shape if plotted directly; but P/R graphs are hard to compare with one another.

Page 20

Precision and Recall, toward averaging

Page 21

Averaging graphs: a false start

How can graphs be averaged? Different queries have different recall values.

What is the precision at 25% recall?

Interpolate. How?

Page 22

Interpolation of graphs

Possible interpolation methods:
No interpolation - not very useful
Connect the dots - not a function
Connect the maxima
Connect the minima
Connect the averages
…

How to handle 0% recall? Assume 0? Assume the best value? A constant start?

Page 23

How to choose?

A good retrieval system has this property: on average, as recall increases, its precision decreases.

This has been verified time and time again (on average).

Interpolate so that the function becomes monotonically decreasing.

For example: going from left to right, take as the interpolated precision the maximum precision at any recall level at or to the right:

P_interp(R) = max { P' : (R', P') ∈ S and R' ≥ R }

where S is the set of observed (R, P) points. The result is a step function.
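A small sketch of this interpolation rule (take the maximum precision at any observed recall level ≥ R); the observed points below are illustrative, not the example from the slides.

```python
def interpolate(points, recall_levels):
    """Interpolated precision: for each recall level R, the maximum precision
    observed at any recall >= R (0 if none), producing a step function.

    points: observed (recall, precision) pairs for one query
    """
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in recall_levels
    ]

# Illustrative observed points and the standard 11 recall levels
observed = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
levels = [i / 10 for i in range(11)]
print(interpolate(observed, levels))
```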

Page 24

Our example, interpolated this way

Monotonically decreasing; handles 0% recall smoothly.

Page 25

Averaging graphs: using interpolation

Asked: what is precision at 25% recall?

Interpolate values

Page 26

Averaging across queries

Averaging across multiple queries:
Micro-average – each relevant document is a point used in computing the average
Macro-average – each query is a point used in computing the average

The average of many queries' average precision values is called mean average precision (MAP); “average average precision” sounds weird.

Most common.
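A compact sketch of MAP as the macro-average of per-query average precision; the helper mirrors the earlier sketch, and the two-query example is illustrative.

```python
def avg_precision(ranking, relevant):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """MAP: each query contributes one number (its AP), then average those.

    results: list of (ranking, relevant_set) pairs, one per query
    """
    return sum(avg_precision(r, rel) for r, rel in results) / len(results)

# Illustrative two-query example
queries = [
    (["d1", "d2", "d3"], {"d1", "d3"}),   # AP = (1 + 2/3) / 2 ≈ 0.833
    (["d4", "d5"], {"d5"}),               # AP = 0.5
]
print(mean_average_precision(queries))     # ≈ 0.667
```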

Page 27

Interpolated average precision

Average precision at standard recall points: for a given query, compute a P/R point for every relevant doc, then interpolate precision at standard recall levels.

11-pt is usually 100%, 90%, 80%, …, 10%, 0% (yes, 0% recall); 3-pt is usually 75%, 50%, 25%.

Average over all queries to get the average precision at each recall level.

Average over the interpolated recall levels to get a single result, called “interpolated average precision”.

Not used much anymore; MAP (“mean average precision”) is more common, but values at specific interpolated points are still commonly used.

Page 28

Interpolation and averaging

Page 29

A combined measure: F

A combined measure of P and R: the F measure (weighted harmonic mean):

The balanced F1 measure (β = 1, i.e., α = ½) is most commonly used.

The harmonic mean is a conservative average; it heavily penalizes low values of P or R.

F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R),   where β² = (1 − α)/α

With β = 1 (balanced F1): F1 = 2·P·R / (P + R)

Page 30

Averaging F, example

Q-bad has 1 relevant document, retrieved at rank 1000: (R, P) = (1, 0.001), F value of 0.2%, so AvgF = 0.2%.

Q-perfect has 10 relevant documents, retrieved at ranks 1-10: (R, P) = (.1, 1), (.2, 1), …, (1, 1), F values of 18%, 33%, …, 100%, so AvgF = 66.2%.

Macro average: (0.2% + 66.2%) / 2 = 33.2%

Micro average: (0.2% + 18% + … + 100%) / 11 = 60.2%
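A short sketch that reproduces the Q-bad / Q-perfect numbers above using the balanced F1 formula (a sanity check in Python, not code from the course):

```python
def f1(precision, recall):
    """Balanced F measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Q-bad: one relevant doc retrieved at rank 1000 -> (R, P) = (1, 0.001)
avg_f_bad = f1(0.001, 1.0)                                   # ≈ 0.002 (0.2%)

# Q-perfect: 10 relevant docs at ranks 1-10 -> (R, P) = (k/10, 1) for k = 1..10
f_perfect = [f1(1.0, k / 10) for k in range(1, 11)]          # 18%, 33%, ..., 100%
avg_f_perfect = sum(f_perfect) / len(f_perfect)              # ≈ 0.662

macro = (avg_f_bad + avg_f_perfect) / 2                      # ≈ 0.332
micro = (avg_f_bad + sum(f_perfect)) / 11                    # ≈ 0.602
print(round(macro, 3), round(micro, 3))
```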

Page 31

Focusing on Top Documents

• Users tend to look at only the top part of the ranked result list to find relevant documents
• Some search tasks have only one relevant document
– e.g., navigational search, question answering
• Recall not appropriate
– instead need to measure how well the search engine does at retrieving relevant documents at very high ranks

Page 32

Focusing on Top Documents

• Precision at N
• Precision at rank R
– R typically 5, 10, 20
– easy to compute, average, understand
– not sensitive to rank positions less than R
• Reciprocal Rank
– reciprocal of the rank at which the first relevant document is retrieved
– Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries
– very sensitive to rank position
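Minimal sketches of precision at rank R and MRR, assuming binary relevance; the names and example data are illustrative.

```python
def precision_at(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def mean_reciprocal_rank(results):
    """Average over queries of 1 / (rank of the first relevant document);
    a query with no relevant document retrieved contributes 0."""
    total = 0.0
    for ranking, relevant in results:
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(results)

# Illustrative usage
runs = [(["d2", "d1", "d5"], {"d1"}), (["d7", "d8"], {"d7"})]
print(precision_at(["d2", "d1", "d5"], {"d1"}, k=2))   # 0.5
print(mean_reciprocal_rank(runs))                      # (1/2 + 1) / 2 = 0.75
```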

Page 33

Discounted Cumulative Gain

• Popular measure for evaluating web search and related tasks
• Two assumptions:
– highly relevant documents are more useful than marginally relevant documents
– the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined

Page 34

Discounted Cumulative Gain

• Uses graded relevance as a measure of the usefulness, or gain, from examining a document

• Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks

• Typical discount is 1/log(rank)
– with base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3

Page 35

Discounted Cumulative Gain

• DCG is the total gain accumulated at a particular rank p:
DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i)

• Alternative formulation:
DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log2(1 + i)
– used by some web search companies
– emphasis on retrieving highly relevant documents

Page 36

DCG Example

• 10 ranked documents judged on 0-3 relevance scale: 3, 2, 3, 0, 0, 1, 2, 2, 3, 0

• discounted gain: 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0

• DCG: 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
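A small sketch of the DCG computation used in this example (gain at rank 1 undiscounted, later gains divided by log2 of the rank); it reproduces the numbers above.

```python
import math

def dcg(gains):
    """Cumulative DCG at each rank: rank 1 is not discounted,
    later ranks are discounted by 1 / log2(rank)."""
    total, out = 0.0, []
    for rank, rel in enumerate(gains, start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
        out.append(round(total, 2))
    return out

print(dcg([3, 2, 3, 0, 0, 1, 2, 2, 3, 0]))
# [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
```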

Page 37

Normalized DCG

• DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking
– makes averaging easier for queries with different numbers of relevant documents

Page 38

NDCG Example

• Perfect ranking: 3, 3, 3, 2, 2, 2, 1, 0, 0, 0
• ideal DCG values: 3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
• NDCG values (divide actual by ideal): 1, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
– NDCG ≤ 1 at any rank position
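A sketch of NDCG with the same discounting: divide the DCG at each rank by the DCG of the ideal ordering of the same judgments. It reproduces the values above (up to rounding).

```python
import math

def dcg_at_each_rank(gains):
    total, out = 0.0, []
    for rank, rel in enumerate(gains, start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
        out.append(total)
    return out

def ndcg(gains):
    """NDCG at each rank: actual DCG divided by the DCG of the ideal ordering."""
    ideal = sorted(gains, reverse=True)                 # perfect ranking of the same judgments
    actual_dcg = dcg_at_each_rank(gains)
    ideal_dcg = dcg_at_each_rank(ideal)
    return [round(a / i, 2) if i > 0 else 0.0 for a, i in zip(actual_dcg, ideal_dcg)]

print(ndcg([3, 2, 3, 0, 0, 1, 2, 2, 3, 0]))
# [1.0, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88]
```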

Page 39

Using Preferences

• Two rankings described using preferences can be compared using the Kendall tau coefficient (τ):
τ = (P − Q) / (P + Q)
– P is the number of preferences that agree and Q is the number that disagree

• For preferences derived from binary relevance judgments, can use BPREF
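A small sketch that counts agreeing (P) and disagreeing (Q) pairs between two rankings and returns τ = (P − Q)/(P + Q); the rankings are illustrative.

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same items.
    P = pairs ordered the same way in both rankings, Q = pairs ordered differently."""
    pos_a = {doc: i for i, doc in enumerate(ranking_a)}
    pos_b = {doc: i for i, doc in enumerate(ranking_b)}
    P = Q = 0
    for x, y in combinations(ranking_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            P += 1          # pair ordered the same way in both rankings
        else:
            Q += 1          # pair ordered differently
    return (P - Q) / (P + Q)

# Illustrative example: one adjacent swap among 4 items -> tau = (5 - 1) / 6 ≈ 0.67
print(kendall_tau(["d1", "d2", "d3", "d4"], ["d1", "d3", "d2", "d4"]))
```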

Page 40

Testing and Statistics

Page 41

Significance Tests

• Given the results from a number of queries, how can we conclude that ranking algorithm B is better than algorithm A?
• A significance test enables us to reject the null hypothesis (no difference) in favor of the alternative hypothesis (B is better than A)
– the power of a test is the probability that the test will reject the null hypothesis correctly
– increasing the number of queries in the experiment also increases the power of the test

Page 42

One-Sided Test

• Distribution for the possible values of a test statistic assuming the null hypothesis

• shaded area is region of rejection

Page 43

Example Experimental Results

Page 44

t-Test

• Assumption is that the difference between the effectiveness values is a sample from a normal distribution

• Null hypothesis is that the mean of the distribution of differences is zero

• Test statistic:
t = d̄ / (σ_d / √N), where d̄ is the mean of the per-query differences (B − A), σ_d is their standard deviation, and N is the number of queries
– for the example, t is computed from the per-query differences in the Example Experimental Results table
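A sketch of the paired t statistic on hypothetical per-query differences (B − A); the numbers below are placeholders, not the data from the example table on the earlier slide.

```python
import math

def paired_t(differences):
    """t = mean(d) / (std(d) / sqrt(N)) for per-query differences d = B - A."""
    n = len(differences)
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)   # sample variance
    return mean / (math.sqrt(variance) / math.sqrt(n))

# Hypothetical per-query MAP differences between systems B and A
diffs = [0.02, 0.05, -0.01, 0.03, 0.04, 0.00, 0.06, -0.02, 0.01, 0.05]
print(round(paired_t(diffs), 2))
```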

Page 45

Sign Test

• Ignores magnitude of differences
• Null hypothesis for this test is that
– P(B > A) = P(A > B) = ½
– the number of pairs where B is “better” than A would be the same as the number of pairs where A is “better” than B
• Test statistic is the number of pairs where B > A
• For the example data,
– test statistic is 7, p-value = 0.17
– cannot reject the null hypothesis
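A sketch of the one-sided sign-test p-value from the binomial distribution; assuming 10 queries with B better on 7 of them (the query count is an assumption here), this reproduces the p-value of 0.17.

```python
from math import comb

def sign_test_p(successes, n):
    """One-sided p-value: probability of >= `successes` wins out of n
    under the null hypothesis P(B > A) = 1/2."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

print(round(sign_test_p(7, 10), 2))   # 0.17, so the null hypothesis is not rejected
```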

Page 46

Online Testing

• Test (or even train) using live traffic on a search engine

• Benefits:
– real users, less biased, large amounts of test data
• Drawbacks:
– noisy data, can degrade the user experience

• Often done on small proportion (1-5%) of live traffic

Page 47

Summary of this lecture

IR evaluation: Precision, Recall, F; interpolation; MAP, interpolated AP; P@N, P@R; DCG, NDCG; BPREF

Page 48

Thank You!

Q&A

Page 49

Readings

[1] IIR Ch. 1, Ch. 6.2, Ch. 6.3, Ch. 8.1, 8.2, 8.3, 8.4
[2] J. M. Ponte and W. B. Croft, “A Language Modeling Approach to Information Retrieval,” in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia: ACM Press, 1998.

Page 50

Exercise 8-9 [**] In a collection of 10,000 documents, a query has 8 relevant documents in total. Below are the judgments for the top 20 ranked results returned by some system for this query (R = relevant, N = not relevant); 6 of them are relevant:

RRNNN NNNRN RNNNR NNNNR

a. What is the precision of the first 20 documents?
b. What is the F1 over the first 20 documents?
c. What is the interpolated precision at the 25% recall level?
d. What is the interpolated precision at the 33% recall level?
e. Compute the MAP value.

Page 51

#2. Evaluation

Define the precision-recall graph as follows: for a query's ranked result list, compute a precision/recall point at each returned document; the graph consists of these points.

On this graph, define the breakeven point as the point where precision and recall are equal.

Question: can a graph have more than one breakeven point? If so, give an example; if not, prove it.