TRANSCRIPT
Evaluation of IR Systems
Adapted from Lectures by
Prabhakar Raghavan (Google) and
Christopher Manning (Stanford)
This lecture
Results summaries: making our good results presentable and usable to a user
How do we know if our results are any good? Evaluating a search engine:
Benchmarks
Precision and recall
Result Summaries
Having ranked the documents matching a query, we wish to present a results list.
Most commonly, a list of document titles plus a short summary, aka “10 blue links”.
Summaries
The title is typically automatically extracted from document metadata. What about the summaries?
This description is crucial: the user identifies good/relevant hits based on the description.
Two basic kinds:
A static summary of a document is always the same, regardless of the query that hit the doc.
A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query.
Static summaries
In typical systems, the static summary is a subset of the document.
Simplest heuristic: the first 50 (or so – this can be varied) words of the document. Summary cached at indexing time.
More sophisticated: extract from each document a set of "key" sentences. Simple NLP heuristics score each sentence; the summary is made up of the top-scoring sentences.
Most sophisticated: NLP used to synthesize a summary. Seldom used in IR (cf. text summarization work).
Dynamic summaries
Present one or more "windows" within the document that contain several of the query terms.
"KWIC" snippets: Keyword-in-Context presentation, generated in conjunction with scoring.
If the query is found as a phrase, show all or some occurrences of the phrase in the doc.
If not, show doc windows that contain multiple query terms.
The summary itself gives the entire content of the window – all terms, not only the query terms.
Generating dynamic summaries
If we have only a positional index, we cannot (easily) reconstruct the context window surrounding hits.
If we cache the documents at index time, we can find windows in them, cueing from hits found in the positional index.
E.g., the positional index says "the query is a phrase at position 4378", so we go to this position in the cached document and stream out the content.
Most often, only a fixed-size prefix of the doc is cached. Note: the cached copy can be outdated.
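A minimal sketch of this idea in Python. The function name, the whitespace tokenization, and parameters like window_radius are illustrative assumptions, not the method of any particular engine:

```python
def kwic_snippet(cached_text, hit_positions, window_radius=5, max_windows=2):
    """Build a dynamic summary from a cached document (often just a prefix).

    cached_text   -- document text cached at index time
    hit_positions -- term positions reported by the positional index
    window_radius -- how many terms of context to keep on each side of a hit
    """
    terms = cached_text.split()          # crude tokenization, for illustration only
    windows = []
    for pos in hit_positions[:max_windows]:
        if pos >= len(terms):            # hit lies beyond the cached prefix
            continue
        start = max(0, pos - window_radius)
        end = min(len(terms), pos + window_radius + 1)
        windows.append("... " + " ".join(terms[start:end]) + " ...")
    return " ".join(windows)

# Example: the index reports phrase hits at positions 7 and 42
# print(kwic_snippet(cached_doc, [7, 42]))
```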
Dynamic summaries
Producing good dynamic summaries is a tricky optimization problem.
The real estate for the summary is normally small and fixed.
We want snippets that are long enough to be useful, linguistically well-formed, and maximally informative about the doc.
But users really like snippets, even if they complicate IR system design.
Alternative results presentations?
An active area of HCI research.
One alternative: http://www.searchme.com copies the idea of Apple's Cover Flow for search results.
Evaluating search engines
Measures for a search engine

How fast does it index? e.g., number of bytes per hour
How fast does it search? e.g., latency as a function of queries per second
What is the cost per query? in dollars
All of the preceding criteria are measurable: we can quantify speed / size / money.
However, the key measure for a search engine is user happiness.
Data Retrieval vs Information Retrieval
DR performance evaluation (after establishing correctness): response time, index space, …
IR performance evaluation: How relevant is the answer set? How happy are the users? (Required to establish "functional correctness", e.g., through benchmarks.)
Measures for a search engine
What is user happiness? Factors include:
Speed of response
Size of index
Uncluttered UI
Most important: relevance (actually, maybe even more important: it's free)
None of these is sufficient: blindingly fast but useless answers won't make a user happy.
How can we quantify user happiness?
Measuring user happiness: Who is the user?
Web search engine: searcher. Success: searcher finds what was looked for. Measure: rate of return to this search engine.
Web search engine: advertiser. Success: searcher clicks on ad. Measure: clickthrough rate.
Ecommerce: buyer. Success: buyer buys something. Measures: time to purchase, fraction of "conversions" of searchers to buyers.
Ecommerce: seller. Success: seller sells something. Measure: profit per item sold.
Enterprise: CEO. Success: employees are more productive (because of effective search). Measure: profit of the company.
Happiness: elusive to measure
Most common proxy: relevance of search results
Standard Methodology in IR: Relevance measurement requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. An assessment of relevance for each query-document pair
Some work uses binary relevance; other work uses multi-valued relevance (or partial orders).
Evaluating an IR system
Note: the information need is translated into a query.
Relevance is assessed relative to the information need, not the query.
E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing heart attack risks than white wine.
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it has these words.
Evaluating an IR system
Information need i : “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”
Query q: [red wine white wine heart attack]
Consider document d′: At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.
d′ is an excellent match for query q, but d′ is not relevant to the information need i.
Difficulties with gauging Relevancy
Relevancy, from a human standpoint, is:
Subjective: depends upon a specific user's judgment.
Situational: relates to the user's current needs.
Cognitive: depends on human perception and behavior.
Dynamic: changes over time.
Standard relevance benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters and other benchmark doc collections are also used.
"Retrieval tasks" are specified, sometimes as queries.
Human experts mark, for each query and for each doc, Relevant or Nonrelevant – or at least for the subset of docs that some system returned for that query.
Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
                 Relevant              Nonrelevant
Retrieved        tp (true positive)    fp (false positive)
Not retrieved    fn (false negative)   tn (true negative)
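A minimal Python sketch of these definitions; the arguments are simply the contingency-table counts, and the example numbers are the ones used in the worked F-measure example later in these slides:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from contingency-table counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # fraction of retrieved docs that are relevant
    recall    = tp / (tp + fn) if (tp + fn) else 0.0   # fraction of relevant docs that are retrieved
    return precision, recall

# Example counts from the "F: Example" slide: tp=20, fp=40, fn=60
print(precision_recall(20, 40, 60))   # (0.333..., 0.25)
```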
Precision and Recall
Precision and Recall in Practice
Precision: the ability to retrieve top-ranked documents that are mostly relevant; the fraction of the retrieved documents that are relevant.
Recall: the ability of the search to find all of the relevant items in the corpus; the fraction of the relevant documents that are retrieved.
Accuracy

Why do we use complex measures like precision, recall, etc.? Why not something simple like accuracy?
Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct.
In terms of the contingency table above, accuracy = (TP + TN)/(TP + FP + FN + TN).
Why is accuracy not a useful measure for web information retrieval?
Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget…

Search for:
0 matching results found.

(Return nothing for every query: since almost all docs are nonrelevant, such decisions are almost always "correct".)
People doing information retrieval want to find something and have a certain tolerance for junk.
Precision/Recall
You can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved.
In a good system, precision decreases as either the number of docs retrieved or recall increases. This is not a theorem, but a result with strong empirical confirmation.
Trade-offs
(Figure: the precision vs. recall trade-off. The ideal system has both high precision and high recall; one extreme returns relevant documents but misses many useful ones too, while the other returns most relevant documents but includes a lot of junk.)
Difficulties in using precision/recall
Should average over large document collection/query ensembles.
Need human relevance assessments; people aren't reliable assessors.
Assessments have to be binary – what about nuanced assessments?
Heavily skewed by collection/authorship; results may not translate from one domain to another.
A combined measure: F
The combined measure that assesses the precision/recall tradeoff is the F measure (harmonic mean):

F = 2PR/(P + R) = 2 / ((1/P) + (1/R))

The harmonic mean is a conservative average. See C. J. van Rijsbergen, Information Retrieval.
Aka E Measure (parameterized F Measure)
Variants of the F measure allow weighting emphasis on precision over recall:

E = (β² + 1)PR / (β²P + R)

The value of β controls the trade-off:
β = 1: equally weight precision and recall (E = F).
β > 1: weight recall more.
β < 1: weight precision more.
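A small Python sketch of this parameterized measure (the function name f_beta is illustrative; beta = 1 gives the ordinary harmonic-mean F):

```python
def f_beta(precision, recall, beta=1.0):
    """Parameterized F: beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Example values from the "F: Example" slide below: P = 1/3, R = 1/4
print(round(f_beta(1/3, 1/4), 3))   # 0.286 (the harmonic mean of 1/3 and 1/4)
```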
F1 and other averages
(Figure: "Combined Measures" – minimum, maximum, arithmetic mean, geometric mean, and harmonic mean plotted against precision, with recall fixed at 70%.)
F: Example
                relevant    not relevant    total
retrieved       20          40              60
not retrieved   60          1,000,000       1,000,060
total           80          1,000,040       1,000,120

P = 20/(20 + 40) = 1/3
R = 20/(20 + 60) = 1/4
F1 = 2PR/(P + R) = 2/7 ≈ 0.29
Exercise

Compute precision, recall and F1 for this result set:

                relevant    not relevant
retrieved       18          2
not retrieved   82          1,000,000,000
Recall vs Precision and F1

(Figure: precision and F1 plotted as functions of recall, all on a 0 to 1 scale.)
Breakeven Point
Breakeven point is the point where precision equals recall.
Alternative single measure of IR effectiveness.
How do you compute it?
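One way to answer that, sketched in Python under the assumption that we have binary relevance flags for a ranked result list: walk down the ranking and report the (recall, precision) point closest to the P = R diagonal. The helper name is illustrative.

```python
def breakeven_point(ranked_relevance, total_relevant):
    """Return the (recall, precision) pair closest to precision == recall.

    ranked_relevance -- list of 0/1 flags, one per rank, 1 if that doc is relevant
    total_relevant   -- total number of relevant docs in the collection
    """
    best, best_gap, hits = (0.0, 0.0), float("inf"), 0
    for rank, is_rel in enumerate(ranked_relevance, start=1):
        hits += is_rel
        precision = hits / rank
        recall = hits / total_relevant
        if abs(precision - recall) < best_gap:
            best_gap = abs(precision - recall)
            best = (recall, precision)
    return best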
Evaluating ranked results
Precision/recall/F are measures for unranked sets.
We can easily turn set measures into measures of ranked lists.
Just compute the set measure for each “prefix”: the top 1, top 2, top 3, top 4 etc. results
Doing this for precision and recall gives you a precision-recall curve.
Computing Recall/Precision Points: An Example

 n    doc #    relevant
 1    588      x
 2    589      x
 3    576
 4    590      x
 5    986
 6    592      x
 7    984
 8    988
 9    578
10    985
11    103
12    591
13    772      x
14    990

Let the total # of relevant docs = 6. Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/2 = 1
R = 3/6 = 0.5;    P = 3/4 = 0.75
R = 4/6 = 0.667;  P = 4/6 = 0.667
R = 5/6 = 0.833;  P = 5/13 = 0.38
One relevant document is missing, so we never reach 100% recall.
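The same bookkeeping as a Python sketch; the relevance flags below are the ones from the table above, and the helper name rp_points is illustrative:

```python
def rp_points(ranked_relevance, total_relevant):
    """Yield a (recall, precision) point each time a new relevant doc is retrieved."""
    hits = 0
    for rank, is_rel in enumerate(ranked_relevance, start=1):
        if is_rel:
            hits += 1
            yield hits / total_relevant, hits / rank

flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]   # ranks 1..14 from the example
for r, p in rp_points(flags, total_relevant=6):
    print(f"R={r:.3f}  P={p:.3f}")
# R=0.167 P=1.000, R=0.333 P=1.000, R=0.500 P=0.750, R=0.667 P=0.667, R=0.833 P=0.385
```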
A precision-recall curve

(Figure: a precision-recall curve – precision against recall, both axes from 0.0 to 1.0.)
Averaging over queries
A precision-recall graph for one query isn't a very sensible thing to look at.
You need to average performance over a whole bunch of queries.
But there's a technical issue: precision-recall calculations place only some points on the graph. How do you determine a value (interpolate) between the points?
Interpolated precision
Idea: If locally precision increases with increasing recall, then you should get to count that…
So you take the max of the precisions to the right of that recall value.
Interpolating a Recall/Precision Curve
Interpolate a precision value for each standard recall level: r_j ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
r_0 = 0.0, r_1 = 0.1, …, r_10 = 1.0
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level at or above the j-th level:

P_interp(r_j) = max over r ≥ r_j of P(r)
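A minimal sketch of this interpolation in Python, reusing the (recall, precision) points computed in the earlier example; the 11 standard recall levels are the ones listed above:

```python
def interpolated_precision(points, levels=None):
    """points: list of (recall, precision) pairs for one query.
    Return interpolated precision at each standard recall level:
    the max precision at any recall >= that level (0 if there is none)."""
    if levels is None:
        levels = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

# With the earlier example's points, interpolated precision is 1.0 up to recall 0.3,
# 0.75 at 0.4/0.5, 0.667 at 0.6, 0.385 at 0.7/0.8, and 0.0 at 0.9/1.0.
```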
Interpolated precision-recall curve

(Figure: an interpolated precision-recall curve – precision against recall, both axes from 0.0 to 1.0.)
Evaluation Metrics (cont’d)
Graphs are good, but people want summary measures!
Precision at fixed retrieval level:
Precision-at-k: precision of the top k results.
Perhaps appropriate for web search: all people want are good matches on the first one or two results pages.
But it averages badly and has an arbitrary parameter k.
11-point interpolated average precision:
The standard measure in the early TREC competitions: take the interpolated precision at 11 recall levels, from 0 to 1 by tenths (the value for recall 0 is always interpolated), and average them.
Evaluates performance at all recall levels.
Typical (good) 11-point precisions

(Figure: the SabIR/Cornell 8A1 11-point interpolated precision-recall curve from TREC 8, 1999.)
11-point interpolated average precision
Recall    Interpolated Precision
0.0       1.00
0.1       0.67
0.2       0.63
0.3       0.55
0.4       0.45
0.5       0.41
0.6       0.36
0.7       0.29
0.8       0.13
0.9       0.10
1.0       0.08

How can precision at recall 0.0 be > 0?
11-point average: ≈ 0.425
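A quick check of that average in Python, using the interpolated values from the table:

```python
interp = [1.00, 0.67, 0.63, 0.55, 0.45, 0.41, 0.36, 0.29, 0.13, 0.10, 0.08]
eleven_point_ap = sum(interp) / len(interp)
print(round(eleven_point_ap, 3))   # 0.425
```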
11-point precisions

(Figure: an 11-point precision-recall curve, precision against recall in percent.)
Receiver Operating Characteristics (ROC) Curve
True positive rate = tp/(tp + fn) = recall = sensitivity
False positive rate = fp/(tn + fp). Related to precision: fpr = 0 corresponds to p = 1.
Why is the blue line "worthless"?
Variance of measures like precision/recall
For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.1).
Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query.
That is, there are easy information needs and hard ones.
Mean average precision (MAP)
Average precision (AP) for a query: the average of the precision values at the rank of each of the (top k) relevant documents retrieved.
This approach weights early appearance of a relevant document over later appearance.
MAP for a query collection is the mean (arithmetic average) of AP over the queries.
Macro-averaging: each query counts equally.
Mean Average Precision (MAP)

Summarizes rankings from multiple queries by taking the mean (average) of average precision.
The most commonly used measure in research papers.
Assumes the user is interested in finding many relevant documents for each query.
Requires many binary relevance judgments in the text collection.
Summarize a Ranking: MAP
Given that n docs are retrieved:
Compute the precision at the rank where each (new) relevant document is retrieved => p(1), …, p(k), if we have k relevant docs.
E.g., if the first relevant doc is at rank 2, then p(1) = 1/2.
If a relevant document never gets retrieved, we take the precision corresponding to that relevant doc to be zero.
Compute the average over all the relevant documents: average precision = (p(1) + … + p(k))/k.
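A sketch of average precision and MAP in Python, following the recipe above; the function names are illustrative, and relevant docs that are never retrieved contribute zero:

```python
def average_precision(ranked_relevance, total_relevant):
    """Non-interpolated AP: mean of precision at each relevant doc's rank.
    Relevant docs that are never retrieved contribute precision 0."""
    hits, precisions = 0, []
    for rank, is_rel in enumerate(ranked_relevance, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant   # unretrieved relevant docs count as 0

def mean_average_precision(runs):
    """runs: list of (ranked_relevance, total_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)
```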
Summarize a Ranking: MAP (cont'd)
This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document.
MAP: the arithmetic mean of average precision over a set of topics.
gMAP: the geometric mean of average precision over a set of topics (more affected by difficult topics).
Discounted Cumulative Gain
Popular measure for evaluating web search and related tasks.
Two assumptions:
Highly relevant documents are more useful than marginally relevant documents.
The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined.
Discounted Cumulative Gain
Uses graded relevance as a measure of usefulness, or gain, from examining a document.
Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks.
Typical discount is 1/log(rank). With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3.
Summarize a Ranking: DCG
What if relevance judgments are on a scale of [1, r], r > 2?
Cumulative Gain (CG) at rank n: let the ratings of the n documents be r1, r2, …, rn (in ranked order); CG = r1 + r2 + … + rn.
Discounted Cumulative Gain (DCG) at rank n: DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n).
We may use any base for the logarithm, e.g., base b.
Discounted Cumulative Gain

DCG is the total gain accumulated at a particular rank p:

DCG_p = rel_1 + Σ (i = 2..p) rel_i / log2(i)

An alternative formulation:

DCG_p = Σ (i = 1..p) (2^rel_i − 1) / log2(1 + i)

is used by some web search companies; it puts emphasis on retrieving highly relevant documents.
DCG Example
10 ranked documents judged on a 0-3 relevance scale: 3, 2, 3, 0, 0, 1, 2, 2, 3, 0
Discounted gain: 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
DCG (cumulative): 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
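The same computation as a short Python sketch, using the first DCG formulation above (a log base-2 discount starting at rank 2):

```python
import math

def dcg(relevances):
    """Running DCG: rel_1, then rel_i / log2(i) added for ranks i >= 2."""
    total, out = 0.0, []
    for rank, rel in enumerate(relevances, start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
        out.append(round(total, 2))
    return out

print(dcg([3, 2, 3, 0, 0, 1, 2, 2, 3, 0]))
# [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
```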
Summarize a Ranking: NDCG
Normalized Discounted Cumulative Gain (NDCG) at rank n: normalize DCG at rank n by the DCG value at rank n of the ideal ranking.
The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, etc.
NDCG is now quite popular in evaluating web search.
NDCG - Example
 i   Ground Truth     Ranking Function 1   Ranking Function 2
     doc    r_i       doc    r_i           doc    r_i
 1   d4     2         d3     2             d3     2
 2   d3     2         d4     2             d2     1
 3   d2     1         d2     1             d4     2
 4   d1     0         d1     0             d1     0

DCG_GT  = 2 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 4.6309
DCG_RF1 = 2 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 4.6309
DCG_RF2 = 2 + 1/log2(2) + 2/log2(3) + 0/log2(4) = 4.2619
MaxDCG = DCG_GT = 4.6309

NDCG_GT = 1.00    NDCG_RF1 = 1.00    NDCG_RF2 = 4.2619/4.6309 = 0.9203
NDCG (at 4) - Example

4 documents: d1, d2, d3, d4
Graded ranking/ordering (relevance grades in rank order): 4, 2, 0, 1
DCG  = 4 + 2/log(2) + 0/log(3) + 1/log(4) = 6.5
IDCG = 4 + 2/log(2) + 1/log(3) + 0/log(4) = 6.63
NDCG = DCG/IDCG = 6.5/6.63 = 0.98
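A compact NDCG sketch in Python that reproduces the second example; as in the formulation above, the rank-1 gain is not discounted, and the function names are illustrative:

```python
import math

def dcg_at_k(grades, k):
    """DCG with the rel_1 + sum(rel_i / log2(i)) formulation."""
    total = 0.0
    for rank, rel in enumerate(grades[:k], start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
    return total

def ndcg_at_k(grades, k):
    ideal = sorted(grades, reverse=True)      # ideal ranking: highest grades first
    return dcg_at_k(grades, k) / dcg_at_k(ideal, k)

print(round(ndcg_at_k([4, 2, 0, 1], 4), 2))   # 0.98
```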
R-Precision
Precision at the R-th position in the ranking of results for a query that has R relevant documents.
Using the earlier example ranking (relevant docs at ranks 1, 2, 4, 6, and 13):

R = # of relevant docs = 6
R-Precision = precision at rank 6 = 4/6 = 0.67
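A small sketch, reusing the relevance flags from the earlier example:

```python
def r_precision(ranked_relevance, total_relevant):
    """Precision at rank R, where R is the number of relevant docs for the query."""
    top_r = ranked_relevance[:total_relevant]
    return sum(top_r) / total_relevant

print(round(r_precision([1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], 6), 2))   # 0.67
```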
Test Collections
Creating Test Collections for IR Evaluation
What we need for a benchmark

A collection of documents. Documents must be representative of the documents we expect to see in reality.
A collection of information needs . . . which we will often incorrectly refer to as queries. Information needs must be representative of the information needs we expect to see in reality.
Human relevance assessments. We need to hire/pay "judges" or assessors to do this. Expensive, time-consuming. Judges must be representative of the users we expect to see in reality.
Standard relevance benchmark: Cranfield
Pioneering: the first testbed allowing precise quantitative measures of information retrieval effectiveness.
Late 1950s, UK.
1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all query-document pairs.
Too small and too untypical for serious IR evaluation today.
Standard relevance benchmark: TREC

TREC = Text REtrieval Conference, organized by the U.S. National Institute of Standards and Technology (NIST).
TREC is actually a set of several different relevance benchmarks.
Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999.
1.89 million documents, mainly newswire articles; 450 information needs.
No exhaustive relevance judgments – too expensive.
Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned by some system entered in the TREC evaluation for which the information need was developed.
Standard relevance benchmarks: Others

GOV2: another TREC/NIST collection; 25 million web pages.
Used to be the largest collection that is easily available, but still 3 orders of magnitude smaller than what Google/Yahoo/MSN index.
NTCIR: East Asian language and cross-language information retrieval.
Cross Language Evaluation Forum (CLEF): this evaluation series has concentrated on European languages and cross-language information retrieval.
Many others.
Validity of relevance assessments
Relevance assessments are only usable if they are consistent.
If they are not consistent, then there is no “truth” and experiments are not repeatable.
How can we measure this consistency or agreement among judges?
→ Kappa measure
Kappa measure for inter-judge (dis)agreement
Kappa measure: an agreement measure among judges, designed for categorical judgments; it corrects for chance agreement.
P(A) – proportion of the time the judges agree
P(E) – what agreement would be by chance
Kappa = (P(A) – P(E)) / (1 – P(E))
Kappa = 0 for chance agreement, 1 for total agreement.
Kappa Measure: Example
Number of docs Judge 1 Judge 2
300 Relevant Relevant
70 Nonrelevant Nonrelevant
20 Relevant Nonrelevant
10 Nonrelevant Relevant
P(A)? P(E)?
Kappa Example
P(A) = 370/400 = 0.925
P(nonrelevant) = (10 + 20 + 70 + 70)/800 = 0.2125
P(relevant) = (10 + 20 + 300 + 300)/800 = 0.7875
P(E) = 0.2125^2 + 0.7875^2 = 0.665
Kappa = (0.925 – 0.665)/(1 – 0.665) = 0.776

Kappa > 0.8: good agreement.
0.67 < Kappa < 0.8: "tentative conclusions" (Carletta '96). Depends on the purpose of the study.
For > 2 judges: average pairwise kappas.
Kappa Example: Alternative view
Both judges score nonrelevant randomly: 80/400 * 90/400 = 0.045
Both judges score relevant randomly: 320/400 * 310/400 = 0.62
Both judges agree: 0.045 + 0.62 = 0.665
Both judges disagree: 320/400 * 90/400 + 310/400 * 80/400 = 0.18 + 0.155 = 0.335
P(E) = 0.665 / (0.665 + 0.335) = 0.665
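A short Python sketch of this kappa computation from a 2x2 judge-agreement table, using the pooled-marginal chance estimate from the example above (the function name and argument names are illustrative):

```python
def kappa_two_judges(agree_rel, agree_nonrel, j1_rel_j2_non, j1_non_j2_rel):
    """Kappa for two judges making binary relevance judgments (pooled marginals)."""
    n = agree_rel + agree_nonrel + j1_rel_j2_non + j1_non_j2_rel
    p_agree = (agree_rel + agree_nonrel) / n
    # Pooled marginals: fraction of all 2n judgments that were "relevant"
    p_rel = (2 * agree_rel + j1_rel_j2_non + j1_non_j2_rel) / (2 * n)
    p_nonrel = 1 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa_two_judges(300, 70, 20, 10), 3))   # 0.776
```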
Evaluation at large search engines
Search engines have test collections of queries and hand-ranked results.
Recall is difficult to measure on the web.
Search engines often use precision at top k, e.g., k = 10, or measures that reward you more for getting rank 1 right than for getting rank 10 right, such as NDCG (Normalized Discounted Cumulative Gain).
Search engines also use non-relevance-based measures:
Clickthrough on first result – not very reliable if you look at a single clickthrough, but pretty reliable in the aggregate.
Studies of user behavior in the lab.
A/B testing.
A/B testing
Purpose: test a single innovation.
Prerequisite: you have a large search engine up and running.
Have most users use the old system.
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" measure like clickthrough on first result.
Now we can directly see if the innovation improves user happiness.
Probably the evaluation methodology that large search engines trust most.
SKIP DETAILS
Other Evaluation Measures
Adapted from Slides Attributed to
Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
Fallout Rate
Problems with both precision and recall:
The number of irrelevant documents in the collection is not taken into account.
Recall is undefined when there is no relevant document in the collection.
Precision is undefined when no document is retrieved.

Fallout = (no. of nonrelevant items retrieved) / (total no. of nonrelevant items in the collection)
Subjective Relevance Measure
Novelty Ratio: the proportion of items retrieved and judged relevant by the user of which they were previously unaware. Measures the ability to find new information on a topic.
Coverage Ratio: the proportion of relevant items retrieved out of the total relevant documents known to the user prior to the search. Relevant when the user wants to locate documents they have seen before (e.g., the budget report for Year 2000).
Other Factors to Consider
User effort: Work required from the user in formulating queries, conducting the search, and screening the output.
Response time: Time interval between receipt of a user query and the presentation of system responses.
Form of presentation: Influence of search output format on the user’s ability to utilize the retrieved materials.
Collection coverage: Extent to which any/all relevant items are included in the document corpus.
Early Test Collections

Previous experiments were based on the SMART collection, which is fairly small. (ftp://ftp.cs.cornell.edu/pub/smart)

Collection    Number Of    Number Of    Raw Size
Name          Documents    Queries      (Mbytes)
CACM          3,204        64           1.5
CISI          1,460        112          1.3
CRAN          1,400        225          1.6
MED           1,033        30           1.1
TIME          425          83           1.5

Different researchers used different test collections and evaluation techniques.
Critique of pure relevance
Relevance vs. marginal relevance:
A document can be redundant even if it is highly relevant – duplicates, or the same information from different sources.
Marginal relevance is a better measure of utility for the user.
Using facts/entities as evaluation units more directly measures true relevance, but it is harder to create the evaluation set.