TRANSCRIPT
Search and Retrieval: Relevance and Evaluation
Prof. Marti Hearst
SIMS 202, Lecture 20
Finding Out About

Three phases:
- Asking of a question
- Construction of an answer
- Assessment of the answer

Part of an iterative process
Today

- Relevance
- Evaluation of IR Systems
  - Precision vs. Recall
  - Cutoff Points
  - Test Collections/TREC
  - Blair & Maron Study
Evaluation

- Why Evaluate?
- What to Evaluate?
- How to Evaluate?
Why Evaluate?

- Determine if the system is desirable
- Make comparative assessments
- Others?
What to Evaluate?

- How much learned about the collection
- How much learned about a topic
- How much of the information need is satisfied
- How inviting the system is
What to Evaluate?

What can be measured that reflects users' ability to use the system? (Cleverdon 66)
- Coverage of Information
- Form of Presentation
- Effort required/Ease of Use
- Time and Space Efficiency
- Recall: proportion of relevant material actually retrieved
- Precision: proportion of retrieved material actually relevant

(Recall and Precision together measure effectiveness.)
Assessing the Answer

How well does it answer the question?
- Complete answer? Partial?
- Background information?
- Hints for further exploration?

How relevant is it to the user?
Relevance

- Subjective
- Measurable to some extent: how often do people agree a document is relevant to a query?
Relevance

In what ways can a document be relevant to a query?
- Answer precise question precisely.
- Partially answer question.
- Suggest a source for more information.
- Give background information.
- Remind the user of other knowledge.
- Others ...
Retrieved vs. Relevant Documents

[Venn diagrams, built up over several slides: the set of retrieved documents overlaps the set of relevant documents. High precision: most of the retrieved set falls within the relevant set. High recall: most of the relevant set falls within the retrieved set.]

Recall = |RelRet| / |Rel|
Precision = |RelRet| / |Ret|
Why Precision and Recall?

Get as much good stuff as possible while at the same time getting as little junk as possible.
Standard IR Evaluation

[Diagram: the collection, containing the set of relevant documents and the overlapping set of retrieved documents]

Precision = (# relevant retrieved) / (# retrieved)
Recall = (# relevant retrieved) / (# relevant in collection)
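These ratios are straightforward to compute. Below is a minimal Python sketch (the function and variable names are illustrative, not from the lecture), treating the retrieved and relevant documents as sets of IDs:

```python
# Minimal sketch: precision and recall over sets of document IDs.

def precision(retrieved, relevant):
    """Proportion of retrieved material actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Proportion of relevant material actually retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Hypothetical example: 6 documents retrieved, 10 relevant documents
# in the collection, 4 of the retrieved documents relevant.
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6"}
relevant = {"d1", "d2", "d3", "d4", "d7", "d8", "d9", "d10", "d11", "d12"}
print(precision(retrieved, relevant))  # 4/6 = 0.667
print(recall(retrieved, relevant))     # 4/10 = 0.4
```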
Precision/Recall Curves

There is a tradeoff between precision and recall, so measure precision at different levels of recall.

[Plot: precision on the y-axis against recall on the x-axis, with measured points marked x]
Precision/Recall Curves

Difficult to determine which of these two hypothetical results is better:

[Plot: two hypothetical sets of precision/recall points on the same axes]
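One way to obtain the points on such a curve is to walk down a ranked result list and record precision each time another relevant document is found, i.e. at each achievable recall level. A minimal Python sketch (illustrative names, not lecture code):

```python
# Sketch: (recall, precision) points from a ranked list of document IDs.

def precision_recall_points(ranking, relevant):
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # precision at the rank where recall increases
            points.append((hits / len(relevant), hits / rank))
    return points

ranking = ["d2", "d9", "d1", "d5", "d3"]   # order of predicted relevance
relevant = {"d1", "d2", "d3"}
for r, p in precision_recall_points(ranking, relevant):
    print(f"recall={r:.2f}  precision={p:.2f}")
# recall=0.33 precision=1.00
# recall=0.67 precision=0.67
# recall=1.00 precision=0.60
```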
Document Cutoff Levels

Another way to evaluate:
- Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500
- Measure precision at each of these levels
- Take (weighted) average over results

This is a way to focus on high precision (see the sketch below).
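A minimal Python sketch of precision at fixed cutoffs (illustrative names; in practice the per-cutoff values are then averaged over a set of queries):

```python
# Sketch: precision at a document cutoff level k.

def precision_at_k(ranking, relevant, k):
    """Precision over the top k documents of a ranked list."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

ranking = ["d2", "d9", "d1", "d5", "d3", "d7", "d8", "d4", "d6", "d10"]
relevant = {"d1", "d2", "d3", "d4"}
for k in (5, 10):
    print(k, precision_at_k(ranking, relevant, k))
# k=5: d2, d1, d3 are in the top 5 -> 0.6;  k=10: all four found -> 0.4
```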
The E-Measure

Combine precision and recall into one number (van Rijsbergen 79):

E = 1 − ((b² + 1) · P · R) / (b² · P + R)

P = precision
R = recall
b = measure of relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall.
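A minimal Python sketch of the formula (variable names follow the slide; note that with b = 1, 1 − E is the familiar F-measure, the harmonic mean of P and R):

```python
# Sketch of the E-measure: E = 1 - (b^2 + 1)PR / (b^2 P + R).

def e_measure(p, r, b=1.0):
    """Lower is better; b < 1 weights precision more, b > 1 weights recall more."""
    if p == 0.0 and r == 0.0:
        return 1.0  # avoid division by zero; worst possible score
    return 1.0 - (b * b + 1.0) * p * r / (b * b * p + r)

print(e_measure(0.5, 0.5))          # 0.5 (with b = 1, 1 - E = F = 0.5)
print(e_measure(0.8, 0.2, b=0.5))   # 0.5: low recall hurts less when b < 1
```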
Expected Search Length

- Documents are presented in order of predicted relevance
- Search length: number of non-relevant documents that the user must scan through in order to have their information need satisfied
- The shorter the better

Example ranking (y = relevant, n = not relevant):

Rank:      1 2 3 4 5 6 7 8 9 10 11 12 13
Relevant?  n y n y y y y n y n  n  n  n

If the user needs n = 2 relevant documents, the search length is 2 (the non-relevant documents at ranks 1 and 3 are scanned); for n = 6 it is 3 (ranks 1, 3, and 8).
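A minimal Python sketch of the computation, using the relevance sequence above (names are illustrative):

```python
# Sketch: search length = non-relevant documents scanned before
# the user has seen n relevant ones.

def search_length(rels, n):
    non_relevant = 0
    found = 0
    for is_relevant in rels:
        if is_relevant:
            found += 1
            if found == n:
                return non_relevant
        else:
            non_relevant += 1
    return non_relevant  # fewer than n relevant documents exist

rels = [c == "y" for c in "nynyyyynynnnn"]  # ranking from the slide
print(search_length(rels, 2))  # 2 (non-relevant docs at ranks 1 and 3)
print(search_length(rels, 6))  # 3 (ranks 1, 3, and 8)
```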
What to Evaluate?

Effectiveness:
- Difficult to measure
- Recall and Precision are one way
- What might be others?
How to Evaluate?
TREC: Text REtrieval Conference/Competition

- Run by NIST (National Institute of Standards & Technology); 1997 was the 6th year
- Collection: 3 gigabytes, >1 million docs
  - Newswire & full-text news (AP, WSJ, Ziff)
  - Government documents (Federal Register)
- Queries + relevance judgments
  - Queries devised and judged by "Information Specialists"
  - Relevance judgments done only for those documents retrieved -- not the entire collection!
- Competition
  - Various research and commercial groups compete
  - Results judged on precision and recall, measured down to a cutoff of 1000 retrieved documents
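For concreteness: TREC distributes its relevance judgments as plain-text "qrels" files, one judgment per line (topic, iteration, document number, relevance). A minimal reader might look like the sketch below; the file name is hypothetical.

```python
# Sketch: load TREC qrels into a topic -> set-of-relevant-docnos map.

relevant_by_topic = {}
with open("qrels.trec6.adhoc.txt") as f:   # hypothetical file name
    for line in f:
        topic, _iteration, docno, rel = line.split()
        if int(rel) > 0:  # judged relevant
            relevant_by_topic.setdefault(topic, set()).add(docno)
```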
Sample TREC queries (topics)

<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description:
A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

<narr> Narrative:
A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
TREC

Benefits:
- Made research systems scale to large collections (pre-WWW)
- Allows for somewhat controlled comparisons

Drawbacks:
- Emphasis on high recall, which may be unrealistic for what most users want
- Very long queries, also unrealistic
- Comparisons still difficult to make, because systems are quite different on many dimensions
- Focus on batch ranking rather than interaction
- No focus on the WWW
TREC is changing

Emphasis on specialized "tracks":
- Interactive track
- Natural Language Processing (NLP) track
- Multilingual tracks (Chinese, Spanish)
- Filtering track
- High-Precision
- High-Performance

www-nlpir.nist.gov/TREC
TREC Results

- Differ each year
- For the main track:
  - Best systems not statistically significantly different
  - Small differences sometimes have big effects
    - how good was the hyphenation model
    - how was document length taken into account
  - Systems were optimized for longer queries and all performed worse for shorter, more realistic queries
- Excitement is in the new tracks
Blair and Maron 1985

- A classic study of retrieval effectiveness
  - earlier studies were on unrealistically small collections
- Studied an archive of documents for a legal suit
  - ~350,000 pages of text
  - 40 queries
  - focus on high recall
- Used IBM's STAIRS full-text system
- Main result: the system retrieved less than 20% of the relevant documents for a particular information need, when the lawyers thought they had 75%
- But many queries had very high precision
Blair and Maron, cont.

How they estimated recall:
- generated partially random samples of unseen documents
- had users (unaware these were random) judge them for relevance

Other results:
- the two lawyers' searches had similar performance
- the lawyers' recall was not much different from the paralegals'
Blair and Maron, cont.

Why recall was low:
- users can't foresee the exact words and phrases that will indicate relevant documents
  - "accident" referred to by those responsible as "event," "incident," "situation," "problem," ...
  - differing technical terminology
  - slang, misspellings
- perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied
Still to come:
- Evaluating user interfaces