TRANSCRIPT
Search and Retrieval: Relevance and Evaluation
Prof. Marti Hearst
SIMS 202, Lecture 20
Finding Out About

Three phases:
- Asking of a question
- Construction of an answer
- Assessment of the answer

Part of an iterative process
Today

- Relevance
- Evaluation of IR Systems
  - Precision vs. Recall
  - Cutoff Points
  - Test Collections/TREC
  - Blair & Maron Study
Evaluation

- Why Evaluate?
- What to Evaluate?
- How to Evaluate?
Why Evaluate?

- Determine if the system is desirable
- Make comparative assessments
- Others?
What to Evaluate?

- How much learned about the collection
- How much learned about a topic
- How much of the information need is satisfied
- How inviting the system is
What to Evaluate?

What can be measured that reflects users' ability to use the system? (Cleverdon 66)
- Coverage of Information
- Form of Presentation
- Effort required/Ease of Use
- Time and Space Efficiency
- Recall: proportion of relevant material actually retrieved
- Precision: proportion of retrieved material actually relevant

(Recall and Precision together measure effectiveness.)
Assessing the Answer

How well does it answer the question?
- Complete answer? Partial?
- Background information?
- Hints for further exploration?

How relevant is it to the user?
Relevance

- Subjective
- Measurable to some extent: how often do people agree a document is relevant to a query?
Relevance

In what ways can a document be relevant to a query?
- Answer precise question precisely.
- Partially answer question.
- Suggest a source for more information.
- Give background information.
- Remind the user of other knowledge.
- Others ...
Retrieved vs. Relevant Documents

[Venn diagrams, built up over several slides: the set of retrieved documents overlaps the set of relevant documents. High precision: most of the retrieved set falls within the relevant set. High recall: most of the relevant set falls within the retrieved set.]

Recall = |RelRet| / |Rel|
Precision = |RelRet| / |Ret|
Why Precision and Recall?

Get as much good stuff as possible while at the same time getting as little junk as possible.
Standard IR Evaluation

[Diagram: the collection, containing the set of relevant documents and the overlapping set of retrieved documents]

Precision = (# relevant retrieved) / (# retrieved)
Recall = (# relevant retrieved) / (# relevant in collection)
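These ratios are straightforward to compute. Below is a minimal Python sketch (the function and variable names are illustrative, not from the lecture), treating the retrieved and relevant documents as sets of IDs:

```python
# Minimal sketch: precision and recall over sets of document IDs.

def precision(retrieved, relevant):
    """Proportion of retrieved material actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Proportion of relevant material actually retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Hypothetical example: 6 documents retrieved, 10 relevant documents
# in the collection, 4 of the retrieved documents relevant.
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6"}
relevant = {"d1", "d2", "d3", "d4", "d7", "d8", "d9", "d10", "d11", "d12"}
print(precision(retrieved, relevant))  # 4/6 = 0.667
print(recall(retrieved, relevant))     # 4/10 = 0.4
```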
Precision/Recall Curves

There is a tradeoff between precision and recall, so measure precision at different levels of recall.

[Plot: precision on the y-axis against recall on the x-axis, with measured points marked x]
Precision/Recall Curves

Difficult to determine which of these two hypothetical results is better:

[Plot: two hypothetical sets of precision/recall points on the same axes]
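One way to obtain the points on such a curve is to walk down a ranked result list and record precision each time another relevant document is found, i.e. at each achievable recall level. A minimal Python sketch (illustrative names, not lecture code):

```python
# Sketch: (recall, precision) points from a ranked list of document IDs.

def precision_recall_points(ranking, relevant):
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # precision at the rank where recall increases
            points.append((hits / len(relevant), hits / rank))
    return points

ranking = ["d2", "d9", "d1", "d5", "d3"]   # order of predicted relevance
relevant = {"d1", "d2", "d3"}
for r, p in precision_recall_points(ranking, relevant):
    print(f"recall={r:.2f}  precision={p:.2f}")
# recall=0.33 precision=1.00
# recall=0.67 precision=0.67
# recall=1.00 precision=0.60
```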
Document Cutoff Levels

Another way to evaluate:
- Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500
- Measure precision at each of these levels
- Take (weighted) average over results

This is a way to focus on high precision (see the sketch below).
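A minimal Python sketch of precision at fixed cutoffs (illustrative names; in practice the per-cutoff values are then averaged over a set of queries):

```python
# Sketch: precision at a document cutoff level k.

def precision_at_k(ranking, relevant, k):
    """Precision over the top k documents of a ranked list."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

ranking = ["d2", "d9", "d1", "d5", "d3", "d7", "d8", "d4", "d6", "d10"]
relevant = {"d1", "d2", "d3", "d4"}
for k in (5, 10):
    print(k, precision_at_k(ranking, relevant, k))
# k=5: d2, d1, d3 are in the top 5 -> 0.6;  k=10: all four found -> 0.4
```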
The E-Measure

Combine precision and recall into one number (van Rijsbergen 79):

E = 1 − ((b² + 1) · P · R) / (b² · P + R)

P = precision
R = recall
b = measure of relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall.
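A minimal Python sketch of the formula (variable names follow the slide; note that with b = 1, 1 − E is the familiar F-measure, the harmonic mean of P and R):

```python
# Sketch of the E-measure: E = 1 - (b^2 + 1)PR / (b^2 P + R).

def e_measure(p, r, b=1.0):
    """Lower is better; b < 1 weights precision more, b > 1 weights recall more."""
    if p == 0.0 and r == 0.0:
        return 1.0  # avoid division by zero; worst possible score
    return 1.0 - (b * b + 1.0) * p * r / (b * b * p + r)

print(e_measure(0.5, 0.5))          # 0.5 (with b = 1, 1 - E = F = 0.5)
print(e_measure(0.8, 0.2, b=0.5))   # 0.5: low recall hurts less when b < 1
```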
Expected Search Length

- Documents are presented in order of predicted relevance
- Search length: number of non-relevant documents that the user must scan through in order to have their information need satisfied
- The shorter the better

Example ranking (y = relevant, n = not relevant):

Rank:      1 2 3 4 5 6 7 8 9 10 11 12 13
Relevant?  n y n y y y y n y n  n  n  n

If the user needs n = 2 relevant documents, the search length is 2 (the non-relevant documents at ranks 1 and 3 are scanned); for n = 6 it is 3 (ranks 1, 3, and 8).
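A minimal Python sketch of the computation, using the relevance sequence above (names are illustrative):

```python
# Sketch: search length = non-relevant documents scanned before
# the user has seen n relevant ones.

def search_length(rels, n):
    non_relevant = 0
    found = 0
    for is_relevant in rels:
        if is_relevant:
            found += 1
            if found == n:
                return non_relevant
        else:
            non_relevant += 1
    return non_relevant  # fewer than n relevant documents exist

rels = [c == "y" for c in "nynyyyynynnnn"]  # ranking from the slide
print(search_length(rels, 2))  # 2 (non-relevant docs at ranks 1 and 3)
print(search_length(rels, 6))  # 3 (ranks 1, 3, and 8)
```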
What to Evaluate?

Effectiveness:
- Difficult to measure
- Recall and Precision are one way
- What might be others?
How to Evaluate?
TREC: Text REtrieval Conference/Competition

- Run by NIST (National Institute of Standards & Technology); 1997 was the 6th year
- Collection: 3 gigabytes, >1 million docs
  - Newswire & full-text news (AP, WSJ, Ziff)
  - Government documents (Federal Register)
- Queries + relevance judgments
  - Queries devised and judged by "Information Specialists"
  - Relevance judgments done only for those documents retrieved -- not the entire collection!
- Competition
  - Various research and commercial groups compete
  - Results judged on precision and recall, measured down to a cutoff of 1000 retrieved documents
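For concreteness: TREC distributes its relevance judgments as plain-text "qrels" files, one judgment per line (topic, iteration, document number, relevance). A minimal reader might look like the sketch below; the file name is hypothetical.

```python
# Sketch: load TREC qrels into a topic -> set-of-relevant-docnos map.

relevant_by_topic = {}
with open("qrels.trec6.adhoc.txt") as f:   # hypothetical file name
    for line in f:
        topic, _iteration, docno, rel = line.split()
        if int(rel) > 0:  # judged relevant
            relevant_by_topic.setdefault(topic, set()).add(docno)
```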
Sample TREC queries (topics)

<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description:
A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

<narr> Narrative:
A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
TREC

Benefits:
- Made research systems scale to large collections (pre-WWW)
- Allows for somewhat controlled comparisons

Drawbacks:
- Emphasis on high recall, which may be unrealistic for what most users want
- Very long queries, also unrealistic
- Comparisons still difficult to make, because systems are quite different on many dimensions
- Focus on batch ranking rather than interaction
- No focus on the WWW
TREC is changing

Emphasis on specialized "tracks":
- Interactive track
- Natural Language Processing (NLP) track
- Multilingual tracks (Chinese, Spanish)
- Filtering track
- High-Precision
- High-Performance

www-nlpir.nist.gov/TREC
TREC Results

- Differ each year
- For the main track:
  - Best systems not statistically significantly different
  - Small differences sometimes have big effects
    - how good was the hyphenation model
    - how was document length taken into account
  - Systems were optimized for longer queries and all performed worse for shorter, more realistic queries
- Excitement is in the new tracks
Blair and Maron 1985

- A classic study of retrieval effectiveness
  - earlier studies were on unrealistically small collections
- Studied an archive of documents for a legal suit
  - ~350,000 pages of text
  - 40 queries
  - focus on high recall
- Used IBM's STAIRS full-text system
- Main result: the system retrieved less than 20% of the relevant documents for a particular information need, when the lawyers thought they had 75%
- But many queries had very high precision
Blair and Maron, cont.

How they estimated recall:
- generated partially random samples of unseen documents
- had users (unaware these were random) judge them for relevance

Other results:
- the two lawyers' searches had similar performance
- the lawyers' recall was not much different from the paralegals'
Blair and Maron, cont.

Why recall was low:
- users can't foresee the exact words and phrases that will indicate relevant documents
  - "accident" referred to by those responsible as "event," "incident," "situation," "problem," ...
  - differing technical terminology
  - slang, misspellings
- perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied
Still to come:
- Evaluating user interfaces