
Page 1:

Search and Retrieval: Relevance and Evaluation

Prof. Marti Hearst

SIMS 202, Lecture 20

Page 2:

Finding Out About

Three phases:
Asking of a question
Construction of an answer
Assessment of the answer

Part of an iterative process

Page 3:

Today

Relevance
Evaluation of IR Systems
Precision vs. Recall
Cutoff Points
Test Collections/TREC
Blair & Maron Study

Page 4:

Evaluation

Why Evaluate?
What to Evaluate?
How to Evaluate?

Page 5:

Why Evaluate?

Page 6:

Why Evaluate?

Determine if the system is desirable
Make comparative assessments
Others?

Page 7:

What to Evaluate?

Page 8:

What to Evaluate?

How much learned about the collection
How much learned about a topic
How much of the information need is satisfied
How inviting the system is

Page 9:

What to Evaluate?

What can be measured that reflects users’ ability to use the system? (Cleverdon 66)
Coverage of Information
Form of Presentation
Effort required/Ease of Use
Time and Space Efficiency
Effectiveness:
Recall: proportion of relevant material actually retrieved
Precision: proportion of retrieved material actually relevant

Page 10:

Assessing the Answer

How well does it answer the question?
Complete answer? Partial?
Background information?
Hints for further exploration?

How relevant is it to the user?

Page 11:

Relevance

Subjective
Measurable to some extent
How often do people agree a document is relevant to a query?

Page 12:

Relevance

In what ways can a document be relevant to a query?
Answer a precise question precisely.
Partially answer the question.
Suggest a source for more information.
Give background information.
Remind the user of other knowledge.
Others ...

Page 13:

Retrieved vs. Relevant Documents

(Diagram: the set of retrieved documents)

Page 14:

Retrieved vs. Relevant Documents

(Diagram: the retrieved set overlapping the relevant set)

Page 15:

Retrieved vs. Relevant Documents

(Diagram: a retrieved set lying mostly within the relevant set: high precision)

Page 16:

Retrieved vs. Relevant Documents

(Diagram: a retrieved set covering most of the relevant set: high recall)

Page 17:

Retrieved vs. Relevant Documents

(Diagram: relevant and retrieved sets, with the high-precision and high-recall regions marked)

Recall = |RelRet| / |Rel|

Page 18:

Retrieved vs. Relevant Documents

(Diagram: relevant and retrieved sets, with the high-precision and high-recall regions marked)

Precision = |RelRet| / |Ret|

Page 19:

Why Precision and Recall?

Get as much good stuff while at the same time getting as little junk as possible

Page 20:

Standard IR Evaluation

Precision = (# relevant retrieved) / (# retrieved)

Recall = (# relevant retrieved) / (# relevant in collection)

(Diagram: retrieved documents shown as a subset of the collection)
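
Not from the original slides: a minimal Python sketch of these two ratios, computed from sets of document IDs. The function name and the example sets are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) given retrieved and relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    rel_ret = retrieved & relevant                       # relevant retrieved
    precision = len(rel_ret) / len(retrieved) if retrieved else 0.0
    recall = len(rel_ret) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 10 relevant in the collection, 3 of them retrieved:
# precision = 3/4 = 0.75, recall = 3/10 = 0.3
relevant_docs = {"d1", "d2", "d3"} | {f"r{i}" for i in range(7)}
print(precision_recall({"d1", "d2", "d3", "d4"}, relevant_docs))
```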

Page 21:

Precision/Recall Curves

There is a tradeoff between Precision and Recall
So measure Precision at different levels of Recall

(Plot: precision on the y-axis against recall on the x-axis, with measured points marked)
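
Not from the original slides: a sketch of how such points can be produced, recording a (recall, precision) pair each time a relevant document is encountered while walking down a ranked result list. The ranking and judgments are made up.

```python
def precision_recall_points(ranked_ids, relevant_ids):
    """(recall, precision) at each relevant document in a ranked list."""
    relevant_ids = set(relevant_ids)
    points, hits = [], 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            points.append((hits / len(relevant_ids), hits / rank))
    return points

# Hypothetical ranking of nine documents, of which d2, d4, d9 are relevant
print(precision_recall_points(
    [f"d{i}" for i in range(1, 10)], {"d2", "d4", "d9"}))
# [(0.33.., 0.5), (0.66.., 0.5), (1.0, 0.33..)]
```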

Page 22:

Precision/Recall Curves

Difficult to determine which of these two hypothetical results is better:

(Plot: two hypothetical precision/recall curves)

Page 23:

Precision/Recall Curves

Page 24:

Document Cutoff Levels

Another way to evaluate:
Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500
Measure precision at each of these levels
Take a (weighted) average over the results

This is a way to focus on high precision
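
Not from the original slides: a sketch of precision at fixed cutoff levels (the cutoffs listed above) plus a simple unweighted average over them. The names and example data are illustrative.

```python
CUTOFFS = (5, 10, 20, 50, 100, 500)

def precision_at_cutoffs(ranked_ids, relevant_ids, cutoffs=CUTOFFS):
    """Precision of the top-k results for each cutoff k."""
    relevant_ids = set(relevant_ids)
    return {k: sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k
            for k in cutoffs}

# Hypothetical ranking of 500 documents, four of which are relevant
scores = precision_at_cutoffs([f"d{i}" for i in range(1, 501)],
                              {"d1", "d3", "d7", "d42"})
print(scores)                                  # precision at each cutoff
print(sum(scores.values()) / len(scores))      # unweighted average
```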

Page 25:

The E-Measure

Combine Precision and Recall into one number (van Rijsbergen 79):

E = 1 - ((b^2 + 1) P R) / (b^2 P + R)

P = precision
R = recall
b = measure of relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall
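
Not from the original slides: the E-measure written out as a small function, following the formula above. Lower E is better; with b = 1 it equals one minus the familiar F-measure.

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E = 1 - ((b^2 + 1) P R) / (b^2 P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 1.0                      # worst case: no relevant documents retrieved
    return 1.0 - ((b ** 2 + 1) * precision * recall) / (b ** 2 * precision + recall)

# b = 0.5: precision counts twice as much as recall (as on the slide)
print(e_measure(0.75, 0.30, b=0.5))
print(e_measure(0.75, 0.30, b=1.0))     # equals 1 - F1
```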

Page 26:

Expected Search Length

Documents are presented in order of predicted relevance

Search length: the number of non-relevant documents that the user must scan through in order to have their information need satisfied

The shorter the better

Below: if the user needs n = 2 relevant documents, search length = 2; if n = 6, search length = 3

Rank:      1 2 3 4 5 6 7 8 9 10 11 12 13
Relevant?  n y n y y y y n y n  n  n  n
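
Not from the original slides: a sketch that computes the search length for the ranking above; the values in the comments reproduce the example (2 non-relevant documents scanned when the user needs 2 relevant ones, 3 when they need 6).

```python
def search_length(judgments, n_needed):
    """Non-relevant documents scanned before the n-th relevant one is found."""
    non_relevant = relevant = 0
    for is_relevant in judgments:
        if is_relevant:
            relevant += 1
            if relevant == n_needed:
                return non_relevant
        else:
            non_relevant += 1
    return non_relevant                 # need not satisfied by this ranking

# The ranking from the slide: n y n y y y y n y n n n n
ranking = [c == "y" for c in "nynyyyynynnnn"]
print(search_length(ranking, 2))        # 2 (non-relevant at ranks 1 and 3)
print(search_length(ranking, 6))        # 3 (non-relevant at ranks 1, 3 and 8)
```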

Page 27:

What to Evaluate?

Effectiveness
Difficult to measure
Recall and Precision are one way
What might be others?

Page 28:

How to Evaluate?

Page 29:

TREC

Text REtrieval Conference/Competition
Run by NIST (National Institute of Standards & Technology)
1997 was the 6th year

Collection: 3 Gigabytes, >1 Million Docs
Newswire & full text news (AP, WSJ, Ziff)
Government documents (Federal Register)

Queries + Relevance Judgments
Queries devised and judged by “Information Specialists”
Relevance judgments done only for those documents retrieved -- not the entire collection!

Competition
Various research and commercial groups compete
Results judged on precision and recall, going up to a recall level of 1000 documents

Page 30:

Sample TREC queries (topics)

<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

<narr> Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

Page 31:

TREC

Benefits:
Made research systems scale to large collections (pre-WWW)
Allows for somewhat controlled comparisons

Drawbacks:
Emphasis on high recall, which may be unrealistic for what most users want
Very long queries, also unrealistic
Comparisons still difficult to make, because systems are quite different on many dimensions
Focus on batch ranking rather than interaction
No focus on the WWW

Page 32:

TREC is changing

Emphasis on specialized “tracks”
Interactive track
Natural Language Processing (NLP) track
Multilingual tracks (Chinese, Spanish)
Filtering track
High-Precision
High-Performance

www-nlpir.nist.gov/TREC

Page 33:

TREC Results

Differ each year

For the main track:
Best systems not statistically significantly different
Small differences sometimes have big effects:
how good was the hyphenation model
how was document length taken into account
Systems were optimized for longer queries and all performed worse for shorter, more realistic queries

Excitement is in the new tracks

Page 34:

Blair and Maron 1985

A classic study of retrieval effectiveness
Earlier studies were on unrealistically small collections

Studied an archive of documents for a legal suit
~350,000 pages of text
40 queries
Focus on high recall

Used IBM’s STAIRS full-text system

Main result: the system retrieved less than 20% of the relevant documents for a particular information need, when the lawyers thought they had 75%

But many queries had very high precision

Page 35:

Blair and Maron, cont.

How they estimated recall:
Generated partially random samples of unseen documents
Had users (unaware these were random) judge them for relevance

Other results:
The two lawyers’ searches had similar performance
The lawyers’ recall was not much different from the paralegals’

Page 36:

Blair and Maron, cont.

Why recall was low:
Users can’t foresee exact words and phrases that will indicate relevant documents
“accident” referred to by those responsible as “event,” “incident,” “situation,” “problem,” ...
Differing technical terminology
Slang, misspellings

Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

Page 37:

Still to come:

Evaluating user interfaces