Retrieval Evaluation J. H. Wang Mar. 18, 2008


Page 1

Retrieval Evaluation

J. H. Wang, Mar. 18, 2008

Page 2

Outline

• Chap. 3, Retrieval Evaluation
  – Retrieval Performance Evaluation
  – Reference Collections

Page 3

Introduction

• Types of evaluation
  – Functional analysis phase and error analysis phase
  – Performance evaluation
• Performance evaluation
  – Response time and space required
• Retrieval performance evaluation
  – Evaluation of how precise the answer set is

Page 4

Retrieval Performance Evaluation

• Query in batch mode vs. interactive sessions

[Figure: a document collection containing the set of relevant documents R and the answer set A; their intersection Ra is the set of relevant documents in the answer set. The answer set is sorted by relevance.]

• Recall = |Ra| / |R|
• Precision = |Ra| / |A|
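The definitions above translate directly into a few lines of code. A minimal Python sketch, using made-up document identifiers for the relevant set R and the answer set A:

```python
# Hypothetical example: recall and precision for a single query.
R = {"d3", "d56", "d129"}           # relevant documents in the collection (assumed)
A = ["d123", "d84", "d56", "d6"]    # answer set returned by the system (assumed)

Ra = R.intersection(A)              # relevant documents that appear in the answer set

recall = len(Ra) / len(R)           # |Ra| / |R|
precision = len(Ra) / len(A)        # |Ra| / |A|
print(f"recall={recall:.2f}, precision={precision:.2f}")   # recall=0.33, precision=0.25
```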

Page 5

Precision versus Recall Curve

• Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
• P = 100% at R = 10%
• P = 66% at R = 20%
• P = 50% at R = 30%

Ranking for query q (relevant documents marked with *):
 1. d123*   2. d84     3. d56*    4. d6      5. d8
 6. d9*     7. d511    8. d129    9. d187   10. d25*
11. d38    12. d48    13. d250   14. d11    15. d3*

Usually based on 11 standard recall levels: 0%, 10%, ..., 100%
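The (recall, precision) points quoted above can be recomputed from the ranking. A minimal Python sketch (not the book's code; document identifiers are taken from the example):

```python
# Observed (recall, precision) points for the example ranking of Fig. 3.2,
# computed each time a relevant document is seen in the ranking.
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]

points = []
seen_relevant = 0
for i, doc in enumerate(ranking, start=1):
    if doc in Rq:
        seen_relevant += 1
        points.append((seen_relevant / len(Rq), seen_relevant / i))   # (recall, precision)

print([(r, round(p, 2)) for r, p in points])
# [(0.1, 1.0), (0.2, 0.67), (0.3, 0.5), (0.4, 0.4), (0.5, 0.33)]
```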

Page 6

Precision versus Recall Curve

• For a single query (Fig. 3.2)

Page 7

Average Over Multiple Queries

• $\bar{P}(r)$ = average precision at recall level $r$

• $N_q$ = number of queries used

• $P_i(r)$ = precision at recall level $r$ for the $i$-th query

$$\bar{P}(r) = \frac{1}{N_q} \sum_{i=1}^{N_q} P_i(r)$$
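A minimal sketch of this averaging step, assuming each query's (interpolated) precision values at the 11 standard recall levels are already available; the numbers below are made up:

```python
# Average precision at each standard recall level over a set of queries.
per_query = [
    [1.0, 1.0, 0.67, 0.5, 0.4, 0.33, 0.33, 0.2, 0.2, 0.1, 0.1],   # query 1 (assumed)
    [0.8, 0.8, 0.6, 0.6, 0.5, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1],      # query 2 (assumed)
]
Nq = len(per_query)
avg = [sum(p[j] for p in per_query) / Nq for j in range(11)]      # average at level j/10
print([round(x, 2) for x in avg])
```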

Page 8

Interpolated Precision

• Rq = {d3, d56, d129}

• P = 33% at R = 33%
• P = 25% at R = 66%
• P = 20% at R = 100%

• Interpolation: $P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$

Ranking for query q (relevant documents marked with *):
 1. d123    2. d84     3. d56*    4. d6      5. d8
 6. d9      7. d511    8. d129*   9. d187   10. d25
11. d38    12. d48    13. d250   14. d11    15. d3*

Page 9

Interpolated Precision

• Let $r_j$, $j \in \{0, 1, 2, \ldots, 10\}$, denote the $j$-th standard recall level

• $P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$
  – For the example above: interpolated precision is 33% at recall levels 0%-30%, 25% at 40%-60%, and 20% at 70%-100% (a code sketch follows)
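A minimal Python sketch of the 11-point interpolation for this example. The interpolated precision at level $r_j$ is taken here as the maximum observed precision at any recall greater than or equal to $r_j$, which reproduces the values quoted above:

```python
# 11-point interpolated precision for the example with Rq = {d3, d56, d129}.
Rq = {"d3", "d56", "d129"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]

observed = []                        # (recall, precision) at each relevant document
seen = 0
for i, doc in enumerate(ranking, start=1):
    if doc in Rq:
        seen += 1
        observed.append((seen / len(Rq), seen / i))

levels = [j / 10 for j in range(11)]                     # 0%, 10%, ..., 100%
interpolated = [max((p for r, p in observed if r >= rj), default=0.0)
                for rj in levels]
print([round(p, 2) for p in interpolated])
# [0.33, 0.33, 0.33, 0.33, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]
```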

Page 10

Average Recall vs. Precision Figure

Page 11

Single Value Summaries

• Average precision versus recall
  – Compares retrieval algorithms over a set of example queries
• Sometimes we need to compare performance on individual queries
  – Averaging precision over many queries might disguise important anomalies in the retrieval algorithms
  – We might be interested in whether one algorithm outperforms the other for each query
• Need a single value summary
  – The single value should be interpreted as a summary of the corresponding precision versus recall curve

Page 12

Single Value Summaries

• Average precision at seen relevant documents
  – Average the precision figures obtained after each new relevant document is observed
  – Example (Fig. 3.2): (1 + 0.66 + 0.5 + 0.4 + 0.3) / 5 = 0.57
  – This measure favors systems which retrieve relevant documents quickly (i.e., early in the ranking)
• R-precision
  – The precision at the R-th position in the ranking
  – R: the total number of relevant documents for the current query (the number of documents in Rq)
  – Fig. 3.2: R = 10, value = 0.4
  – Fig. 3.3: R = 3, value = 0.33
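Both single value summaries are easy to compute from a ranked list. A minimal Python sketch for the example of Fig. 3.2 (the helper names are ours, not the book's):

```python
# Average precision at seen relevant documents and R-precision for one ranking.
def avg_precision_at_seen_relevant(ranking, relevant):
    precisions, seen = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            seen += 1
            precisions.append(seen / i)      # precision right after each relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(ranking, relevant):
    R = len(relevant)                        # total number of relevant documents
    return sum(1 for doc in ranking[:R] if doc in relevant) / R

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
print(round(avg_precision_at_seen_relevant(ranking, Rq), 2))   # 0.58 (0.57 on the slide, which rounds each term first)
print(r_precision(ranking, Rq))                                # 0.4 (R = 10)
```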

Page 13

Precision Histograms

• Use R-precision measures to compare the retrieval history of two algorithms through visual inspection

• $RP_{A/B}(i) = RP_A(i) - RP_B(i)$

[Figure: precision histogram showing $RP_{A/B}(i)$ for query numbers 1-10; the vertical axis (R-precision A/B) runs from -1.5 to 1.5. Positive bars indicate queries where algorithm A has the higher R-precision; negative bars favor algorithm B.]
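A minimal sketch of the quantity plotted in such a histogram, using assumed per-query R-precision values for two hypothetical algorithms A and B:

```python
# R-precision difference RP_A/B(i) = RP_A(i) - RP_B(i) for each query i.
rp_a = [0.4, 0.6, 0.3, 0.8, 0.5, 0.2, 0.7, 0.4, 0.6, 0.5]   # algorithm A (assumed values)
rp_b = [0.5, 0.4, 0.3, 0.6, 0.7, 0.4, 0.5, 0.6, 0.3, 0.5]   # algorithm B (assumed values)

for i, (a, b) in enumerate(zip(rp_a, rp_b), start=1):
    d = a - b
    better = "A" if d > 0 else ("B" if d < 0 else "tie")
    print(f"query {i}: RP_A/B = {d:+.2f}  ({better})")
```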

Page 14

Summary Table Statistics

• Single value measures can be stored in a table regarding the set of all queries
  – the number of queries
  – the total number of documents retrieved by all queries
  – the total number of relevant documents which were effectively retrieved when all queries are considered
  – the total number of relevant documents which could have been retrieved by all queries
  – …

Page 15

Precision and Recall Appropriateness

• Proper estimation of maximum recall for a query requires knowledge of all documents in the collection

• Recall and precision are related measures which capture different aspects of the set of retrieved documents

• Measures which quantify the informativeness of the retrieval process might be more appropriate

• Recall and precision are easy to define when a linear ordering of the retrieved documents is enforced

Page 16

Alternative Measures

• The Harmonic Mean
  – Values in [0, 1]

$$F(j) = \frac{2}{\dfrac{1}{r(j)} + \dfrac{1}{P(j)}}$$

• The E Measure
  – The parameter b specifies the relative importance of recall and precision
  – b = 1: E(j) = 1 - F(j), the complement of the harmonic mean
  – b > 1: the user is more interested in precision
  – b < 1: the user is more interested in recall

$$E(j) = 1 - \frac{1 + b^2}{\dfrac{b^2}{r(j)} + \dfrac{1}{P(j)}}$$

Here r(j) and P(j) are the recall and precision at the j-th document in the ranking.
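A minimal Python sketch of both measures as written above; the recall and precision values are made up:

```python
# Harmonic mean F and the E measure for a given recall/precision pair.
def f_measure(recall, precision):
    if recall == 0 or precision == 0:
        return 0.0
    return 2.0 / (1.0 / recall + 1.0 / precision)

def e_measure(recall, precision, b=1.0):
    if recall == 0 or precision == 0:
        return 1.0
    return 1.0 - (1.0 + b * b) / (b * b / recall + 1.0 / precision)

r, p = 0.5, 0.4                          # assumed example values
print(round(f_measure(r, p), 3))         # 0.444
print(round(e_measure(r, p, b=1), 3))    # 0.556 (= 1 - F when b = 1)
```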

Page 17

User-Oriented Measure

• Assumption: different users might have a different interpretation of which document is relevant

Page 18

User-Oriented Measure

• Coverage = |Rk| / |U|
• Novelty = |Ru| / (|Ru| + |Rk|)
  – U: the relevant documents previously known to the user; Rk: the retrieved relevant documents the user already knew of; Ru: the retrieved relevant documents previously unknown to the user (see the sketch below)
• A high coverage ratio indicates that the system is finding most of the relevant documents that the user expected to see
• A high novelty ratio indicates that the system is revealing many new relevant documents which were previously unknown to the user
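A minimal sketch under the definitions above, with made-up document sets:

```python
# Coverage and novelty for one query.
U = {"d1", "d2", "d3", "d4", "d5"}     # relevant docs previously known to the user (assumed)
A = {"d2", "d3", "d7", "d9"}           # retrieved documents judged relevant (assumed)

Rk = U & A                             # known relevant documents that were retrieved
Ru = A - U                             # retrieved relevant documents previously unknown

coverage = len(Rk) / len(U)            # |Rk| / |U|
novelty = len(Ru) / (len(Ru) + len(Rk))
print(f"coverage={coverage:.2f}, novelty={novelty:.2f}")   # coverage=0.40, novelty=0.50
```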

Page 19

Other Measures

• Relative recall: the ratio between the number of relevant documents found and the number of relevant documents the user expected to find

• Recall effort: the ratio between the number of relevant documents the user expected to find and the number of documents examined

• Others: expected search length, satisfaction, frustration

Page 20

Reference Collections

• Reference test collections for the evaluation of IR systems
  – TIPSTER/TREC: large size, thorough experimentation
  – CACM, ISI: historical importance
  – Cystic Fibrosis: a small collection, extensively studied by specialists before the relevance judgments were generated

Page 21

Criticisms of IR Research

• IR research lacks a solid formal framework as a basic foundation
  – This criticism is difficult to dismiss because of the subjectiveness associated with the task of deciding on the relevance of a document
• IR research lacks robust and consistent testbeds and benchmarks
  – Early experimentation was based on relatively small test collections, and there were no widely accepted benchmarks
  – In the early 1990s, the TREC conference, led by Donna Harman at NIST, was dedicated to experimentation with a large test collection

Page 22

TREC (Text REtrieval Conference)

• Initiated under the National Institute of Standards and Technology (NIST)
• Goals:
  – Provide a large test collection
  – Uniform scoring procedures
  – A forum for comparing results
• 7th TREC conference in 1998
  – Document collection: test collections, example information requests (topics), relevant documents
  – The benchmark tasks

Page 23

The Document Collection

• Tagged with SGML to allow easy parsing

<doc>
<docno>WSJ880406-0090</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<author>Janet Guyon (WSJ Staff)</author>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new
generation of phone service with broad…
</text>
</doc>
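Because the collection is tagged with SGML, a document like the one above can be parsed with very little code. A minimal Python sketch using regular expressions (not TREC's official tooling):

```python
import re

# Extract a few fields from a TREC-style SGML document.
sample = """<doc>
<docno>WSJ880406-0090</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<author>Janet Guyon (WSJ Staff)</author>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new
generation of phone service with broad...
</text>
</doc>"""

def parse_docs(sgml):
    docs = []
    for block in re.findall(r"<doc>(.*?)</doc>", sgml, re.S | re.I):
        def field(tag):
            m = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.S | re.I)
            return m.group(1).strip() if m else None
        docs.append({"docno": field("docno"), "headline": field("hl"), "text": field("text")})
    return docs

print(parse_docs(sample)[0]["docno"])   # WSJ880406-0090
```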

Page 24

TREC-1 to TREC-6 Documents

Disk  Contents          Size (MB)  Number of Docs  Words/Doc (Median)  Words/Doc (Mean)
1     WSJ, 1987-1989    267        98,732          245                 434
      AP, 1989          254        84,678          446                 473.9
      ZIFF              242        75,180          200                 473
      FR, 1989          260        25,960          391                 1,315.9
      DOE               184        226,087         111                 120.4
2     WSJ, 1990-1992    242        74,520          301                 508.4
      AP, 1988          237        79,919          438                 468.7
      ZIFF              175        56,920          182                 451.9
      FR, 1988          209        19,860          396                 1,378.1
3     SJMN, 1991        287        90,257          379                 453
      AP, 1990          237        78,321          451                 478.4
      ZIFF              345        161,021         122                 295.4
      PAT, 1993         243        6,711           4,445               5,391
4     FT, 1991-1994     564        210,158         316                 412.7
      FR, 1994          395        55,630          588                 644.7
      CR, 1993          235        27,922          288                 1,373.5
5     FBIS              470        130,471         322                 543.6
      LAT               475        131,896         351                 526.5
6     FBIS              490        120,653         348                 581.3

Page 25

The Example Information Requests (Topics)

• Each request (topic) is a description of an information need in natural language

• Each topic has a distinct topic number

<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
…
<nar> Narrative: A …
</top>

Page 26

TREC Topics: number of words per field (including stopwords)

                           Field        Min.  Max.  Avg.
TREC-1 (topics 51-100)     Total        44    250   107.4
                           Title        1     11    3.8
                           Description  5     41    17.9
                           Narrative    23    209   64.5
                           Concepts     4     111   21.2
TREC-2 (topics 101-150)    Total        54    231   130.8
                           Title        2     9     4.9
                           Description  6     41    18.7
                           Narrative    27    165   78.8
                           Concepts     3     88    28.5
TREC-3 (topics 151-200)    Total        49    180   103.4
                           Title        2     20    6.5
                           Description  9     42    22.3
                           Narrative    26    146   74.6
TREC-4 (topics 201-250)    Total        8     33    16.3
                           Description  8     33    16.3
TREC-5 (topics 251-300)    Total        29    213   82.7
                           Title        2     10    3.8
                           Description  6     40    15.7
                           Narrative    19    168   63.2
TREC-6 (topics 301-350)    Total        47    156   88.4
                           Title        1     5     2.7
                           Description  5     62    20.4
                           Narrative    17    142   65.3

Page 27

TREC ~ Relevance Assessment

• Relevance assessment
  – Pooling method
  – The documents in the pool are shown to human assessors, who decide on their relevance
• Two assumptions
  – The vast majority of the relevant documents is collected in the assembled pool
  – Documents that are not in the pool can be considered not relevant

Page 28

Pooling Method

• The set of relevant documents for each example information request is obtained from a pool of possibly relevant documents
  – This pool is created by taking the top K documents (usually K = 100) from the rankings generated by the various participating retrieval systems, as sketched below
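A minimal Python sketch of the pooling step, with two hypothetical runs and a small K for illustration:

```python
# Build the judgment pool from the top-K documents of each participating run.
def build_pool(rankings, k=100):
    """rankings: one ranked list of document ids per participating system."""
    pool = set()
    for run in rankings:
        pool.update(run[:k])          # top-K documents of this run
    return pool

run_a = ["d7", "d3", "d9", "d1", "d4"]    # assumed run from system A
run_b = ["d3", "d8", "d7", "d2", "d6"]    # assumed run from system B
print(sorted(build_pool([run_a, run_b], k=3)))   # ['d3', 'd7', 'd8', 'd9']
```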

Page 29

The (Benchmark) Tasks at the TREC Conferences

• Ad hoc task
  – Receive new requests and execute them on a pre-specified document collection
• Routing task
  – Receive test information requests and two document collections
  – First collection: for training and tuning the retrieval algorithm
  – Second collection: for testing the tuned retrieval algorithm

Page 30

Other Tracks

• *Chinese
• Filtering
• Interactive
• *NLP (natural language processing)
• Cross languages
• High precision
• Spoken document retrieval
• Query (TREC-7)
• Others: Web, Terabyte, SPAM, Blog, Novelty, Question Answering, HARD, …

Page 31

TREC ~ Evaluation

[Table: main tasks and tracks run at TREC-1 through TREC-7. Main tasks: routing and ad hoc. Tracks introduced over the years: confusion / spoken document retrieval, database merging, filtering, high precision, interactive, cross-language (initially Spanish and multilingual), Chinese, natural language processing, query, and very large corpus.]

Page 32

Evaluation Measures at the TREC Conferences

• Summary table statistics
• Recall-precision averages
• Document level averages*
• Average precision histogram

Page 33

The CACM Collection

• A small collection of computer science literature (1958-1979)
• Text of 3,204 documents
• Structured subfields
  – Word stems from the title and abstract sections
  – Categories
  – Direct references between articles: a list of document pairs [da, db]
  – Bibliographic coupling connections: a list of triples [d1, d2, ncited]
  – Number of co-citations for each pair of articles: [d1, d2, nciting]
• A unique environment for testing retrieval algorithms which are based on information derived from cross-citing patterns

Page 34

• The CACM collection also includes a set of 52 test information requests
  – Example: "What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?"
• Each request also includes two Boolean query formulations and a set of relevant documents
• Since the requests are fairly specific, the average number of relevant documents per request is small (around 15)
• As a result, precision and recall tend to be low

Page 35

The ISI Collection

• The 1,460 documents in the ISI test collection were selected from a previous collection assembled by Small at ISI (Institute of Scientific Information)

• The documents selected were those most cited in a cross-citation study done by Small

• The main purpose is to support investigation of similarities based on terms and on cross-citation patterns

Page 36

The Cystic Fibrosis (CF) Collection

• 1,239 documents indexed with the term "cystic fibrosis" in the MEDLINE database
• Information requests were generated by an expert in cystic fibrosis
• Relevance scores were provided by subject experts
  – 0: not relevant
  – 1: marginally relevant
  – 2: highly relevant

Page 37

Characteristics of CF collection

• The relevance scores were generated directly by human experts
• It includes a good number of information requests (relative to the collection size)
  – The respective query vectors present overlap among themselves
  – This allows experimentation with retrieval strategies which take advantage of past query sessions to improve retrieval performance

Page 38

Trends and Research Issues

• Interactive user interfaces
  – A general belief: effective retrieval is highly dependent on obtaining proper feedback from the user
  – Deciding which evaluation measures are most appropriate in this scenario
    • Example: the informativeness measure proposed in 1992
• The proposal, study, and characterization of alternative measures to recall and precision