
Page 1: DE Conferentie 2004 Pia Borlund

Pia Borlund

Department of Information Studies, Royal School of Library and Information Science, Denmark

[email protected]

The IIR evaluation model: a framework for evaluation of interactive information retrieval (IIR) systems

DEN conference, Arnhem, December 1-2 2004

Page 2: DE Conferentie 2004 Pia Borlund

Outline:

• Motivation for the development of the IIR evaluation model

• The IIR evaluation model

• Objective & parts of the IIR evaluation model

• Part 1: test components

• Part 2: recommendations

• Part 3: alternative performance measures

• Strengths and weaknesses of the IIR evaluation model

• How can the cultural heritage sector use the IIR evaluation model?


Page 3: DE Conferentie 2004 Pia Borlund

Objective of IR systems evaluation:

• The primary objective of IR systems evaluation is the measurement of effectiveness

…effectiveness is about how good the system is at retrieving all the relevant documents while, at the same time, retrieving as few non-relevant documents as possible

…effectiveness is a measure of how well the system performs – hence, performance measures


Page 4: DE Conferentie 2004 Pia Borlund

Simple illustration of the parts/agents involved in the IR process:

[Figure: the user with an information need puts a request (R) to an intermediary (a person or an interface), which in turn puts a query (Q) to the IR system of documents/representations.]

The objective of the IR process is to obtain an appropriate match (harmony) between the involved parts/agents with respect to the satisfaction of the user's information need

Main approaches to IR research, and IR systems evaluation:

• System-driven approach to IR
• User-oriented approach to IR
• Cognitive viewpoint


Page 5: DE Conferentie 2004 Pia Borlund

The Cranfield model:

The Cranfield model derives directly from the Cranfield II experiment:

‘principle of test collections’ (a collection of documents; a collection of queries; and a collection of relevance assessments)

recall/precision

Test characteristics:

• Controlled laboratory test, no user participation

• (Requests)/queries = information need

• Batch mode (1 search run) = static information need and relevance

• Objective, topical, and binary relevance

• Experimental control


Page 6: DE Conferentie 2004 Pia Borlund

System-driven vs. user-oriented approach:

“The conflict between laboratory and operational experiments is essentially a conflict between, on the one hand, control over experimental variables, observability, and repeatability, and on the other hand, realism.”

(Robertson & Hancock-Beaulieu, 1992, p. 460)


Page 7: DE Conferentie 2004 Pia Borlund

Motivation for the IIR model:

Facts:
• IR systems have become more interactive!!
• Criticisms against the conventional methods, e.g.:
  • Real end-users are rarely involved
  • The information need is assumed static throughout the experiment
  • Binary and topical relevance types are used

Solution:
1) The application of simulated work task situations: realistic information searching & retrieval processes + experimental control
2) To measure performance by use of non-binary based performance measures: realistic assessment behaviour + indication of users' subjective impression of system performance and satisfaction of the information need


Page 8: DE Conferentie 2004 Pia Borlund

Outline:

Motivation for the development of the IIR evaluation model

• The IIR evaluation model

• Objective & parts of the IIR evaluation model

• Part 1: test components

• Part 2: recommendations

• Part 3: alternative performance measures

• Strengths and weaknesses of the IIR evaluation model

• How can the cultural heritage sector use the IIR evaluation model?


Page 9: DE Conferentie 2004 Pia Borlund

Objective of the IIR evaluation model:

The aim of the IIR evaluation model is two-fold:

1) To facilitate evaluation of IIR systems as realistically as possible with reference to actual information searching and retrieval processes, though still in a relatively controlled evaluation environment; and

2) To calculate the IIR system performance taking into account the non-binary nature of the assigned relevance assessments and respecting the different types of relevance


Page 10: DE Conferentie 2004 Pia Borlund

Parts of the IIR evaluation model:

The model consists of 3 parts:

1) A set of test components which aims at ensuring a functional, valid, and realistic setting for the evaluation of IIR systems

2) Empirically based recommendations for the application of the proposed sub-component, the concept of simulated work task situations; and

3) Alternative performance measures, e.g.:

• The measure of Relative Relevance (RR)
• The performance indicator of Ranked Half-Life (RHL)
• The measure of Cumulative Gain (CG)
• The measure of Cumulative Gain with Discount (DCG)
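
To make the last two measures concrete, here is a minimal sketch (an editorial illustration, not part of the slides) of Cumulative Gain and Cumulative Gain with Discount as defined by Järvelin and Kekäläinen (2000; see the references). Both operate on the graded, non-binary relevance scores of the ranked result list; the discounted variant divides each gain from rank b onwards by log_b of the rank, so relevant documents found late add less value.

    # Minimal sketch of CG and DCG following Järvelin & Kekäläinen (2000).
    # 'gains' holds graded, non-binary relevance scores for the ranked list,
    # e.g. 0 (non-relevant) up to 3 (highly relevant).
    import math

    def cumulative_gain(gains):
        # CG at rank i is the sum of the gains at ranks 1..i.
        cg, total = [], 0
        for g in gains:
            total += g
            cg.append(total)
        return cg

    def discounted_cumulative_gain(gains, b=2):
        # From rank b onwards the gain at rank i is divided by log_b(i),
        # so relevant documents retrieved late contribute less.
        dcg, total = [], 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i < b else g / math.log(i, b)
            dcg.append(total)
        return dcg

    gains = [3, 2, 3, 0, 1]                    # graded assessments, ranks 1-5
    print(cumulative_gain(gains))              # [3, 5, 8, 8, 9]
    print(discounted_cumulative_gain(gains))   # [3.0, 5.0, 6.89, 6.89, 7.32] (rounded)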


Page 11: DE Conferentie 2004 Pia Borlund

Outline:

Motivation for the development of the IIR evaluation model

• The IIR evaluation model

Objective & parts of the IIR evaluation model

• Part 1: test components

• Part 2: recommendations

• Part 3: alternative performance measures

• Strengths and weaknesses of the IIR evaluation model

• How can the cultural heritage sector use the IIR evaluation model?


Page 12: DE Conferentie 2004 Pia Borlund

Part 1: test components:

• The involvement of potential users as test persons

• The application of individual and potentially dynamic information need interpretations deriving from, e.g., simulated work task situations; and

• The assignment of multidimensional and dynamic relevance judgements


Page 13: DE Conferentie 2004 Pia Borlund

Part 2: recommendations:

… for the application of simulated work task situations:

• To employ both simulated and real information needs within the same test

• To tailor the simulated work task situations toward the test persons with reference to:
  • a situation the test persons can relate to easily and with which they can identify themselves;
  • a situation that the test persons find topically interesting; and
  • a situation that provides enough imaginative context for the test persons to be able to apply the situation

• To permute the order of search jobs between the test persons (see the sketch below)

• To pilot test prior to actual testing
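
As a small illustration of the permutation recommendation, the sketch below rotates the order of search jobs across test persons (a simple Latin-square style design) so that no single task is always searched first. The task labels are hypothetical examples, not taken from Borlund's test design.

    # Illustrative sketch: rotating the order of search jobs across test
    # persons so order effects do not systematically favour one task.
    def rotated_orders(tasks, n_persons):
        k = len(tasks)
        return [[tasks[(p + i) % k] for i in range(k)] for p in range(n_persons)]

    tasks = ["sim A", "sim B", "sim C", "real need"]
    for person, order in enumerate(rotated_orders(tasks, 4), start=1):
        print(f"test person {person}: {order}")
    # test person 1: ['sim A', 'sim B', 'sim C', 'real need']
    # test person 2: ['sim B', 'sim C', 'real need', 'sim A']
    # ...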


Page 14: DE Conferentie 2004 Pia Borlund

Part 3: alternative performance measures:

… motivation:

Evaluation of IIR systems – the involvement of human beings – entails:
1) A mixture of different types of objective and subjective relevance assessments
2) The assignment of non-binary relevance assessments
3) Scattered distributions of user-generated relevance assessments

Relative Relevance (RR) measure: an associative measure of agreement between types of relevance

Ranked Half-Life (RHL) indicator: a positional indicator of ranked retrieval results

… The proposed measures are to supplement, not substitute, recall and precision
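
For illustration only: the slide describes RHL as a positional indicator, and the sketch below interprets it as the rank by which half of the total relevance mass of the ranked list has been accumulated, so a lower value means relevant documents sit nearer the top. This reading is an assumption made here for illustration; the exact formula is given in Borlund (2000).

    # Hedged sketch of the Ranked Half-Life idea: the rank position by which
    # half of the total (graded) relevance mass has been accumulated. This is
    # an interpretation for illustration; see Borlund (2000) for the formula.
    def ranked_half_life(scores):
        half = sum(scores) / 2.0
        cumulated = 0.0
        for rank, score in enumerate(scores, start=1):
            cumulated += score
            if cumulated >= half:
                return rank
        return len(scores)

    # Two rankings of the same documents: same relevance mass, different RHL.
    print(ranked_half_life([3, 2, 1, 0, 0]))  # 1 -> relevant documents on top
    print(ranked_half_life([0, 0, 1, 2, 3]))  # 4 -> relevant documents at the bottom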


Page 15: DE Conferentie 2004 Pia Borlund

Strengths and weaknesses of the IIR evaluation model:

Strengths:
• Realism
• IR + searching behaviour
• Real + simulated information needs
• Subjective, non-binary, potentially dynamic relevance
• Alternative performance measures + recall/precision
• Experimental control
• Repeatable, but not necessarily with identical results

Weaknesses:
• Resource demanding (manpower + time)
• Requires domain knowledge
• Requires design and test of simulated work task situations
• Lack of comparability of performance measure results – due to subjective assessments


Page 16: DE Conferentie 2004 Pia Borlund

Outline:

Motivation for the development of the IIR evaluation model

The IIR evaluation model

Objective & parts of the IIR evaluation model

Part 1: test components

Part 2: recommendations

Part 3: alternative performance measures

Strengths and weaknesses of the IIR evaluation model

• How can the cultural heritage sector use the IIR evaluation model?


Page 17: DE Conferentie 2004 Pia Borlund

How can the cultural heritage sector use the IIR evaluation model?

• Investigation of information seeking and searching behaviour in the cultural heritage sector by use of simulated work task situations

• Specification of requirements for new systems by use of simulated work task situations based on information needs and information seeking/searching behaviour

• Performance and/or usability tests of existing systems – the complete model, or parts of the model


Page 18: DE Conferentie 2004 Pia Borlund

Thank you !!


Page 19: DE Conferentie 2004 Pia Borlund

References:

Borlund, P. (2000). Evaluation of interactive information retrieval systems. Åbo: Åbo Akademi University Press. Doctoral thesis.

Cleverdon, C.W. and Keen, E.M. (1966). Aslib Cranfield research project: factors determining the performance of indexing systems. Vol. 2: results. Cranfield.

Cleverdon, C.W., Mills, J. and Keen, E.M. (1966). Aslib Cranfield research project: factors determining the performance of indexing systems. Vol. 1: design. Cranfield.

Järvelin, K. and Kekäläinen, J. (2000). IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval. Athens, Greece, 2000. Pp. 41-48. New York, N.Y.: ACM Press.

Robertson, S.E. and Hancock-Beaulieu, M.M. (1992). On the evaluation of IR systems. In: Information Processing & Management, 28 (4), pp. 457-466.


Page 20: DE Conferentie 2004 Pia Borlund

Contingency table – performance measures:

                   Relevant       Non-relevant
    Retrieved      a (Hits)       b (Noise)       a + b
    Not retrieved  c (Misses)     d (Rejected)    c + d
                   a + c          b + d           N = a + b + c + d (total collection)

Recall    = a / (a + c) = relevant documents retrieved / relevant documents in collection
Precision = a / (a + b) = relevant documents retrieved / documents retrieved
Fallout   = b / (b + d) = non-relevant documents retrieved / non-relevant documents in collection
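
A minimal worked example of the three formulas, using the cell labels from the table above; the counts are hypothetical.

    # Recall, precision and fallout from the contingency table cells
    # a (hits), b (noise), c (misses), d (rejected), as defined above.
    def recall(a, c):
        return a / (a + c)      # relevant retrieved / relevant in collection

    def precision(a, b):
        return a / (a + b)      # relevant retrieved / documents retrieved

    def fallout(b, d):
        return b / (b + d)      # non-relevant retrieved / non-relevant in collection

    # Hypothetical counts: 40 hits, 10 noise, 20 misses, 930 rejected (N = 1000).
    a, b, c, d = 40, 10, 20, 930
    print(recall(a, c))     # 0.666...
    print(precision(a, b))  # 0.8
    print(fallout(b, d))    # 0.0106...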


Page 21: DE Conferentie 2004 Pia Borlund

Definition of simulated work task situation:

A short ‘cover story’ that describes a situation which leads to IR

… Serves 2 main functions, it:
1) Triggers the simulated information need
2) Is the platform against which situational relevance is assessed

… More specifically, it describes:
• The source of the information need
• The environment of the situation
• The problem which has to be solved; and
• Serves to make the test person understand the objective of the search

… Further, by being the same for all the test persons, experimental control is provided

… As such the concept of simulated work task situations ensures the experiment both realism and control


Page 22: DE Conferentie 2004 Pia Borlund

Simulated situation: sim A

Simulated work task situation: After your graduation you will be looking for a job in industry. You want information to help you focus your future job seeking. You know it pays to know the market. You would like to find some information about employment patterns in industry and what kind of qualifications employers will be looking for from future employees.

Indicative request: Find, for instance, something about future employment trends in industry, i.e. areas of growth and decline.

Example of simulated situation / simulated work task situation

(Source: Borlund, 2000)
