Evaluation INST 734 Module 5 Doug Oard

Page 1:

Evaluation

INST 734

Module 5

Doug Oard

Page 2:

Agenda

Evaluation fundamentals

• Test collections: evaluating sets

• Test collections: evaluating rankings

• Interleaving

• User studies

Page 3:

IR as an Empirical Discipline

• Formulate a research question (the hypothesis)
• Design an experiment to answer the question
• Perform the experiment

– Compare with a baseline “control”

• Does the experiment answer the question?
– Are the results significant? Or is it just luck? (see the significance-test sketch after this list)
– Are the results important, or imperceptible?

• Report the results
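
The “significant, or just luck?” question is usually answered with a paired significance test over per-query scores. A minimal sketch in Python, assuming we already have one effectiveness score per query for a baseline run and an experimental run (all numbers below are hypothetical):

    from scipy import stats

    # Hypothetical per-query effectiveness scores over the same eight queries.
    baseline   = [0.42, 0.31, 0.58, 0.10, 0.47, 0.66, 0.25, 0.39]
    experiment = [0.45, 0.30, 0.61, 0.18, 0.50, 0.71, 0.24, 0.44]

    # Paired t-test: each query contributes one (baseline, experiment) pair.
    t_stat, p_value = stats.ttest_rel(experiment, baseline)
    mean_gain = sum(e - b for e, b in zip(experiment, baseline)) / len(baseline)

    print(f"mean per-query gain = {mean_gain:.3f}")
    print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
    # A small p (e.g., < 0.05) suggests the difference is unlikely to be luck alone;
    # whether the difference is *important* to users is a separate judgment.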

Page 4:

Types of Evaluation

• Intrinsic
– Does it do what we want?

• Extrinsic
– Does it do what we need?

• Formative
– Provide a basis for system development

• Summative
– Determine whether objectives were met

Page 5:

Experiment Design Examples

• Can morphology improve effectiveness?

– Does stemming beat an unstemmed baseline?

• Does query expansion improve effectiveness?
– Does synonym expansion beat an unexpanded baseline?

• Does highlighting help users evaluate utility?
– Build two interfaces, one with highlighting, one without
– Ask users which one they prefer and why (see the preference sketch after this list)

• Is letting users weight query terms a good idea?
– Build two systems, one with weighting, one without
– Measure which yields more relevant docs in 10 minutes
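
For the highlighting question above, the “which one do they prefer” data can be checked with a simple sign test. A minimal sketch, assuming a hypothetical study in which 20 users state a preference and 15 choose the highlighted interface:

    from math import comb

    n, k = 20, 15   # hypothetical: 20 users judged, 15 preferred highlighting

    # Two-sided sign test: probability of a split at least this lopsided
    # if users were really indifferent (preference probability 0.5).
    p_value = sum(comb(n, i) for i in range(n + 1)
                  if abs(i - n / 2) >= abs(k - n / 2)) / 2 ** n

    print(f"{k}/{n} prefer highlighting, two-sided sign test p = {p_value:.3f}")
    # A small p-value argues against chance; the "why" answers still matter
    # for interpreting what the preference means.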

Page 6:

Evaluation Criteria

• Effectiveness
– System-only
– Human + system

• Efficiency
– Retrieval time, indexing time, index size, … (see the timing sketch after this list)

• Usability
– Learnability, novice use, expert use, …
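
A minimal sketch of how the efficiency criteria might be measured, assuming a hypothetical search_engine object with index(docs) and search(query) methods (nothing here is a real library API):

    import os
    import time

    def measure_efficiency(search_engine, docs, queries, index_path):
        t0 = time.perf_counter()
        search_engine.index(docs)                 # hypothetical indexing call
        indexing_time = time.perf_counter() - t0

        t0 = time.perf_counter()
        for q in queries:
            search_engine.search(q)               # hypothetical retrieval call
        retrieval_time = (time.perf_counter() - t0) / len(queries)

        index_size = os.path.getsize(index_path)  # index size in bytes on disk
        return indexing_time, retrieval_time, index_size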

Page 7:

IR Effectiveness Evaluation

• User-centered strategy
– Given several users, and at least 2 retrieval systems
– Have each user try the same task on both systems
– Measure which system works the “best”

• System-centered strategy
– Given documents, queries, and relevance judgments
– Try several variations on the retrieval system
– Measure which ranks more good docs near the top (sketched below)
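
A minimal sketch of the system-centered strategy, assuming we have ranked result lists from two system variants and the set of judged-relevant document ids for one query (all ids below are hypothetical):

    def precision_at_k(ranking, relevant, k=10):
        """Fraction of the top k retrieved documents that are judged relevant."""
        return sum(1 for doc in ranking[:k] if doc in relevant) / k

    relevant = {"d3", "d7", "d11", "d20"}   # relevance judgments for one query
    run_a = ["d3", "d5", "d7", "d9", "d11", "d2", "d20", "d8", "d1", "d4"]
    run_b = ["d5", "d9", "d3", "d2", "d8", "d7", "d1", "d4", "d6", "d10"]

    print("System A P@10 =", precision_at_k(run_a, relevant))   # 0.4
    print("System B P@10 =", precision_at_k(run_b, relevant))   # 0.2
    # System A puts more of the judged-relevant documents near the top.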

Page 8:

Good Measures of Effectiveness

• Capture some aspect of what the user wants

• Have predictive value for other situations
– Different queries, different document collection

• Easily replicated by other researchers

• Easily compared
– Optimally, expressed as a single number (see the MAP sketch below)
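
One widely used single-number summary is mean average precision (MAP). A minimal sketch, assuming per-query rankings and relevance judgments (all data below is hypothetical):

    def average_precision(ranking, relevant):
        """Average of the precision values at the rank of each relevant document."""
        hits, precisions = 0, []
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0

    # One (system ranking, judged-relevant set) pair per query -- hypothetical.
    queries = [
        (["d1", "d4", "d2"], {"d1", "d2"}),
        (["d9", "d3", "d5"], {"d5"}),
    ]
    mean_ap = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
    print(f"MAP = {mean_ap:.3f}")   # one number that is easy to compare and replicate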

Page 9:

Agenda

• Evaluation fundamentals

Test collections: evaluating sets

• Test collections: evaluating rankings

• Interleaving

• User studies