Evaluation
INST 734
Module 5
Doug Oard
Agenda
Evaluation fundamentals
• Test collections: evaluating sets
• Test collections: evaluating rankings
• Interleaving
• User studies
IR as an Empirical Discipline
• Formulate a research question (the hypothesis)
• Design an experiment to answer the question
• Perform the experiment
– Compare with a baseline “control”
• Does the experiment answer the question?
– Are the results significant, or is it just luck? (see the sketch after this slide)
– Are the results important, or imperceptible?
• Report the results
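One common way to ask whether an improvement over the baseline is significant rather than luck is a paired test over per-query scores. The sketch below is a minimal paired randomization (sign-flip) test; the per-query scores are invented purely for illustration, and in practice they would come from an effectiveness measure such as average precision on a shared test collection.

```python
# Minimal sketch of a paired randomization (sign-flip) significance test,
# assuming one effectiveness score per query for each system.
# The score lists below are invented purely for illustration.
import random

baseline   = [0.21, 0.35, 0.10, 0.44, 0.28, 0.19, 0.50, 0.33]   # e.g., per-query AP
experiment = [0.25, 0.33, 0.18, 0.47, 0.30, 0.26, 0.52, 0.40]

diffs = [e - b for e, b in zip(experiment, baseline)]
observed = sum(diffs) / len(diffs)          # observed mean improvement

# Under the null hypothesis the sign of each per-query difference is arbitrary,
# so we repeatedly flip signs at random and count how often a mean difference
# at least as large as the observed one arises by chance.
trials = 10_000
at_least_as_large = 0
for _ in range(trials):
    permuted = [d if random.random() < 0.5 else -d for d in diffs]
    if abs(sum(permuted) / len(permuted)) >= abs(observed):
        at_least_as_large += 1

p_value = at_least_as_large / trials
print(f"mean improvement = {observed:.3f}, p ≈ {p_value:.3f}")
```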
Types of Evaluation
• Intrinsic
– Does it do what we want?
• Extrinsic
– Does it do what we need?
• Formative
– Provide a basis for system development
• Summative
– Determine whether objectives were met
Experiment Design Examples
• Can morphology improve effectiveness?
– Does stemming beat an unstemmed baseline? (see the sketch after these examples)
• Does query expansion improve effectiveness?
– Does synonym expansion beat an unexpanded baseline?
• Does highlighting help users evaluate utility?
– Build two interfaces, one with highlighting, one without
– Ask users which one they prefer and why
• Is letting users weight query terms a good idea?
– Build two systems, one with weighting, one without
– Measure which yields more relevant docs in 10 minutes
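For the first example above (stemming vs. an unstemmed baseline), a toy system-centered harness might look like the sketch below. The tiny collection, query, relevance judgments, and crude suffix-stripping "stemmer" are all invented for illustration; the point is that both variants run over exactly the same queries and judgments, so the comparison is paired.

```python
# Toy sketch of the stemming experiment: the same query is run over the same
# tiny collection with and without a crude suffix-stripping "stemmer", and the
# number of relevant documents in the top 2 is compared. All documents, the
# query, and the relevance judgments are invented for illustration.

docs = {
    "d1": "stemmed term conflation",
    "d2": "stemming improves term matching",
    "d3": "query logs from web users",
}
query = "stemming query terms"
relevant = {"d1", "d2"}                      # invented relevance judgments

def crude_stem(word):
    # Not a real stemmer; strips a few common suffixes for illustration only.
    for suffix in ("ing", "ed", "tion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def rank(query, docs, stem):
    q_terms = {crude_stem(w) if stem else w for w in query.split()}
    scores = {}
    for doc_id, text in docs.items():
        d_terms = {crude_stem(w) if stem else w for w in text.split()}
        scores[doc_id] = len(q_terms & d_terms)   # simple term-overlap score
    return sorted(scores, key=scores.get, reverse=True)

for stem in (False, True):
    ranking = rank(query, docs, stem)
    hits = len(set(ranking[:2]) & relevant)
    print(f"stemming={stem}: ranking={ranking}, relevant docs in top 2 = {hits}")
```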
Evaluation Criteria
• Effectiveness
– System-only
– Human + system
• Efficiency
– Retrieval time, indexing time, index size, …
• Usability
– Learnability, novice use, expert use, …
IR Effectiveness Evaluation
• User-centered strategy
– Given several users, and at least 2 retrieval systems
– Have each user try the same task on both systems
– Measure which system works the “best”
• System-centered strategy
– Given documents, queries, and relevance judgments
– Try several variations on the retrieval system
– Measure which ranks more good docs near the top (see the sketch below)
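A minimal sketch of the system-centered strategy follows, assuming fixed queries, fixed relevance judgments, and ranked output from two system variants; precision at rank 5 is used here as one simple way to ask which system ranks more good docs near the top. All document IDs, rankings, and judgments below are invented.

```python
# Minimal sketch of the system-centered strategy: fixed queries, fixed
# relevance judgments, and two system variants compared by precision at
# rank 5. All document IDs, rankings, and judgments are invented.

qrels = {                                   # query -> set of relevant doc IDs
    "q1": {"d2", "d5", "d9"},
    "q2": {"d1", "d4"},
}

runs = {                                    # system -> query -> ranked doc IDs
    "system_A": {"q1": ["d2", "d7", "d5", "d3", "d9"],
                 "q2": ["d4", "d8", "d1", "d6", "d2"]},
    "system_B": {"q1": ["d7", "d3", "d2", "d8", "d6"],
                 "q2": ["d8", "d4", "d6", "d1", "d5"]},
}

def precision_at_k(ranking, relevant, k=5):
    # Fraction of the top-k retrieved documents that are judged relevant.
    return sum(1 for d in ranking[:k] if d in relevant) / k

for system, results in runs.items():
    scores = [precision_at_k(results[q], qrels[q]) for q in qrels]
    print(f"{system}: mean P@5 = {sum(scores) / len(scores):.2f}")
```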
Good Measures of Effectiveness
• Capture some aspect of what the user wants
• Have predictive value for other situations
– Different queries, different document collection
• Easily replicated by other researchers
• Easily compared
– Optimally, expressed as a single number (see the sketch below)
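One widely used single-number measure for ranked retrieval is average precision (its mean over a query set is mean average precision). The sketch below computes it for one invented ranking and judgment set; it is a generic illustration of the "single number" idea, not necessarily the specific measure this module settles on.

```python
# Sketch of average precision: precision is computed at the rank of each
# relevant document that is retrieved, and those values are averaged over the
# total number of relevant documents. Ranking and judgments are invented.

def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)       # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

ranking = ["d3", "d1", "d7", "d4", "d2"]         # invented system output
relevant = {"d1", "d2", "d6"}                    # invented judgments
print(f"AP = {average_precision(ranking, relevant):.3f}")
# d1 is found at rank 2 (precision 1/2), d2 at rank 5 (precision 2/5), and d6
# is missed, so AP = (0.5 + 0.4) / 3 ≈ 0.300
```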
Agenda
• Evaluation fundamentals
Test collections: evaluating sets
• Test collections: evaluating rankings
• Interleaving
• User studies