Evaluation
INST 734
Module 5
Doug Oard
Agenda
Evaluation fundamentals
• Test collections: evaluating sets
• Test collections: evaluating rankings
• Interleaving
• User studies
IR as an Empirical Discipline
• Formulate a research question (the hypothesis)
• Design an experiment to answer the question
• Perform the experiment
– Compare with a baseline “control”
• Does the experiment answer the question?
– Are the results significant, or is it just luck? (see the sketch after this slide)
– Are the results important, or imperceptible?
• Report the results
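One common way to ask whether an improvement over the baseline is significant rather than luck is a paired test over per-query scores. The sketch below is a minimal paired randomization (sign-flip) test; the per-query scores are invented purely for illustration, and in practice they would come from an effectiveness measure such as average precision on a shared test collection.

```python
# Minimal sketch of a paired randomization (sign-flip) significance test,
# assuming one effectiveness score per query for each system.
# The score lists below are invented purely for illustration.
import random

baseline   = [0.21, 0.35, 0.10, 0.44, 0.28, 0.19, 0.50, 0.33]   # e.g., per-query AP
experiment = [0.25, 0.33, 0.18, 0.47, 0.30, 0.26, 0.52, 0.40]

diffs = [e - b for e, b in zip(experiment, baseline)]
observed = sum(diffs) / len(diffs)          # observed mean improvement

# Under the null hypothesis the sign of each per-query difference is arbitrary,
# so we repeatedly flip signs at random and count how often a mean difference
# at least as large as the observed one arises by chance.
trials = 10_000
at_least_as_large = 0
for _ in range(trials):
    permuted = [d if random.random() < 0.5 else -d for d in diffs]
    if abs(sum(permuted) / len(permuted)) >= abs(observed):
        at_least_as_large += 1

p_value = at_least_as_large / trials
print(f"mean improvement = {observed:.3f}, p ≈ {p_value:.3f}")
```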
Types of Evaluation
• Intrinsic
– Does it do what we want?
• Extrinsic
– Does it do what we need?
• Formative
– Provide a basis for system development
• Summative
– Determine whether objectives were met
Experiment Design Examples
• Can morphology improve effectiveness?
– Does stemming beat an unstemmed baseline? (see the sketch after these examples)
• Does query expansion improve effectiveness?
– Does synonym expansion beat an unexpanded baseline?
• Does highlighting help users evaluate utility?
– Build two interfaces, one with highlighting, one without
– Ask users which one they prefer and why
• Is letting users weight query terms a good idea?
– Build two systems, one with weighting, one without
– Measure which yields more relevant docs in 10 minutes
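For the first example above (stemming vs. an unstemmed baseline), a toy system-centered harness might look like the sketch below. The tiny collection, query, relevance judgments, and crude suffix-stripping "stemmer" are all invented for illustration; the point is that both variants run over exactly the same queries and judgments, so the comparison is paired.

```python
# Toy sketch of the stemming experiment: the same query is run over the same
# tiny collection with and without a crude suffix-stripping "stemmer", and the
# number of relevant documents in the top 2 is compared. All documents, the
# query, and the relevance judgments are invented for illustration.

docs = {
    "d1": "stemmed term conflation",
    "d2": "stemming improves term matching",
    "d3": "query logs from web users",
}
query = "stemming query terms"
relevant = {"d1", "d2"}                      # invented relevance judgments

def crude_stem(word):
    # Not a real stemmer; strips a few common suffixes for illustration only.
    for suffix in ("ing", "ed", "tion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def rank(query, docs, stem):
    q_terms = {crude_stem(w) if stem else w for w in query.split()}
    scores = {}
    for doc_id, text in docs.items():
        d_terms = {crude_stem(w) if stem else w for w in text.split()}
        scores[doc_id] = len(q_terms & d_terms)   # simple term-overlap score
    return sorted(scores, key=scores.get, reverse=True)

for stem in (False, True):
    ranking = rank(query, docs, stem)
    hits = len(set(ranking[:2]) & relevant)
    print(f"stemming={stem}: ranking={ranking}, relevant docs in top 2 = {hits}")
```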
Evaluation Criteria
• Effectiveness
– System-only
– Human + system
• Efficiency
– Retrieval time, indexing time, index size, …
• Usability
– Learnability, novice use, expert use, …
IR Effectiveness Evaluation
• User-centered strategy
– Given several users, and at least 2 retrieval systems
– Have each user try the same task on both systems
– Measure which system works the “best”
• System-centered strategy
– Given documents, queries, and relevance judgments
– Try several variations on the retrieval system
– Measure which ranks more good docs near the top (see the sketch below)
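A minimal sketch of the system-centered strategy follows, assuming fixed queries, fixed relevance judgments, and ranked output from two system variants; precision at rank 5 is used here as one simple way to ask which system ranks more good docs near the top. All document IDs, rankings, and judgments below are invented.

```python
# Minimal sketch of the system-centered strategy: fixed queries, fixed
# relevance judgments, and two system variants compared by precision at
# rank 5. All document IDs, rankings, and judgments are invented.

qrels = {                                   # query -> set of relevant doc IDs
    "q1": {"d2", "d5", "d9"},
    "q2": {"d1", "d4"},
}

runs = {                                    # system -> query -> ranked doc IDs
    "system_A": {"q1": ["d2", "d7", "d5", "d3", "d9"],
                 "q2": ["d4", "d8", "d1", "d6", "d2"]},
    "system_B": {"q1": ["d7", "d3", "d2", "d8", "d6"],
                 "q2": ["d8", "d4", "d6", "d1", "d5"]},
}

def precision_at_k(ranking, relevant, k=5):
    # Fraction of the top-k retrieved documents that are judged relevant.
    return sum(1 for d in ranking[:k] if d in relevant) / k

for system, results in runs.items():
    scores = [precision_at_k(results[q], qrels[q]) for q in qrels]
    print(f"{system}: mean P@5 = {sum(scores) / len(scores):.2f}")
```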
Good Measures of Effectiveness
• Capture some aspect of what the user wants
• Have predictive value for other situations
– Different queries, different document collection
• Easily replicated by other researchers
• Easily compared
– Optimally, expressed as a single number (see the sketch below)
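One widely used single-number measure for ranked retrieval is average precision (its mean over a query set is mean average precision). The sketch below computes it for one invented ranking and judgment set; it is a generic illustration of the "single number" idea, not necessarily the specific measure this module settles on.

```python
# Sketch of average precision: precision is computed at the rank of each
# relevant document that is retrieved, and those values are averaged over the
# total number of relevant documents. Ranking and judgments are invented.

def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)       # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

ranking = ["d3", "d1", "d7", "d4", "d2"]         # invented system output
relevant = {"d1", "d2", "d6"}                    # invented judgments
print(f"AP = {average_precision(ranking, relevant):.3f}")
# d1 is found at rank 2 (precision 1/2), d2 at rank 5 (precision 2/5), and d6
# is missed, so AP = (0.5 + 0.4) / 3 ≈ 0.300
```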
Agenda
• Evaluation fundamentals
Test collections: evaluating sets
• Test collections: evaluating rankings
• Interleaving
• User studies