TRANSCRIPT
October 2008 IIiX, London 1
The study of information retrieval – a long view
Stephen Robertson
Microsoft Research Cambridge and City University, London
A half-century of lab experiments
Cranfield began in 1958
– some precursor experiments, but we can treat that as the start of the experimental tradition in IR
A brief timeline:
– 1960s & 70s: various experiments, mostly with purpose-built test collections
– late 60s on: exchange of test collections among researchers
– mid-to-late seventies: the ‘ideal’ test collection project
– 1981: The Book (Information Retrieval Experiment, KSJ)
– 1980s: relatively fallow period
– 1990s to date: TREC
– late 90s on: TREC spin-offs (CLEF, NTCIR, INEX etc.)
– (and of course, late 90s on: web search engines)
Some highlights (personal selection)
• Cranfield 1 and 2
• SMART: VSM
• Medlars: indexing and searching
• KSJ: term weighting; test collections
• Keen: index languages
• Belkin and Oddy: ASK and user models
• Okapi: simple search and feedback
• UMass: various experimental systems
• TREC: ad hoc; feedback; the web; interaction
• CLEF, NTCIR, INEX, DUC etc.
[S Robertson, On the history of evaluation in IR, Journal of Information Science, Vol. 34, No. 4, 439-456 (2008)]
A half-century of lab experiments
Recapitulation of outcome (a gross over-simplification!)
– Don’t worry too much about the NLP
– ... or the semantics
– ... or the knowledge engineering
– ... or the interaction issues
– ... or the user’s cognitive processes
– but pay attention to the statistics
– ... and to the ranking algorithms
– bag-of-words rules OK
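The bag-of-words outcome above can be sketched as a minimal TF-IDF ranker. This is a toy illustration, not any particular system from the timeline: the three documents and the query are invented, and a real engine would add tokenisation, stemming, and a stronger weighting scheme such as BM25.

```python
import math
from collections import Counter

# Toy corpus: each document is treated as a bag of words (order ignored).
docs = {
    "d1": "the cranfield tests measured index language performance",
    "d2": "term weighting improves ranking of retrieved documents",
    "d3": "user interaction and cognition in information retrieval",
}
bags = {d: Counter(text.split()) for d, text in docs.items()}
N = len(bags)

def idf(term):
    # Inverse document frequency: rarer terms weigh more (smoothed).
    n_t = sum(1 for bag in bags.values() if term in bag)
    return math.log((N + 1) / (n_t + 1))

def score(query, bag):
    # Bag-of-words score: sum of tf * idf over query terms.
    return sum(bag[t] * idf(t) for t in query.split())

query = "term weighting ranking"
ranking = sorted(bags, key=lambda d: score(query, bags[d]), reverse=True)
print(ranking)  # → ['d2', 'd1', 'd3']
```

Note that nothing here depends on NLP, semantics, or knowledge engineering: the ranking is driven entirely by term statistics, which is exactly the point of the recapitulation.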
A half-century...
... that deserves considerable celebration
– but of course has a downside
So, let’s explore a little
– why we do lab experiments in the first place
– what the alternatives are
– what they might or might not tell us
– what is good and bad about them
– which directions they lead us in
– more importantly, which they deflect us from
and maybe, finally,
– how they might be improved
Note: this is my personal take on these questions!
Abstraction
Lab experiments involve abstraction
– choice of variables included/excluded
– control on variables
– restrictions on values/ranges of variables
[Note: models and theories also involve abstraction
– but usually different abstractions, for different reasons]
Why?
– First, to make them possible
Abstraction
Why else?
– study simple cases
– clarify relationships
– reduce noise
– ensure repeatability
– validate abstract theories
Example: Newton’s laws
The scientific method (simple-minded outline!)
Collect empirical data
– by observation and/or experiment
Formulate hypotheses/models/theories
Derive testable predictions
– about events which may be studied empirically
Conduct further observation/experiment
– designed to test predictions
Refine/reject models/theories
– and reiterate
Observation versus experiment (simple-minded outline again!)
Observation                             | Experiment
----------------------------------------|------------------------------
Wait for something to happen            | Cause something to happen
Measure control variables               | Set control variables
Wait for range of values to be covered  | Choose values to cover range
Tolerate extraneous variables           | Eliminate extraneous variables
Wait some more                          | Repeat
The experimental approach is a very powerful one
Given a simple choice, we would usually choose experiment over observation
– at least for hypothesis testing
... but the choice is rarely simple
Traditional science
The traditional image of science involves experiments in laboratories
– but actually this is misleading
Some sciences thrive in the laboratory
– e.g. chemistry, small-scale physics
Others have made a transition
– e.g. the biochemical end of biology
Others still are almost completely resistant
– e.g. astrophysics, geology
(not to mention such non-traditional sciences as economics)
Limitations of abstraction
Abstractions involve assumptions
– choosing one variable and eliminating another assumes that the two can be treated separately
– if an abstraction is built into an experiment, then its assumptions cannot be tested by the experiment
Even if we could do everything in a laboratory, we should not all do the same thing!
– that is, we should not all use the same abstractions based on the same assumptions
Limitations of abstraction
Some phenomena resist abstraction
– so that an abstract representation would be unrealistic or even illusory
This gives us the basic conflict
– between control and realism
Note: I have exaggerated the polarity between observation and experiment
– most investigations have elements of both
... but I have not exaggerated the conflict
– most investigations struggle seriously with it
– and have to make compromises
Research in IR
A conventional separation:
– Laboratory experiments in the Cranfield/TREC tradition, usually on ranking algorithms
– Observational experiments addressing user-oriented issues
Of course this is over-simplified
– there are laboratory experiments addressing other issues
• semantics, language, etc.
• user interaction etc.
– as well as observational experiments on algorithms
Research in IR
The Cranfield/TREC tradition is richer than it is often given credit for
– TREC tracks and spin-offs have pushed the boundaries of lab experimentation, with some different outcomes
Some examples:
– QA: Here NLP and some aspects of semantics / knowledge engineering are critical
– Cross-lingual: Here we need resources constructed from comparable corpora
– The web: Here we are beginning to extract useful knowledge from usage data and resources such as Wikipedia
All of these are unconventional
– although all are dominated by statistical ideas
Research in IR
Communities involved in user-oriented issues have developed laboratory methods
– in interactive tasks within TREC-like projects
– in new forms of lab experiments
Some core IR algorithm work is moving into observational user experiments
– particularly in the web environment
– particularly using click (and other user behaviour) data
Observational IR research
Aspects that suggest an observational approach:
– interaction (human-system)
– collaboration (human-human)
– temporal scale
– user cognition
– context
• task context
• user knowledge
Observational IR research
Issues:
– scale
• it is hard to expand the scale of an observational study
– reproducibility
• it is hard to perform an observational study in such a way that it can be repeated by someone else
– control
• it is hard to control the variables that might affect an experiment (either the independent variables of interest, or the noise variables)
Observational IR research
Advantages:
– realism
• we have more confidence that the results of an observational study represent some kind of reality
– context
• those (perhaps unknown) aspects of context that have an effect can be assumed to be present
Maybe another significant difference...
Hypothesis testing
Back to the scientific method:
– need to formulate predictions as testable hypotheses
Properly, any prediction of a model or theory is a candidate for this
– the objective is to test the model or theory
• not to achieve some practical result from it
– ideally, look for critical cases
• where the predictions of the model in question differ from those of other models
IR models and theories
What are IR models designed to tell us?
Different kinds of models might be expected to explain/predict many observables
... but in the Cranfield/TREC tradition, we usually interpret them in a narrow way
– specifically, we look only for effects on effectiveness
This seems to be a limitation in our ways of thinking about them
Hypothesis testing
At least some user-oriented studies in IR ask other questions
– and try to develop appropriate models/theories
• e.g. about user behaviour
Obviously we are interested in making systems better...
– but a model or theory may (should) tell us more than just how to achieve that aim
– and indeed other predictions may also be useful
Even statistical models could be interpreted more broadly
Other predictions (maybe accessible to statistical models)
Patterns of term occurrence
– maybe simply not believable
Calibrated probabilities of relevance
– hard to do but maybe useful
Clicks
– probability of click
– patterns of click behaviour
• e.g. click trails
Other behaviours
– abandonment
– reformulation
– dwell time
Probabilities of relevance
Usual assumption:
– do not need actual probabilities, only rank order
• the result of focussing on standard evaluation metrics
– independence models are typically bad at giving calibrated probabilities
Cooper suggested systems should give probabilities
– as a guide to the user
There are other practical reasons
– filtering
– combination of evidence
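One standard route from rank-only scores to calibrated probabilities is Platt scaling: fit a logistic curve mapping raw retrieval scores to probabilities of relevance. A minimal sketch follows; the scores and relevance judgements are invented for illustration, and a real study would fit on held-out data and check calibration (e.g. with a reliability diagram).

```python
import math

# Invented toy data: raw system scores with binary relevance judgements.
scores = [0.2, 0.4, 0.6, 0.9, 1.3, 1.7, 2.1, 2.5]
relevant = [0, 0, 0, 1, 0, 1, 1, 1]

# Platt scaling: fit p(rel | s) = 1 / (1 + exp(-(a*s + b)))
# by gradient descent on the log-loss (convex, so this converges).
a, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    grad_a = grad_b = 0.0
    for s, y in zip(scores, relevant):
        p = 1.0 / (1.0 + math.exp(-(a * s + b)))
        grad_a += (p - y) * s
        grad_b += (p - y)
    a -= lr * grad_a
    b -= lr * grad_b

def prob_relevant(s):
    # An actual probability of relevance, not just a rank position.
    return 1.0 / (1.0 + math.exp(-(a * s + b)))

print(round(prob_relevant(2.0), 2))
```

The output is exactly what filtering and combination-of-evidence need: a number on a common probability scale, rather than a score that is only comparable within one ranked list.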
Clicks
There is a new movement in statistical modelling for IR:
– we would like to integrate aspects of user behaviour into our models
– specifically clicks
Predicting patterns of click behaviour is a major component
– which gives us the impetus to investigate and test other kinds of hypothesis
Might use clicks to justify effectiveness metrics
– but such predictions may also be useful for other reasons
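One simple example of such a statistical model of click behaviour is the cascade model: the user scans results top-down, clicks at most once, and stops at the first attractive result. A minimal sketch, with invented attractiveness values:

```python
# Cascade click model: the user examines results top-down and stops at
# the first click, so P(click at rank i) = a_i * prod_{j<i} (1 - a_j),
# where a_j is the attractiveness of the document at rank j.
def cascade_click_probs(attractiveness):
    probs = []
    examined = 1.0  # probability the user reaches this rank
    for a in attractiveness:
        probs.append(examined * a)
        examined *= (1.0 - a)
    return probs

# Invented attractiveness values for a ranked list of four results.
attr = [0.6, 0.3, 0.5, 0.2]
probs = cascade_click_probs(attr)
print([round(p, 3) for p in probs])  # → [0.6, 0.12, 0.14, 0.028]
```

A model like this makes predictions that can be tested directly against observed click logs, which is precisely the kind of non-effectiveness prediction argued for above.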
In general
It seems to me that we should be trying to move in this direction
– Constructing models or theories which are capable of making other kinds of predictions
– Devising tests of these other predictions
• Laboratory tests
• Observational tests
… which would encourage rapprochement between the laboratory and observational traditions
Finally
I strongly believe in the science of search
– as a theoretical science
• in which models and theories have a major role to play
– and as an empirical science
• requiring the full range of empirical investigations
• including, specifically, both laboratory experiments and observational studies
The lack of a strong unified theory of IR reinforces the need for good empirical work