TRANSCRIPT
October 2008 IIiX, London 1
The study of information retrieval – a long view
Stephen Robertson
Microsoft Research Cambridge and City University, London
A half-century of lab experiments
Cranfield began in 1958
– some precursor experiments, but we can treat that as the start of the experimental tradition in IR
A brief timeline:
– 1960s & 70s: various experiments, mostly with purpose-built test collections
– late 60s on: exchange of test collections among researchers
– mid-to-late seventies: the ‘ideal’ test collection project
– 1981: The Book (Information Retrieval Experiment, KSJ)
– 1980s: relatively fallow period
– 1990s to date: TREC
– late 90s on: TREC spin-offs (CLEF, NTCIR, INEX etc.)
– (and of course, late 90s on: web search engines)
Some highlights (personal selection)
• Cranfield 1 and 2
• SMART: VSM
• Medlars: indexing and searching
• KSJ: term weighting; test collections
• Keen: index languages
• Belkin and Oddy: ASK and user models
• Okapi: simple search and feedback
• UMass: various experimental systems
• TREC: ad hoc; feedback; the web; interaction
• CLEF, NTCIR, INEX, DUC etc.
[S Robertson, On the history of evaluation in IR, Journal of Information Science, Vol. 34, No. 4, 439-456 (2008)]
A half-century of lab experiments
Recapitulation of outcome (a gross over-simplification!)
– Don’t worry too much about the NLP
– ... or the semantics
– ... or the knowledge engineering
– ... or the interaction issues
– ... or the user’s cognitive processes
– but pay attention to the statistics
– ... and to the ranking algorithms
– bag-of-words rules OK
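The bag-of-words outcome above can be sketched as a minimal TF-IDF ranker. This is a toy illustration, not any particular system from the timeline: the three documents and the query are invented, and a real engine would add tokenisation, stemming, and a stronger weighting scheme such as BM25.

```python
import math
from collections import Counter

# Toy corpus: each document is treated as a bag of words (order ignored).
docs = {
    "d1": "the cranfield tests measured index language performance",
    "d2": "term weighting improves ranking of retrieved documents",
    "d3": "user interaction and cognition in information retrieval",
}
bags = {d: Counter(text.split()) for d, text in docs.items()}
N = len(bags)

def idf(term):
    # Inverse document frequency: rarer terms weigh more (smoothed).
    n_t = sum(1 for bag in bags.values() if term in bag)
    return math.log((N + 1) / (n_t + 1))

def score(query, bag):
    # Bag-of-words score: sum of tf * idf over query terms.
    return sum(bag[t] * idf(t) for t in query.split())

query = "term weighting ranking"
ranking = sorted(bags, key=lambda d: score(query, bags[d]), reverse=True)
print(ranking)  # → ['d2', 'd1', 'd3']
```

Note that nothing here depends on NLP, semantics, or knowledge engineering: the ranking is driven entirely by term statistics, which is exactly the point of the recapitulation.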
A half-century...
... that deserves considerable celebration
– but of course has a downside
So, let’s explore a little
– why we do lab experiments in the first place
– what the alternatives are
– what they might or might not tell us
– what is good and bad about them
– which directions they lead us in
– more importantly, which they deflect us from
and maybe, finally,
– how they might be improved
Note: this is my personal take on these questions!
Abstraction
Lab experiments involve abstraction
– choice of variables included/excluded
– control on variables
– restrictions on values/ranges of variables
[Note: models and theories also involve abstraction
– but usually different abstractions, for different reasons]
Why?
– First, to make them possible
Abstraction
Why else?
– study simple cases
– clarify relationships
– reduce noise
– ensure repeatability
– validate abstract theories
Example: Newton’s laws
The scientific method (simple-minded outline!)
Collect empirical data
– by observation and/or experiment
Formulate hypotheses/models/theories
Derive testable predictions
– about events which may be studied empirically
Conduct further observation/experiment
– designed to test predictions
Refine/reject models/theories
– and reiterate
Observation versus experiment (simple-minded outline again!)
Observation                             | Experiment
----------------------------------------|------------------------------
Wait for something to happen            | Cause something to happen
Measure control variables               | Set control variables
Wait for range of values to be covered  | Choose values to cover range
Tolerate extraneous variables           | Eliminate extraneous variables
Wait some more                          | Repeat
The experimental approach is a very powerful one
Given a simple choice, we would usually choose experiment over observation
– at least for hypothesis testing
... but the choice is rarely simple
Traditional science
The traditional image of science involves experiments in laboratories
– but actually this is misleading
Some sciences thrive in the laboratory
– e.g. chemistry, small-scale physics
Others have made a transition
– e.g. the biochemical end of biology
Others still are almost completely resistant
– e.g. astrophysics, geology
(not to mention such non-traditional sciences as economics)
Limitations of abstraction
Abstractions involve assumptions
– choosing one variable and eliminating another assumes that the two can be treated separately
– if an abstraction is built into an experiment, then its assumptions cannot be tested by the experiment
Even if we could do everything in a laboratory, we should not all do the same thing!
– that is, we should not all use the same abstractions based on the same assumptions
Limitations of abstraction
Some phenomena resist abstraction
– so that an abstract representation would be unrealistic or even illusory
This gives us the basic conflict
– between control and realism
Note: I have exaggerated the polarity between observation and experiment
– most investigations have elements of both
... but I have not exaggerated the conflict
– most investigations struggle seriously with it
– and have to make compromises
Research in IR
A conventional separation:
– Laboratory experiments in the Cranfield/TREC tradition, usually on ranking algorithms
– Observational experiments addressing user-oriented issues
Of course this is over-simplified
– there are laboratory experiments addressing other issues
• semantics, language, etc.
• user interaction etc.
– as well as observational experiments on algorithms
Research in IR
The Cranfield/TREC tradition is richer than it is often given credit for
– TREC tracks and spin-offs have pushed the boundaries of lab experimentation, with some different outcomes
Some examples:
– QA: Here NLP and some aspects of semantics / knowledge engineering are critical
– Cross-lingual: Here we need resources constructed from comparable corpora
– The web: Here we are beginning to extract useful knowledge from usage data and resources such as Wikipedia
All of these are unconventional
– although all are dominated by statistical ideas
Research in IR
Communities involved in user-oriented issues have developed laboratory methods
– in interactive tasks within TREC-like projects
– in new forms of lab experiments
Some core IR algorithm work is moving into observational user experiments
– particularly in the web environment
– particularly using click (and other user behaviour) data
Observational IR research
Aspects that suggest an observational approach:
– interaction (human-system)
– collaboration (human-human)
– temporal scale
– user cognition
– context
• task context
• user knowledge
Observational IR research
Issues:
– scale
• it is hard to expand the scale of an observational study
– reproducibility
• it is hard to perform an observational study in such a way that it can be repeated by someone else
– control
• it is hard to control the variables that might affect an experiment (either the independent variables of interest, or the noise variables)
Observational IR research
Advantages:
– realism
• we have more confidence that the results of an observational study represent some kind of reality
– context
• those (perhaps unknown) aspects of context that have an effect can be assumed to be present
Maybe another significant difference...
Hypothesis testing
Back to the scientific method:
– need to formulate predictions as testable hypotheses
Properly, any prediction of a model or theory is a candidate for this
– the objective is to test the model or theory
• not to achieve some practical result from it
– ideally, look for critical cases
• where the predictions of the model in question differ from those of other models
IR models and theories
What are IR models designed to tell us?
Different kinds of models might be expected to explain/predict many observables
... but in the Cranfield/TREC tradition, we usually interpret them in a narrow way
– specifically, we look only for effects on effectiveness
This seems to be a limitation in our ways of thinking about them
Hypothesis testing
At least some user-oriented studies in IR ask other questions
– and try to develop appropriate models/theories
• e.g. about user behaviour
Obviously we are interested in making systems better...
– but a model or theory may (should) tell us more than just how to achieve that aim
– and indeed other predictions may also be useful
Even statistical models could be interpreted more broadly
Other predictions (maybe accessible to statistical models)
Patterns of term occurrence
– maybe simply not believable
Calibrated probabilities of relevance
– hard to do but maybe useful
Clicks
– probability of click
– patterns of click behaviour
• e.g. click trails
Other behaviours
– abandonment
– reformulation
– dwell time
Probabilities of relevance
Usual assumption:
– do not need actual probabilities, only rank order
• the result of focussing on standard evaluation metrics
– independence models are typically bad at giving calibrated probabilities
Cooper suggested systems should give probabilities
– as a guide to the user
There are other practical reasons
– filtering
– combination of evidence
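One standard route from rank-only scores to calibrated probabilities is Platt scaling: fit a logistic curve mapping raw retrieval scores to probabilities of relevance. A minimal sketch follows; the scores and relevance judgements are invented for illustration, and a real study would fit on held-out data and check calibration (e.g. with a reliability diagram).

```python
import math

# Invented toy data: raw system scores with binary relevance judgements.
scores = [0.2, 0.4, 0.6, 0.9, 1.3, 1.7, 2.1, 2.5]
relevant = [0, 0, 0, 1, 0, 1, 1, 1]

# Platt scaling: fit p(rel | s) = 1 / (1 + exp(-(a*s + b)))
# by gradient descent on the log-loss (convex, so this converges).
a, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    grad_a = grad_b = 0.0
    for s, y in zip(scores, relevant):
        p = 1.0 / (1.0 + math.exp(-(a * s + b)))
        grad_a += (p - y) * s
        grad_b += (p - y)
    a -= lr * grad_a
    b -= lr * grad_b

def prob_relevant(s):
    # An actual probability of relevance, not just a rank position.
    return 1.0 / (1.0 + math.exp(-(a * s + b)))

print(round(prob_relevant(2.0), 2))
```

The output is exactly what filtering and combination-of-evidence need: a number on a common probability scale, rather than a score that is only comparable within one ranked list.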
Clicks
There is a new movement in statistical modelling for IR:
– we would like to integrate aspects of user behaviour into our models
– specifically clicks
Predicting patterns of click behaviour is a major component
– which gives us the impetus to investigate and test other kinds of hypothesis
Might use clicks to justify effectiveness metrics
– but such predictions may also be useful for other reasons
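One simple example of such a statistical model of click behaviour is the cascade model: the user scans results top-down, clicks at most once, and stops at the first attractive result. A minimal sketch, with invented attractiveness values:

```python
# Cascade click model: the user examines results top-down and stops at
# the first click, so P(click at rank i) = a_i * prod_{j<i} (1 - a_j),
# where a_j is the attractiveness of the document at rank j.
def cascade_click_probs(attractiveness):
    probs = []
    examined = 1.0  # probability the user reaches this rank
    for a in attractiveness:
        probs.append(examined * a)
        examined *= (1.0 - a)
    return probs

# Invented attractiveness values for a ranked list of four results.
attr = [0.6, 0.3, 0.5, 0.2]
probs = cascade_click_probs(attr)
print([round(p, 3) for p in probs])  # → [0.6, 0.12, 0.14, 0.028]
```

A model like this makes predictions that can be tested directly against observed click logs, which is precisely the kind of non-effectiveness prediction argued for above.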
In general
It seems to me that we should be trying to move in this direction
– Constructing models or theories which are capable of making other kinds of predictions
– Devising tests of these other predictions
• Laboratory tests
• Observational tests
… which would encourage rapprochement between the laboratory and observational traditions
Finally
I strongly believe in the science of search
– as a theoretical science
• in which models and theories have a major role to play
– and as an empirical science
• requiring the full range of empirical investigations
• including, specifically, both laboratory experiments and observational studies
The lack of a strong unified theory of IR reinforces the need for good empirical work