Retrieval Models: Probabilistic and Language Models

Allan, Ballesteros, Croft, and/or Turtle

Page 1: Retrieval Models

Retrieval Models

Probabilistic and Language Models

Page 2: Retrieval Models

Basic Probabilistic Model

• D represented by binary vector d = (d1, d2, …, dn), where di = 0/1 indicates absence/presence of term i
  – pi = P(di = 1|R) and 1 - pi = P(di = 0|R)
  – qi = P(di = 1|NR) and 1 - qi = P(di = 0|NR)

• Assume conditional independence: P(d|R) is the product of the probabilities for the components of d (i.e. the probability of getting a particular vector of 1's and 0's)

• Likelihood estimate converted to a linear discriminant function:
  g(d) = log [P(d|R) / P(d|NR)] + log [P(R) / P(NR)]
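To make the discriminant concrete, here is a minimal Python sketch (not from the slides): it scores a binary document vector under the independence assumption, using invented values for pi, qi, and the prior P(R).

    import math

    def g(d, p, q, prior_rel=0.5):
        """g(d) = log P(d|R)/P(d|NR) + log P(R)/P(NR), with conditionally
        independent binary term indicators di."""
        score = math.log(prior_rel / (1.0 - prior_rel))
        for di, pi, qi in zip(d, p, q):
            if di:                      # term present in the document
                score += math.log(pi / qi)
            else:                       # term absent from the document
                score += math.log((1.0 - pi) / (1.0 - qi))
        return score

    # Hypothetical estimates for a 3-term vocabulary.
    p = [0.8, 0.4, 0.3]   # P(di = 1 | R)
    q = [0.3, 0.2, 0.3]   # P(di = 1 | NR)
    print(g([1, 0, 1], p, q))   # higher score => more likely relevant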

Page 3: Retrieval Models

Basic Probabilistic Model

• Need to calculate
  – "relevant to query given term appears" and
  – "irrelevant to query given term appears"

• These values can be based upon known relevance judgments
  – ratio of relevant docs with the term to relevant docs without the term
  – ratio of non-relevant docs with the term to non-relevant docs without the term

• Rarely have relevance information
  – Estimating probabilities is the same problem as determining weighting formulae in less formal models
    • constant (Croft and Harper combination match)
    • proportional to probability of occurrence in collection
    • more accurately, proportional to log(probability of occurrence) – Greiff, 1998

Page 4: Retrieval Models

What is a Language Model?

• Probability distribution over strings of a text
  – How likely is a given string in a given "language"?
  – e.g. consider probabilities for the following strings:
    – p1 = P("a quick brown dog")
    – p2 = P("dog quick a brown")
    – p3 = P("быстрая brown dog")
    – p4 = P("быстрая собака")      (Russian: "quick", "dog")
  – English: p1 > p2 > p3 > p4

• … depends on what "language" we are modeling
  – In most of IR, assume that p1 == p2
  – For some applications we will want p3 to be highly probable

Page 5: Retrieval Models

Basic Probability Review

• Probabilities
  – P(s) = probability of event s occurring, e.g. P("moo")
  – P(s|M) = probability of s occurring given M
    • P("moo"|"cow") > P("moo"|"cat")
  – (Sum of P(s) over all s) = 1; P(not s) = 1 - P(s)

• Independence
  – If events w1 and w2 are independent, then P(w1 AND w2) = P(w1) * P(w2), so …
    P(q1, q2, …, qk | D) = ∏i=1..k P(qi | D)

• Bayes' rule:
    P(A|B) = P(B|A) P(A) / P(B)

Page 6: Retrieval Models

Language Models (LMs)

• What are we modeling?
  – M: "language" we are trying to model
  – s: observation (string of tokens from vocabulary)
  – P(s|M): probability of observing "s" in M

• M can be thought of as a "source" or a generator
  – A mechanism that can create strings that are legal in M
  – P(s|M) -> probability of getting "s" during random sampling from M

Page 7: Retrieval Models

LMs in IR

• Task: given a query, retrieve relevant documents

• Use LMs to model the process of query generation

• Every document in a collection defines a "language"
  – Consider all possible sentences the author could have written down
  – Some may be more likely to occur than others
    • Depend upon subject, topic, writing style, language, etc.
  – P(s|M) -> probability the author would write down string "s"
    • Like writing a billion variations of a doc and counting # of times we see "s"

Page 8: Retrieval Models

LMs in IR

• Now suppose "Q" is the user's query
  – What is the probability that the author would write down "Q"?

• Rank documents D in the collection by P(Q|MD)
  – Probability of observing "Q" during random sampling from the language model of document D

Page 9: Retrieval Models

LMs in IR

• Advantages:
  – Formal mathematical model (theoretical foundation)
  – Simple, well understood framework
  – Integrates both indexing and retrieval models
  – Natural use of collection statistics
  – Avoids tricky issues of "relevance", "aboutness"

• Disadvantages:
  – Difficult to incorporate notions of "relevance", user preferences
  – Relevance feedback/query expansion is not straightforward
  – Can't accommodate phrases, passages, Boolean operators

• There are extensions of LM which overcome some issues

Page 10: Retrieval Models

Major Issues in Applying LMs

• What kind of LM should we use?
  – Unigram
  – Higher-order models

• How can we estimate model parameters?
  – Can use smoothed relative frequency (counting) for estimation

• How can we use the model for ranking?

Page 11: Retrieval Models

Unigram LMs

• Words are "sampled" independently of each other
  – Metaphor: randomly pulling words from an urn "with replacement"
  – Joint probability decomposes into a product of marginals
  – Estimation of probabilities: simple counting

Page 12: Retrieval Models

Higher-order LMs

• Unigram model assumes word independence
  – Cannot capture surface form: P("fat cat") = P("cat fat")

• Higher-order models
  – N-gram: condition on preceding words
  – Cache: condition on a window
  – Grammar: condition on parse tree

• Are they useful?
  – No improvements from n-gram, grammar-based models
  – Some work on cache-like models
  – Parameter estimation is prohibitively expensive!

Page 13: Retrieval Models

Predominant Model is Multinomial

• Predominant model is multinomial
  – Fundamental event: what is the identity of the i'th query token?
  – Observation is a sequence of events, one for each query token

• Original model is multiple-Bernoulli
  – Fundamental event: does the word w occur in the query?
  – Observation is a vector of binary events, one for each possible word

Page 14: Retrieval Models

Multinomial or Multiple-Bernoulli?

• The two models are fundamentally different
  – entirely different event spaces (random variables involved)
  – both assume word independence (though it has different meanings)
  – both use smoothed relative-frequency (counting) for estimation

• Multinomial
  – can account for multiple word occurrences in the query
  – well understood: lots of research in related fields (and now in IR)
  – possibility for integration with ASR/MT/NLP (same event space)

• Multiple-Bernoulli
  – arguably better suited to IR (directly checks presence of query terms)
  – provisions for explicit negation of query terms ("A but not B")
  – no issues with observation length
    • binary events either occur or do not occur

Page 15: Retrieval Models

Ranking with Language Models

• Standard approach: query-likelihood
  – Estimate an LM MD for every document D in the collection
  – Rank docs by the probability of "generating" the query from the document (a small code sketch follows this slide):
    P(q1 … qk | MD) = ∏i P(qi | MD)

• Drawbacks:
  – no notion of relevance: everything is random sampling
  – user feedback / query expansion not part of the model
    • examples of relevant documents cannot help us improve the LM MD
    • the only option is augmenting the original query Q with extra terms
    • however, we could make use of sample queries for which D is relevant
  – does not directly allow weighted or structured queries
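A minimal sketch of query-likelihood ranking as described above. The toy documents, query, and the Jelinek-Mercer mixing weight λ are invented; some smoothing (covered on later slides) is needed so that query terms missing from a document do not zero out the product.

    import math
    from collections import Counter

    docs = {
        "d1": "the quick brown dog chased the cat".split(),
        "d2": "the dog slept in the quick shade".split(),
        "d3": "stock markets fell sharply today".split(),
    }
    query = "quick dog".split()

    # Collection model P(w|C): relative frequency over all documents.
    coll = Counter(w for toks in docs.values() for w in toks)
    coll_len = sum(coll.values())

    def query_log_likelihood(q_terms, doc_terms, lam=0.5):
        """log P(q1..qk | MD) = sum_i log P(qi | MD), with MD a unigram model
        interpolated with the collection model (Jelinek-Mercer style)."""
        tf, dlen = Counter(doc_terms), len(doc_terms)
        score = 0.0
        for w in q_terms:
            p_doc = tf[w] / dlen
            p_coll = coll[w] / coll_len
            # NOTE: p_coll is 0 for out-of-vocabulary terms; a real system
            # would reserve probability mass for those as well.
            score += math.log(lam * p_doc + (1 - lam) * p_coll)
        return score

    ranking = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d]), reverse=True)
    print(ranking)   # d1 and d2 should outrank d3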

Page 16: Retrieval Models

Ranking: Document-Likelihood

• Flip the direction of the query-likelihood approach
  – estimate a language model MQ for the query Q
  – rank docs D by the likelihood of being a random sample from MQ:
    P(D | MQ) = ∏w∈D P(w | MQ)
  – MQ is expected to "predict" a typical relevant document

• Problems:
  – different doc lengths, probabilities not comparable
  – favors documents that contain frequent (low-content) words
  – consider the "ideal" (highest-ranked) document for a given query

Page 17: Retrieval Models

Ranking: Model Comparison

• Combine advantages of the two ranking methods
  – estimate a model of both the query, MQ, and the document, MD
  – directly compare the similarity of the two models
  – a natural measure of similarity is cross-entropy (others exist):
    H(MQ || MD) = -∑w P(w | MQ) log P(w | MD)

• Cross-entropy is not symmetric: use H(MQ || MD)
  – The reverse works consistently worse, favors different docs
  – Use the reverse if ranking multiple queries with respect to one document
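A small sketch of ranking by H(MQ||MD) as defined above. The toy query, documents, and the Lidstone-style smoothing constant are assumptions for illustration, not part of the slides.

    import math
    from collections import Counter

    def unigram_model(tokens, vocab, eps=0.01):
        """Lidstone-smoothed unigram model over a fixed vocabulary."""
        tf = Counter(tokens)
        denom = len(tokens) + eps * len(vocab)
        return {w: (tf[w] + eps) / denom for w in vocab}

    def cross_entropy(mq, md):
        """H(MQ || MD) = - sum_w P(w|MQ) log P(w|MD); lower means more similar."""
        return -sum(pq * math.log(md[w]) for w, pq in mq.items() if pq > 0)

    docs = {
        "d1": "language models for information retrieval".split(),
        "d2": "probabilistic retrieval models and relevance".split(),
        "d3": "cooking recipes for busy people".split(),
    }
    query = "language models retrieval".split()

    vocab = set(query) | {w for toks in docs.values() for w in toks}
    mq = unigram_model(query, vocab)

    # Rank documents by increasing cross-entropy (most similar model first).
    for d in sorted(docs, key=lambda d: cross_entropy(mq, unigram_model(docs[d], vocab))):
        print(d, round(cross_entropy(mq, unigram_model(docs[d], vocab)), 3))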

Page 18: Retrieval Models

Summary of LMs

• Use unigram models
  – No consistent benefit from using higher-order models
  – Estimation is much more complex (bi-gram, etc.)

• Use multinomial models
  – Well studied, consistent with other fields
  – Extend the multiple-Bernoulli model to non-binary events?

• Use model comparison for ranking
  – Allows feedback, expansion, etc.

• Estimation is a crucial step
  – very significant impact on performance (more than other choices)
  – key to cross-language, cross-media and other applications

Page 19: Retrieval Models

Estimation

• Want to estimate MQ and/or MD from Q and/or D

• General problem:
  – given a string of text S (= Q or D), estimate its language model MS
  – S is commonly assumed to be an i.i.d. random sample from MS
    • Independent and identically distributed

• Basic language models:
  – maximum-likelihood estimator and the zero-frequency problem
  – discounting techniques:
    • Laplace correction, Lidstone correction, absolute discounting, leave-one-out discounting, Good-Turing method
  – interpolation/back-off techniques:
    • Jelinek-Mercer smoothing, Dirichlet smoothing, Witten-Bell smoothing, Zhai-Lafferty two-stage smoothing, interpolation vs. back-off techniques
  – Bayesian estimation

Page 20: Retrieval Models

Maximum Likelihood

• Count relative frequencies of words in S (a small code sketch follows this slide)
  – Pml(w | MS) = #(w, S) / |S|

• Maximum-likelihood property:
  – assigns the highest possible likelihood to the observation

• Unbiased estimator:
  – if we repeat estimation an infinite number of times with different starting points S, we will get the correct probabilities (on average)
  – this is not very useful…
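A minimal sketch of the maximum-likelihood estimate Pml(w|MS) = #(w,S)/|S| on an invented sample string; it also shows the zero probability that the next slide warns about.

    from collections import Counter

    S = "the cat sat on the mat".split()

    counts = Counter(S)
    p_ml = {w: c / len(S) for w, c in counts.items()}   # Pml(w|MS) = #(w,S)/|S|

    print(p_ml["the"])            # 2/6 = 0.333...
    print(p_ml.get("dog", 0.0))   # 0.0 -- unseen word gets zero probability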

Page 21: Retrieval Models

The Zero-Frequency Problem

• Suppose some event is not in our observation S
  – Model will assign zero probability to that event
  – And to any set of events involving the unseen event

• Happens very frequently with language

• It is incorrect to infer zero probabilities
  – especially when creating a model from short samples

Page 22: Retrieval Models

Discounting Methods

• Laplace correction:
  – add 1 to every count, normalize
  – problematic for large vocabularies

• Lidstone correction:
  – add a small constant ε to every count, re-normalize

• Absolute discounting:
  – subtract a constant ε, re-distribute the probability mass

Page 23: Retrieval Models

Discounting Methods

• Smoothing: two possible approaches

• Interpolation
  – Adjust probabilities for all events, both seen and unseen
    • "interpolate" ML estimates with General English expectations (computed as the relative frequency of a word in a large collection)
    • reflects expected frequency of events

• Back-off:
  – Adjust probabilities only for unseen events
  – Leave non-zero probabilities as they are
  – Rescale everything to sum to one:
    • rescales "seen" probabilities by a constant

• Interpolation tends to work better
  – And has a cleaner probabilistic interpretation
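A short sketch contrasting two of the techniques listed above, Lidstone/Laplace discounting and Jelinek-Mercer interpolation with a background (collection) model, on invented counts. Both give the unseen word "dog" non-zero probability.

    from collections import Counter

    S = "the cat sat on the mat".split()
    background = "the dog and the cat and the bird".split()   # stand-in for a large collection

    vocab = set(S) | set(background)
    tf, n = Counter(S), len(S)
    bg_tf, bg_n = Counter(background), len(background)

    def lidstone(w, eps=0.1):
        """Add eps to every count, re-normalize (eps = 1 gives the Laplace correction)."""
        return (tf[w] + eps) / (n + eps * len(vocab))

    def jelinek_mercer(w, lam=0.7):
        """Interpolate the ML estimate with the background model."""
        return lam * (tf[w] / n) + (1 - lam) * (bg_tf[w] / bg_n)

    for w in ["the", "cat", "dog"]:
        print(w, round(lidstone(w), 3), round(jelinek_mercer(w), 3))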

Page 24: Retrieval Models

Types of Evaluation

• Might evaluate several aspects
  – Assistance in formulating queries
  – Speed of retrieval
  – Resources required
  – Presentation of documents
  – Ability to find relevant documents

• Evaluation is generally comparative
  – System A vs. B
  – System A vs. A′

• Most common evaluation: retrieval effectiveness

Page 25: Retrieval Models

The Concept of Relevance

• Relevance of a document D to a query Q is subjective

– Different users will have different judgments

– Same users may judge differently at different times

– Degree of relevance of different documents may vary

Page 26: Retrieval Models

The Concept of Relevance

• In evaluating IR systems it is assumed that:
  – A subset of the documents of the database (DB) is relevant
  – A document is either relevant or not

Page 27: Retrieval Models

Relevance

• In a small collection - the relevance of each document can be checked

• With real collections, never know full set of relevant documents

• Any retrieval model includes an implicit definition of relevance
  – Satisfiability of a FOL expression
  – Distance
  – P(Relevance | query, document)

Page 28: Retrieval Models

Evaluation

• Set of queries
• Collection of documents (corpus)
• Relevance judgements: which documents are correct and incorrect for each query

  [Figure: example query "Potato farming and nutritional value of potatoes" with candidate documents ("Mr. Potato Head …", "nutritional info for spuds", "potato blight … growing potatoes …") marked relevant/non-relevant]

• If small collection, can review all documents
• Not practical for large collections

Any ideas about how we might approach collecting relevance judgments for very large collections?

Page 29: Retrieval Models

Finding Relevant Documents

• Pooling
  – Retrieve documents using several automatic techniques
  – Judge top n documents for each technique
  – Relevant set is the union
  – A subset of the true relevant set

• Possible to estimate the size of the relevant set by sampling

• When testing:
  – How should unjudged documents be treated?
  – How might this affect results?

Page 30: Retrieval Models

Test Collections

• To compare the performance of two techniques:
  – each technique used to evaluate the same queries
  – results (set or ranked list) compared using a metric
  – most common measures: precision and recall

• Usually use multiple measures to get different views of performance

• Usually test with multiple collections
  – performance is collection dependent

Page 31: Retrieval Models

Evaluation

  [Venn diagram: retrieved documents, relevant documents, and their intersection (Rel & Ret)]

• Recall = |Ret & Rel| / |Relevant|
  – Ability to return ALL relevant items.

• Precision = |Ret & Rel| / |Retrieved|
  – Ability to return ONLY relevant items.

Let retrieved = 100, relevant = 25, rel & ret = 10

  Recall = 10/25 = .40
  Precision = 10/100 = .10

Page 32: Retrieval Models

Precision and Recall

• Precision and recall are well-defined for sets

• For ranked retrieval
  – Compute value at fixed recall points (e.g. precision at 20% recall)
  – Compute a P/R point for each relevant document, interpolate
  – Compute value at fixed rank cutoffs (e.g. precision at rank 20)

Page 33: Retrieval Models

Precision at Fixed Recall: 1 Query

• Let Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
• |Rq| = 10, the number of relevant docs for q
• Ranking of retrieved docs in the answer set of q:

  1. d123 (*)    6. d9 (*)     11. d38
  2. d84         7. d511       12. d48
  3. d56 (*)     8. d129       13. d250
  4. d6          9. d187       14. d113
  5. d8         10. d25 (*)    15. d3 (*)

Find precision given the total number of docs retrieved at a given recall value.

10% Recall => .1 * 10 rel docs = 1 rel doc retrieved
  One doc retrieved to get 1 rel doc: precision = 1/1 = 100%

Page 34: Retrieval Models

Precision at Fixed Recall: 1 Query (continued)

• Let Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
• |Rq| = 10, the number of relevant docs for q
• Ranking of retrieved docs in the answer set of q (as above)

Find precision given the total number of docs retrieved at a given recall value.

10% Recall => .1 * 10 rel docs = 1 rel doc retrieved
  One doc retrieved to get 1 rel doc: precision = 1/1 = 100%

20% Recall => .2 * 10 rel docs = 2 rel docs retrieved
  3 docs retrieved to get 2 rel docs: precision = 2/3 = 0.667

Page 35: Retrieval Models

Precision at Fixed Recall: 1 Query (continued)

• Let Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
• |Rq| = 10, the number of relevant docs for q
• Ranking of retrieved docs in the answer set of q (as above)

10% Recall => .1 * 10 rel docs = 1 rel doc retrieved
  One doc retrieved to get 1 rel doc: precision = 1/1 = 100%

20% Recall => .2 * 10 rel docs = 2 rel docs retrieved
  3 docs retrieved to get 2 rel docs: precision = 2/3 = 0.667

30% Recall => .3 * 10 rel docs = 3 rel docs retrieved
  6 docs retrieved to get 3 rel docs: precision = 3/6 = 0.5

What is precision at recall values from 40-100%?

Page 36: Retrieval Models

Recall/Precision Curve

• |Rq| = 10, the number of relevant docs for q
• Ranking of retrieved docs in the answer set of q:

  1. d123 (*)    5. d8          9. d187      13. d250
  2. d84         6. d9 (*)     10. d25 (*)   14. d113
  3. d56 (*)     7. d511       11. d38       15. d3 (*)
  4. d6          8. d129       12. d48

  Recall   Precision
  0.1      1/1  = 1.0
  0.2      2/3  = 0.67
  0.3      3/6  = 0.5
  0.4      4/10 = 0.4
  0.5      5/15 = 0.33
  0.6      0
  …        …
  1.0      0

  [Figure: recall/precision curve plotting these points, precision vs. recall]
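A short sketch (mine, not part of the slides) that recomputes the non-zero recall/precision points above from the ranked list and Rq.

    ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
               "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
    relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

    rel_seen = 0
    points = []   # (recall, precision) at each relevant document in the ranking
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            rel_seen += 1
            points.append((rel_seen / len(relevant), rel_seen / rank))

    for r, p in points:
        print(f"recall {r:.1f}  precision {p:.2f}")
    # recall 0.1 -> 1.00, 0.2 -> 0.67, 0.3 -> 0.50, 0.4 -> 0.40, 0.5 -> 0.33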

Page 37: Retrieval Models

Averaging

• Hard to compare individual P/R graphs or tables

• Microaverage: each relevant doc is a point in the average
  – done with respect to a parameter
    • e.g. coordination level matching (# of shared query terms)
  – average across the total number of relevant documents across each match level

• Let Lq = relevant docs for query q, Tq(ℓ) = retrieved docs for q at or above coordination level ℓ:

  L̃ = ∑q∈Q |Lq|       = total # of relevant docs over all queries
  T̃(ℓ) = ∑q∈Q |Tq(ℓ)|  = total # of retrieved docs at or above level ℓ over all queries

Page 38: Retrieval Models

Averaging

• Let 100 & 80 be the # of relevant docs for queries 1 & 2, respectively
  – calculate actual recall and precision values for both queries

  Level   |T1|   |L1 ∩ T1|   |T2|   |L2 ∩ T2|
    1       10       10        10        8
    2       25       20        40       24
    3       66       40        80       40
    4      150       60       140       56
    5      266       80       180       72

  Query 1   Recall:  10/100 = 0.1   20/100 = 0.2   40/100 = 0.4   60/100 = 0.6   80/100 = 0.8
            Prec:    10/10  = 1.0   20/25  = 0.8   40/66  = 0.6   60/150 = 0.4   80/266 = 0.3

  Query 2   Recall:  8/80 = 0.1     24/80 = 0.3    40/80 = 0.5    56/80 = 0.7    72/80 = 0.9
            Prec:    8/10 = 0.8     24/40 = 0.6    40/80 = 0.5    56/140 = 0.4   72/180 = 0.4

Hard to compare individual precision/recall tables, so take (micro)averages.
There are 80 + 100 = 180 relevant docs for all queries:

  Recall:     (10+8)/180 = 0.1        (20+24)/180 = 0.24       (40+40)/180 = 0.44       (60+56)/180 = 0.64    (80+72)/180 = 0.84
  Precision:  (10+8)/(10+10) = 0.9    (20+24)/(25+40) = 0.68   (40+40)/(66+80) = 0.55   (60+56)/290 = 0.4     (80+72)/446 = 0.34
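A small sketch of the micro-averaging arithmetic above: totals of relevant-retrieved are divided by the total number of relevant docs (for recall) or total retrieved (for precision) at each coordination level. The numbers are copied from the table.

    # Per-level (retrieved, relevant-retrieved) counts for queries 1 and 2.
    q1 = [(10, 10), (25, 20), (66, 40), (150, 60), (266, 80)]
    q2 = [(10, 8), (40, 24), (80, 40), (140, 56), (180, 72)]
    total_relevant = 100 + 80   # 180 relevant docs over both queries

    for (ret1, rel1), (ret2, rel2) in zip(q1, q2):
        micro_recall = (rel1 + rel2) / total_relevant
        micro_precision = (rel1 + rel2) / (ret1 + ret2)
        print(f"recall {micro_recall:.2f}  precision {micro_precision:.2f}")
    # 0.10/0.90, 0.24/0.68, 0.44/0.55, 0.64/0.40, 0.84/0.34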

Page 39: Retrieval Models

Averaging and Interpolation

• Macroaverage: each query is a point in the average
  – can be independent of any parameter
  – average of precision values across several queries at standard recall levels

• e.g.) assume 3 relevant docs are retrieved at ranks 4, 9, 20
  – their actual recall points are: .33, .67, and 1.0 (why?)
  – their precision is .25, .22, and .15 (why?)

• Average over all relevant docs
  – rewards systems that retrieve relevant docs at the top
  – (.25 + .22 + .15)/3 = 0.21
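A quick check of the worked example above in code; only the three relevant-document ranks are needed.

    relevant_ranks = [4, 9, 20]   # ranks at which the 3 relevant docs appear
    R = len(relevant_ranks)

    precisions = [(i + 1) / rank for i, rank in enumerate(relevant_ranks)]
    recalls = [(i + 1) / R for i in range(R)]

    print([round(r, 2) for r in recalls])      # [0.33, 0.67, 1.0]
    print([round(p, 2) for p in precisions])   # [0.25, 0.22, 0.15]
    print(round(sum(precisions) / R, 2))       # 0.21 -- average over all relevant docs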

Page 40: Retrieval Models

Averaging and Interpolation

• Interpolation
  – actual recall levels of individual queries are seldom equal to standard levels
  – interpolation estimates the best possible performance value between two known values
  – e.g.) assume 3 relevant docs are retrieved at ranks 4, 9, 20
  – their precision at actual recall is .25, .22, and .15

Page 41: Retrieval Models

Averaging and Interpolation

• Actual recall levels of individual queries are seldom equal to standard levels

• Interpolated precision at the i-th recall level, Ri, is the maximum precision at all points p such that Ri ≤ p ≤ Ri+1
  – assume only 3 relevant docs are retrieved, at ranks 4, 9, 20
  – their actual recall points are: .33, .67, and 1.0
  – their precision is .25, .22, and .15
  – what is the interpolated precision at the standard recall points?

  Recall level              Interpolated precision
  0.0, 0.1, 0.2, 0.3        0.25
  0.4, 0.5, 0.6             0.22
  0.7, 0.8, 0.9, 1.0        0.15
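A small sketch of interpolation at the standard recall levels. It uses the common "maximum precision at any recall point at or above this level" rule, which reproduces the table above.

    # (actual recall, precision) points from the example: relevant docs at ranks 4, 9, 20.
    points = [(1/3, 0.25), (2/3, 0.22), (1.0, 0.15)]
    standard_levels = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0

    def interpolated_precision(level):
        """Max precision over all observed points with recall >= level (0 if none)."""
        candidates = [p for r, p in points if r >= level]
        return max(candidates, default=0.0)

    for level in standard_levels:
        print(f"{level:.1f}  {interpolated_precision(level):.2f}")
    # 0.0-0.3 -> 0.25, 0.4-0.6 -> 0.22, 0.7-1.0 -> 0.15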

Page 42: Retrieval Models

Recall-Precision Tables & Graphs

Precision – 44 queries

  Recall    Terms   Phrases
  0         88.2    90.8 (+2.9)
  10        82.4    86.1 (+4.5)
  20        77.0    79.8 (+3.6)
  30        71.1    75.6 (+5.4)
  40        65.1    68.7 (+5.4)
  50        60.3    64.1 (+6.2)
  60        53.3    55.6 (+4.4)
  70        44.0    47.3 (+7.5)
  80        37.2    39.0 (+4.6)
  90        23.1    26.6 (+15.1)
  100       12.7    14.2 (+11.4)
  average   55.9    58.9 (+5.3)

  [Figure: precision vs. recall curves for the Terms and Phrases runs]

Page 43: Retrieval Models

Document Level Averages

• Precision after a given number of docs retrieved
  – e.g.) 5, 10, 15, 20, 30, 100, 200, 500, & 1000 documents

• Reflects the actual system performance as a user might see it

• Each precision average is computed by summing precisions at the specified doc cut-off and dividing by the number of queries
  – e.g. average precision over all queries at the point where n docs have been retrieved
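A tiny sketch of a document-level average (precision at fixed cutoffs, averaged over queries); the rankings and relevance judgments are invented.

    def precision_at(k, ranking, relevant):
        """Precision after the top k retrieved documents."""
        return sum(1 for d in ranking[:k] if d in relevant) / k

    # Hypothetical results for two queries: (ranking, relevant set).
    runs = [
        (["d1", "d7", "d3", "d9", "d2"], {"d1", "d3", "d5"}),
        (["d4", "d8", "d6", "d2", "d5"], {"d8", "d2"}),
    ]

    for k in (1, 5):
        avg = sum(precision_at(k, rank, rel) for rank, rel in runs) / len(runs)
        print(f"average precision@{k}: {avg:.2f}")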

Page 44: Retrieval Models

R-Precision

• Precision after R documents are retrieved
  – R = number of relevant docs for the query

• Average R-Precision
  – mean of the R-Precisions across all queries

• e.g.) Assume 2 queries having 50 & 10 relevant docs; the system retrieves 17 and 7 relevant docs in the top 50 and 10 documents retrieved, respectively:

  Average R-Precision = (17/50 + 7/10) / 2 = 0.52
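A small sketch reproducing the R-Precision example above; the helper shows the general form, while the example itself only needs the summary counts.

    def r_precision(ranking, relevant):
        """Precision after R = |relevant| documents have been retrieved."""
        R = len(relevant)
        return sum(1 for d in ranking[:R] if d in relevant) / R

    # The example uses summary counts rather than full rankings:
    # query 1: 17 of the top 50 retrieved are relevant (R = 50)
    # query 2:  7 of the top 10 retrieved are relevant (R = 10)
    average_r_precision = (17 / 50 + 7 / 10) / 2
    print(average_r_precision)   # 0.52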

Page 45: Retrieval Models

Evaluation

• Recall-Precision value pairs may co-vary in ways that are hard to understand

• Would like to find composite measures
  – A single-number measure of effectiveness
  – primarily ad hoc and not theoretically justifiable

• Some attempts to invent measures that combine parts of the contingency table into a single number measure

Page 46: Retrieval Models

Contingency Table

                   Relevant    Not Relevant
  Retrieved           A             B
  Not Retrieved       C             D

  Relevant = A + C
  Retrieved = A + B
  Collection size = A + B + C + D

  Precision = A / (A + B)
  Recall = A / (A + C)
  Fallout = B / (B + D) = P(retrieved | not relevant)
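The contingency-table measures transcribed into code, with cell counts chosen to reproduce the earlier example (retrieved = 100, relevant = 25, rel & ret = 10); the value of D is otherwise arbitrary.

    # Hypothetical contingency-table counts.
    A, B, C, D = 10, 90, 15, 9885   # A = relevant & retrieved, D = neither

    precision = A / (A + B)   # 10/100 = 0.10
    recall    = A / (A + C)   # 10/25  = 0.40
    fallout   = B / (B + D)   # P(retrieved | not relevant)

    print(precision, recall, round(fallout, 4))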

Page 47: Retrieval Models

Symmetric Difference

• A is the retrieved set of documents
• B is the relevant set of documents
• A Δ B (the symmetric difference) is the shaded area in the Venn diagram

Page 48: Retrieval Models

E measure (van Rijsbergen)

• Used to emphasize precision or recall
  – like a weighted average of precision and recall

    E = 1 - 1 / (α(1/P) + (1 - α)(1/R))

• Large α increases the importance of precision
  – can transform via α = 1/(β² + 1), β = P/R
  – when α = 1/2, β = 1: precision and recall are equally important

• E = normalized symmetric difference of retrieved and relevant sets
  E(β=1) = |A Δ B| / (|A| + |B|)

• F = 1 - E is typical (good results mean larger values of F)

    F = (β² + 1)PR / (β²P + R)
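A quick sketch of the E and F formulas above; the P, R, and β values are arbitrary.

    def f_measure(P, R, beta=1.0):
        """F = (beta^2 + 1) P R / (beta^2 P + R)."""
        return (beta**2 + 1) * P * R / (beta**2 * P + R)

    def e_measure(P, R, beta=1.0):
        """van Rijsbergen's E = 1 - F (with alpha = 1 / (beta^2 + 1))."""
        return 1.0 - f_measure(P, R, beta)

    P, R = 0.25, 0.4
    print(round(f_measure(P, R), 3))            # balanced F1 (harmonic mean) = 0.308
    print(round(e_measure(P, R), 3))            # 0.692
    print(round(f_measure(P, R, beta=0.5), 3))  # beta < 1 emphasizes precision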

Page 49: Retrieval Models

Other Single-Valued Measures

• Breakeven point
  – point at which precision = recall

• Expected search length
  – saves users from having to look through non-relevant docs

• Swets model
  – uses statistical decision theory to express recall, precision, and fallout in terms of conditional probabilities

• Utility measures
  – assign costs to each cell in the contingency table
  – sum (or average) costs for all queries

• Many others...