probabilistic retrieval

INCORPORATING PROBABILISTIC RETRIEVAL KNOWLEDGE INTO TFIDF-BASED SEARCH ENGINE Alex Lin Senior Architect Intelligent Mining alin at IntelligentMinining.com

Upload: otisg

Post on 15-Jan-2015


DESCRIPTION

From http://www.meetup.com/NYC-Search-and-Discovery/calendar/11745435/

TRANSCRIPT

Page 1: Probabilistic Retrieval

INCORPORATING PROBABILISTIC RETRIEVAL KNOWLEDGE INTO TFIDF-BASED SEARCH ENGINE Alex Lin Senior Architect Intelligent Mining alin at IntelligentMinining.com

Page 2: Probabilistic Retrieval

Overview of Retrieval Models
• Boolean Retrieval
• Vector Space Model
• Probabilistic Model
• Language Model

Page 3: Probabilistic Retrieval

Boolean Retrieval
• Example query: lincoln AND NOT (car AND automobile)
• The earliest model, and still in use today
• Results are very easy to explain to users
• Computationally very efficient
• Major drawback: it lacks a sophisticated ranking algorithm
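The query on this slide can be evaluated with plain set operations over an inverted index. The following is a minimal sketch, not a production engine; the four toy documents and the helper names `build_index` and `postings` are invented for illustration.

```python
# Toy collection; the documents are invented for illustration only.
docs = {
    1: "lincoln was the sixteenth president",
    2: "the lincoln car is an automobile",
    3: "lincoln town car review",
    4: "a used automobile for sale",
}

def build_index(docs):
    """Map each term to the set of doc ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)

def postings(term):
    """Return the set of doc ids for a term (empty set if unseen)."""
    return index.get(term, set())

# lincoln AND NOT (car AND automobile)
result = postings("lincoln") - (postings("car") & postings("automobile"))
print(sorted(result))
```

Every matching document is returned with no ordering among the matches, which is the ranking drawback noted above.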

Page 4: Probabilistic Retrieval

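Vector Space Model

[Figure: Doc1, Doc2, and Query drawn as vectors in a space whose axes are terms (Term2, Term3).]

$$\cos(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^{2} \cdot \sum_{j=1}^{t} q_j^{2}}}$$

Major flaws: It lacks guidance on the details of how weighting and ranking algorithms are related to relevance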
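As a small worked example of the cosine measure, the sketch below scores two documents against a query. The three-term vectors are made-up weights, and the function name `cosine` is only illustrative.

```python
import math

def cosine(d, q):
    """Cosine of the angle between a document vector d and a query vector q."""
    dot = sum(dj * qj for dj, qj in zip(d, q))
    norm = math.sqrt(sum(dj * dj for dj in d) * sum(qj * qj for qj in q))
    # Guard against all-zero vectors, for which the cosine is undefined.
    return dot / norm if norm else 0.0

# Hypothetical term weights over a three-term vocabulary.
doc1 = [1.0, 1.0, 0.0]
doc2 = [0.0, 1.0, 1.0]
query = [1.0, 0.0, 0.0]

print(cosine(doc1, query))
print(cosine(doc2, query))
```

Note that the function says nothing about how the term weights themselves should be chosen, which is the gap the slide points out.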

Page 5: Probabilistic Retrieval

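Probabilistic Retrieval Model

[Figure: a Document is assigned to the Relevant class with probability P(R|D) or to the Non-Relevant class with probability P(NR|D).]

Bayes' Rule:

$$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)}$$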

Page 6: Probabilistic Retrieval

Probabilistic Retrieval Model

$$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)}$$

$$P(NR \mid D) = \frac{P(D \mid NR)\, P(NR)}{P(D)}$$

• If $P(D \mid R)\, P(R) > P(D \mid NR)\, P(NR)$, then classify D as relevant
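Because P(D) appears in both posteriors, the comparison only needs the two likelihoods and the prior. A minimal sketch of that decision rule follows; the function name and the probability values are hypothetical, and P(NR) is assumed to equal 1 − P(R).

```python
def is_relevant(p_d_given_r, p_d_given_nr, p_r):
    """Apply the rule P(D|R)P(R) > P(D|NR)P(NR), assuming P(NR) = 1 - P(R)."""
    p_nr = 1.0 - p_r
    return p_d_given_r * p_r > p_d_given_nr * p_nr

# Hypothetical numbers: only 10% of documents are relevant a priori.
print(is_relevant(0.02, 0.001, p_r=0.1))
print(is_relevant(0.02, 0.01, p_r=0.1))
```

In the second call the document is still more likely under R than under NR, but not by enough to overcome the low prior.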

Page 7: Probabilistic Retrieval

Estimate P(D|R) and P(D|NR)
• Define

$$D = (d_1, d_2, \ldots, d_t)$$

then

$$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R)$$

$$P(D \mid NR) = \prod_{i=1}^{t} P(d_i \mid NR)$$

• Binary Independence Model: term independence + binary features in documents
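Under these two assumptions the likelihood of a document is a product over terms: a term that is present contributes its occurrence probability, and an absent term contributes the complement. The sketch below illustrates this; the function name and the per-term probabilities are made up.

```python
def bim_likelihood(doc_vector, term_probs):
    """P(D | class) under the Binary Independence Model.

    doc_vector: 0/1 indicator per term (d_1 ... d_t).
    term_probs: P(d_i = 1 | class) for each term.
    """
    likelihood = 1.0
    for d_i, prob in zip(doc_vector, term_probs):
        likelihood *= prob if d_i == 1 else (1.0 - prob)
    return likelihood

D = [1, 0, 1]
p = [0.8, 0.3, 0.5]  # hypothetical P(d_i = 1 | R)
s = [0.2, 0.3, 0.1]  # hypothetical P(d_i = 1 | NR)

p_d_r = bim_likelihood(D, p)    # 0.8 * 0.7 * 0.5
p_d_nr = bim_likelihood(D, s)   # 0.2 * 0.7 * 0.1
print(p_d_r, p_d_nr, p_d_r / p_d_nr)
```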

Page 8: Probabilistic Retrieval

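Likelihood Ratio
• Likelihood ratio: classify D as relevant when

$$\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}$$

Under the Binary Independence Model:

$$\frac{P(D \mid R)}{P(D \mid NR)} = \prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}$$

Dropping the factor that is the same for every document and taking the logarithm gives a rank-equivalent score:

$$\sum_{i:d_i=1} \log \frac{p_i(1-s_i)}{s_i(1-p_i)}$$

Restricting the sum to query terms and estimating $p_i$ and $s_i$ from counts with 0.5 smoothing gives:

$$\sum_{i:d_i=q_i=1} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)}$$

p_i: probability of term i occurring in the relevant set
s_i: probability of term i occurring in the non-relevant set
N: total number of documents in the collection
n_i: number of documents that contain term i
R: total number of relevant documents
r_i: number of relevant documents that contain term i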
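The final term weight can be computed directly from the four counts. The sketch below assumes N counts all documents in the collection and n_i all documents containing term i, which is what the terms n_i − r_i and N − n_i − R + r_i in the formula imply. The function names and the counts are hypothetical.

```python
import math

def rsj_weight(r_i, n_i, R, N):
    """Smoothed log-odds weight of one term, from relevance counts."""
    relevant_odds = (r_i + 0.5) / (R - r_i + 0.5)
    non_relevant_odds = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)
    return math.log(relevant_odds / non_relevant_odds)

def bim_score(query_terms, doc_terms, stats, R, N):
    """Sum the weights of terms that occur in both query and document.

    stats maps term -> (r_i, n_i).
    """
    score = 0.0
    for term in query_terms:
        if term in doc_terms:
            r_i, n_i = stats[term]
            score += rsj_weight(r_i, n_i, R, N)
    return score

# Hypothetical counts: 1000 documents, 10 judged relevant.
stats = {"lincoln": (8, 10), "car": (1, 200)}
print(bim_score(["lincoln", "car"], {"lincoln", "speech"}, stats, R=10, N=1000))

# With no relevance information (R = 0, r_i = 0) the weight becomes
# log((N - n_i + 0.5) / (n_i + 0.5)), an IDF-like quantity.
print(rsj_weight(0, 10, 0, 1000))
```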

Page 9: Probabilistic Retrieval

Combine with BM25 Ranking Algorithm
• BM25 extends the scoring function for the binary independence model to include document and query term weights.
• It performs very well in TREC experiments.

$$R(q,D) = \sum_{i \in Q} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)} \cdot \frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}$$

$$K = k_1\left((1-b) + b \cdot \frac{dl}{avgdl}\right)$$

k_1, k_2, b: tuning parameters
f_i: frequency of term i in the document
qf_i: frequency of term i in the query
dl: document length
avgdl: average document length in the data set
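The scoring function can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: documents and queries are plain token lists, N is taken to be the collection size and n_i the document frequency of term i, the defaults k1 = 1.2, k2 = 100, b = 0.75 are commonly quoted values and not taken from the slides, and the function and argument names are invented.

```python
import math

def bm25(query, doc, df, N, avgdl, k1=1.2, k2=100.0, b=0.75, rel_df=None, R=0):
    """BM25 score of one document for one query.

    query, doc: lists of tokens.
    df: term -> n_i (documents containing the term).
    rel_df: term -> r_i (relevant documents containing the term), optional.
    """
    rel_df = rel_df or {}
    dl = len(doc)
    K = k1 * ((1 - b) + b * dl / avgdl)
    score = 0.0
    for term in set(query):
        f_i = doc.count(term)
        if f_i == 0:
            continue  # the TF factor would be zero anyway
        qf_i = query.count(term)
        n_i = df[term]
        r_i = rel_df.get(term, 0)
        idf_part = math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                            ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))
        tf_part = (k1 + 1) * f_i / (K + f_i)
        qtf_part = (k2 + 1) * qf_i / (k2 + qf_i)
        score += idf_part * tf_part * qtf_part
    return score

# Hypothetical three-token documents in a 100-document collection.
df = {"lincoln": 3, "president": 10}
doc_a = ["lincoln", "president", "speech"]
doc_b = ["car", "review", "sale"]
query = ["president", "lincoln"]
print(bm25(query, doc_a, df, N=100, avgdl=3.0))
print(bm25(query, doc_b, df, N=100, avgdl=3.0))
```

With dl equal to avgdl and every term occurring once, both frequency factors are 1, so the score of `doc_a` reduces to the sum of the two log weights.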

Page 10: Probabilistic Retrieval

Weighted Fields Boolean Search

[Table: one row per document (doc-id 1, 2, 3, …, n); columns doc-id, field0, field1, …, text.]

$$R(q,D) = \sum_{i \in q} \sum_{f \in \mathrm{fields}} w_f\, m_{if}$$
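The slide does not define the symbols, so the sketch below assumes that w_f is a per-field weight and that m_if is 1 when query term i occurs in field f and 0 otherwise. The field names, weights, and documents are all hypothetical.

```python
# Hypothetical per-field weights w_f.
FIELD_WEIGHTS = {"title": 3.0, "category": 2.0, "text": 1.0}

def field_score(query_terms, doc, weights):
    """Sum w_f * m_if over query terms i and fields f (assumed binary m_if)."""
    score = 0.0
    for term in query_terms:
        for field, w_f in weights.items():
            tokens = doc.get(field, "").lower().split()
            m_if = 1 if term in tokens else 0
            score += w_f * m_if
    return score

doc_1 = {
    "title": "Buzz Lightyear",
    "category": "toys",
    "text": "buzz lightyear action figure from toy story",
}
doc_2 = {"title": "Toy Story", "text": "features buzz and woody"}

query = ["buzz", "lightyear"]
print(field_score(query, doc_1, FIELD_WEIGHTS))
print(field_score(query, doc_2, FIELD_WEIGHTS))
```

A match in a heavily weighted field such as the title counts for more than a match in the body text, which gives Boolean matching a simple ranking.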

Page 11: Probabilistic Retrieval

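Apply Probabilistic Knowledge into Fields

[Table: one row per document (doc-id 1, 2, 3, …, n); columns doc-id, field0, field1, …, Text. The row for doc-id 2 contains "Lightyear" and "Buzz".]

[Figure: a Document is assigned to Relevant with probability P(R|D) or to Non-Relevant with probability P(NR|D); a gradient runs from Higher to Lower.]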

Page 12: Probabilistic Retrieval

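Use the Knowledge during Ranking

[Table: one row per document (doc-id 1, 2, 3, …, n); columns doc-id, field0, field1, …, Text. The row for doc-id 2 contains "Lightyear" and "Buzz".]

• The goal is:

$$\log P(D \mid R) = \log \prod_{i=1}^{t} P(d_i \mid R) = \sum_{i=1}^{t} \log P(d_i \mid R) \approx \sum_{i \in q} \sum_{f \in F} w_f\, m_{if}$$

[Slide annotation on the formula: "Learnable".]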

Page 13: Probabilistic Retrieval

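Comparison of Approaches

$$R_{TF\text{-}IDF} = tf_{ik} \cdot idf_k = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}$$

$$R(q,D) = \sum_{i \in Q} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)} \cdot \frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}$$

$$K = k_1\left((1-b) + b \cdot \frac{dl}{avgdl}\right)$$

$$R_{bm25}(q,D) = \frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}$$

$$R(q,D) = \sum_{i \in q} \Bigl( \sum_{f \in F} w_f\, m_{if} \Bigr) \cdot \frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}$$

[Slide annotations label the IDF part and the TF part of two of the formulas.]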
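The last formula puts the weighted field sum where BM25 has its IDF-like factor and keeps the BM25 frequency factors. The sketch below is one possible reading, not the author's implementation. It assumes m_if is a binary match indicator, f_i counts the term across all fields, dl is the total token count of the document, and the parameter defaults are commonly quoted BM25 values. All names and data are hypothetical.

```python
def combined_score(query, doc_fields, weights, avgdl, k1=1.2, k2=100.0, b=0.75):
    """Weighted field matches multiplied by the BM25 frequency factors."""
    tokens_by_field = {f: text.lower().split() for f, text in doc_fields.items()}
    all_tokens = [t for toks in tokens_by_field.values() for t in toks]
    dl = len(all_tokens)  # assumed: document length over all fields
    K = k1 * ((1 - b) + b * dl / avgdl)
    score = 0.0
    for term in set(query):
        f_i = all_tokens.count(term)  # assumed: frequency over all fields
        qf_i = query.count(term)
        # Sum of w_f over the fields in which the term matches (m_if = 1).
        field_part = sum(w_f for f, w_f in weights.items()
                         if term in tokens_by_field.get(f, []))
        tf_part = (k1 + 1) * f_i / (K + f_i)
        qtf_part = (k2 + 1) * qf_i / (k2 + qf_i)
        score += field_part * tf_part * qtf_part
    return score

weights = {"title": 3.0, "text": 1.0}  # hypothetical w_f values
doc = {"title": "buzz lightyear", "text": "toy story"}
print(combined_score(["buzz"], doc, weights, avgdl=4.0))
print(combined_score(["woody"], doc, weights, avgdl=4.0))
```

If the weights w_f were fitted from search logs instead of fixed by hand, they would take over the role the relevance counts play in the probabilistic weight.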

Page 14: Probabilistic Retrieval

Other Considerations
• This is not a formal model
• Requires user relevance feedback (search log)
• Harder to handle real-time search queries
• How to prevent love/hate attacks

Page 15: Probabilistic Retrieval

Thank you