Modeling (Chap. 2) - Modern Information Retrieval - Spring 2000
TRANSCRIPT
Introduction
- Traditional IR systems adopt index terms to index and retrieve documents.
- An index term is simply any word that appears in the text of the documents in the collection.
- Retrieval based on index terms is simple: the premise is that the semantics of the documents and of the user information need can be expressed through sets of index terms.

Key Question
- The semantics in the document (and in the user request) is lost when the text is replaced with a set of words.
- Matching between documents and the user request is done in the very imprecise space of index terms (low-quality retrieval).
- The problem is worsened for users with no training in properly forming queries (a cause of the frequent dissatisfaction of Web users with the answers obtained).
Taxonomy of IR Models
Three classic models:
- Boolean: documents and queries represented as sets of index terms
- Vector: documents and queries represented as vectors in a t-dimensional space
- Probabilistic: document and query representations based on probability theory
Basic Concepts
- The classic models consider that each document is described by a set of index terms.
- An index term is a (document) word that helps in remembering the document's main themes; index terms are used to index and summarize the document content.
- In general, index terms are nouns (because nouns carry meaning by themselves), but index terms may also be taken to be all distinct words in a document collection.
- Distinct index terms have varying relevance when describing document contents; thus, numerical weights are assigned to each index term of a document.
- Let ki be an index term, dj a document, and wi,j ≥ 0 the weight for the pair (ki, dj). The weight quantifies the importance of the index term for describing the document's semantic contents.
Definition (p. 25)
- Let t be the number of index terms in the system and ki a generic index term.
- K = {k1, …, kt} is the set of all index terms.
- A weight wi,j > 0 is associated with each index term ki that appears in document dj; for an index term that does not appear in the document text, wi,j = 0.
- Document dj is associated with an index term vector dj = (w1,j, w2,j, …, wt,j).
Boolean Model
- Simple retrieval model based on set theory and Boolean algebra.
- The framework is easy to grasp by users (the concept of a set is intuitive).
- Queries are specified as Boolean expressions, which have precise semantics.

Drawbacks
- The retrieval strategy is a binary decision (a document is either relevant or non-relevant), which prevents good retrieval performance.
- It is not simple to translate an information need into a Boolean expression (difficult and awkward to express).
- Despite these drawbacks, it was the dominant model with commercial DB systems.
Boolean Model (Cont.)
- Considers that index terms are either present or absent in a document; index term weights are therefore binary, i.e. wi,j ∈ {0,1}.
- A query q is composed of index terms linked by the connectives not, and, or.
- A query is thus a Boolean expression, which can be represented in disjunctive normal form (DNF).

Boolean Model (Cont.)
- The query [q = ka ∧ (kb ∨ ¬kc)] can be written in DNF as [qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)].
- Each component is a binary weighted vector associated with the tuple (ka, kb, kc).
- These binary weighted vectors are called the conjunctive components of qdnf.
Boolean Model (Cont.)
- Index term weight variables are all binary, i.e. wi,j ∈ {0,1}.
- A query q is a conventional Boolean expression; let qdnf be the DNF of query q, and let qcc be any of the conjunctive components of qdnf.
- The similarity of document dj to query q is
  sim(dj,q) = 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc)), where gi(dj) = wi,j
  sim(dj,q) = 0 otherwise
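This matching rule can be sketched in Python. The term names and the hand-built DNF below are illustrative only, encoding the example query q = ka ∧ (kb ∨ ¬kc):

```python
# Sketch of the Boolean model's matching rule: a document matches the query
# iff its binary weight pattern over (ka, kb, kc) equals some conjunctive
# component of the query's DNF. Terms and documents here are invented.
TERMS = ("ka", "kb", "kc")

# DNF of q = ka AND (kb OR NOT kc): (1,1,1) OR (1,1,0) OR (1,0,0)
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_terms, q_dnf=Q_DNF):
    """Return 1 if the document's binary term vector equals a conjunctive
    component of the query DNF, else 0 (no partial matching)."""
    g = tuple(1 if t in doc_terms else 0 for t in TERMS)
    return 1 if g in q_dnf else 0

print(sim({"ka", "kb"}))  # vector (1,1,0) is a component -> 1
print(sim({"kb", "kc"}))  # vector (0,1,1) is not -> 0
```

Note that the answer is strictly 0 or 1: a document containing two of the three query terms in a non-matching pattern scores exactly the same as one containing none.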
Boolean Model (Cont.)
- If sim(dj,q) = 1, the Boolean model predicts that document dj is relevant to query q (it might not be).
- Otherwise, the prediction is that the document is not relevant.
- The Boolean model predicts that each document is either relevant or non-relevant; there is no notion of partial match.
- Main advantages: clean formalism, simplicity.
- Main disadvantages: exact matching may lead to retrieval of too few or too many documents; index term weighting can lead to improvement in retrieval performance, but the model does not support it.
Vector Model
- Assigns non-binary weights to index terms in queries and documents.
- Term weights are used to compute the degree of similarity between each document and the user query.
- By sorting retrieved documents in decreasing order of their degree of similarity, the vector model takes partially matched documents into account; the ranked document answer set is a lot more precise than the answer set retrieved by the Boolean model.
Vector Model (Cont.)
- The weight wi,j for the pair (ki, dj) is positive and non-binary.
- Index terms in the query are also weighted: let wi,q ≥ 0 be the weight associated with the pair [ki, q].
- The query vector is defined as q = (w1,q, w2,q, …, wt,q), where t is the total number of index terms in the system.
- The vector for document dj is represented by dj = (w1,j, w2,j, …, wt,j).
Vector Model (Cont.)
- Document dj and user query q are represented as t-dimensional vectors.
- The degree of similarity of dj with regard to q is evaluated as the correlation between the vectors dj and q.
- This correlation can be quantified by the cosine of the angle between the two vectors:
  sim(dj,q) = (dj • q) / (|dj| × |q|)
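A minimal sketch of this cosine correlation, using toy weight vectors that are not from the chapter:

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document and query weight vectors:
    sim(dj, q) = (dj . q) / (|dj| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # a vector with no terms matches nothing
    return dot / (norm_d * norm_q)

# Vectors pointing in the same direction give similarity ~1;
# vectors sharing no terms give 0.
print(cosine_sim([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ~1.0
print(cosine_sim([1.0, 0.0, 0.0], [0.0, 3.0, 0.0]))  # 0.0
```

Since the weights wi,j are non-negative, the cosine never goes below 0, which is why sim(q,dj) ranges over [0, 1] rather than [-1, 1].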
Vector Model (Cont.)
- sim(q,dj) varies from 0 to +1.
- The model ranks documents according to their degree of similarity to the query; a document may be retrieved even if it only partially matches the query.
- One can establish a threshold on sim(dj,q) and retrieve only the documents with a degree of similarity above that threshold.
Index term weights
- View the documents as a collection C of objects and the user query as a specification of a set A of objects; the IR problem is then to determine which documents are in set A and which are not (i.e. a clustering problem).
- A clustering problem has two issues: intra-cluster similarity (which features better describe the objects in set A) and inter-cluster dissimilarity (which features better distinguish the objects in set A from the remaining objects in collection C).
- In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj (the tf factor): how well the term describes the document contents.
- Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of the term ki among the documents in the collection (the idf factor): terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one.
Definition (p. 29)
- Let N be the total number of documents in the system and ni the number of documents in which the index term ki appears.
- Let freqi,j be the raw frequency of term ki in document dj, i.e. the number of times the term ki is mentioned in the text of document dj.
- The normalized frequency fi,j of term ki in dj is
  fi,j = freqi,j / maxl freql,j
  where the maximum is computed over all terms mentioned in the text of document dj; if term ki does not appear in document dj then fi,j = 0.
- Let idfi, the inverse document frequency for ki, be idfi = log(N/ni).
- The best known term-weighting scheme is wi,j = fi,j × log(N/ni).
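The tf-idf scheme just defined can be sketched as follows; the tiny corpus is invented purely for illustration:

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """For each document, compute wi,j = f_{i,j} * log(N / n_i), where
    f_{i,j} is the raw frequency of the term normalized by the frequency
    of the most frequent term in that document."""
    N = len(docs)
    n = Counter()            # n_i: number of documents containing each term
    for doc in docs:
        n.update(set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values())
        weights.append({t: (freq[t] / max_freq) * math.log(N / n[t])
                        for t in freq})
    return weights

docs = [["cat", "cat", "dog"], ["dog", "bird"], ["cat", "bird", "bird"]]
w = tf_idf_weights(docs)
# "cat" occurs in 2 of 3 docs, so its idf is log(3/2); a term occurring
# in every document would get idf = log(1) = 0, i.e. weight 0.
print(w[0])
```

The normalization by the most frequent term keeps long documents from dominating purely because of their length.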
Advantages of vector model
- The term-weighting scheme improves retrieval performance.
- It retrieves documents that only approximate the query conditions.
- It sorts documents according to their degree of similarity to the query.
Disadvantage
- Index terms are assumed to be mutually independent.
Probabilistic Model
- Given a user query, there is a set of documents containing exactly the relevant documents and no others: the ideal answer set.
- Given a description of this ideal answer set, there would be no problem in retrieving its documents.
- The querying process is thus a process of specifying the properties of the ideal answer set; these properties are not exactly known, but there are index terms whose semantics can be used to characterize them.
Probabilistic Model (Cont.)
- These properties are not known at query time, so an effort has to be made to initially guess what they are.
- This initial guess generates a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents.
- User interaction is then initiated to improve the probabilistic description: the user examines the retrieved documents and decides which ones are relevant, and this information is used to refine the description of the ideal answer set.
- By repeating this process, the description evolves and gets closer to the ideal answer set.
Fundamental Assumption
- Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find document dj relevant.
- It assumes that this probability of relevance depends on the query and document representations only.
- It assumes that there is a subset of all documents which the user prefers as the answer set for query q; this ideal answer set is labeled R, and the documents in R are predicted to be relevant to the query.
- Given a query q, the probabilistic model assigns to each document dj, as its measure of similarity to the query, the ratio P(dj relevant-to q) / P(dj non-relevant-to q): the odds of document dj being relevant to query q.
Probabilistic Model (Cont.)
- Index term weight variables are all binary, i.e. wi,j ∈ {0,1} and wi,q ∈ {0,1}.
- A query q is a subset of index terms.
- Let R be the set of documents known (or initially guessed) to be relevant, and let R̄ be the complement of R.
- Let P(R|dj) be the probability that document dj is relevant to query q, and P(R̄|dj) the probability that dj is not relevant to query q.
- The similarity sim(dj,q) of document dj to query q is the ratio
  sim(dj,q) = P(R|dj) / P(R̄|dj)
- Applying Bayes' rule and discarding factors that are constant for all documents under a given query,
  sim(dj,q) ~ P(dj|R) / P(dj|R̄)
- Assuming independence of index terms, this yields
  sim(dj,q) ~ Σ (i = 1 to t) wi,q × wi,j × ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|R̄)) / P(ki|R̄) ] )
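This retrieval status value can be sketched directly. The binary vectors and the per-term probabilities below are invented for illustration:

```python
import math

def prob_sim(d, q, p_rel, p_nonrel):
    """sim(dj,q) ~ sum over terms of wi,q * wi,j *
    (log p/(1-p) + log (1-u)/u), with p = P(ki|R) and u = P(ki|R_bar).
    d, q are binary weight lists; p_rel, p_nonrel are per-term probabilities."""
    s = 0.0
    for w_j, w_q, p, u in zip(d, q, p_rel, p_nonrel):
        if w_j and w_q:  # only terms present in both query and document count
            s += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return s

# With P(ki|R) = 0.5 the first log term vanishes, so rarer terms in the
# collection (smaller P(ki|R_bar)) contribute more to the score.
d = [1, 1, 0]
q = [1, 1, 1]
print(prob_sim(d, q, p_rel=[0.5, 0.5, 0.5], p_nonrel=[0.1, 0.4, 0.2]))
```

Note how the binary weights reduce the sum to the terms shared by document and query, mirroring the wi,q × wi,j factor in the formula.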
How to compute P(ki|R) and P(ki|R̄) initially?
- Assume P(ki|R) is constant for all index terms ki (typically 0.5): P(ki|R) = 0.5.
- Assume the distribution of index terms among the non-relevant documents is approximated by the distribution of index terms among all documents in the collection: P(ki|R̄) = ni/N, where ni is the number of documents containing index term ki and N is the total number of documents.
- Let V be the subset of documents initially retrieved and ranked by the model, and let Vi be the subset of V composed of the documents in V which contain the index term ki.
- P(ki|R) is approximated by the distribution of the index term ki among the documents retrieved so far: P(ki|R) = Vi / V.
- P(ki|R̄) is approximated by assuming that all non-retrieved documents are not relevant: P(ki|R̄) = (ni − Vi) / (N − V).
- (Here V and Vi also denote the number of documents in the respective sets.)
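The re-estimation step can be sketched as follows; the document IDs and counts are invented, and V and Vi are used as set sizes as in the slide:

```python
def reestimate(N, n_i, retrieved_docs, docs_with_term):
    """Update the term probabilities after one retrieval round:
      P(ki|R)     ~ Vi / V              (ki among retrieved docs)
      P(ki|R_bar) ~ (ni - Vi) / (N - V) (non-retrieved docs assumed non-relevant)
    """
    V = len(retrieved_docs)
    Vi = len(retrieved_docs & docs_with_term)
    p_rel = Vi / V
    p_nonrel = (n_i - Vi) / (N - V)
    return p_rel, p_nonrel

# Toy collection: 10 docs, term ki occurs in 4 of them; 3 docs retrieved,
# 2 of which contain ki.
retrieved = {1, 2, 3}
with_term = {2, 3, 7, 9}
print(reestimate(N=10, n_i=4, retrieved_docs=retrieved, docs_with_term=with_term))
```

Feeding these updated probabilities back into the similarity formula and re-ranking is exactly the iterative refinement of the ideal answer set described above.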
Advantages
- Documents are ranked in decreasing order of their probability of being relevant.
Disadvantages
- The initial separation of relevant and non-relevant sets must be guessed.
- All index term weights are binary.
- Index terms are assumed to be mutually independent.