Modeling (Chap. 2) - Modern Information Retrieval - Spring 2000
TRANSCRIPT
Introduction
- Traditional IR systems adopt index terms to index and retrieve documents.
- An index term is simply any word that appears in the text of the documents in the collection.
- Retrieval based on index terms is simple: the premise is that the semantics of the documents and of the user information need can be expressed through sets of index terms.

Key Question
- The semantics in the document (and in the user request) is lost when the text is replaced with a set of words.
- Matching between documents and the user request is done in the very imprecise space of index terms (low-quality retrieval).
- The problem is worsened for users with no training in properly forming queries (a cause of the frequent dissatisfaction of Web users with the answers obtained).
Taxonomy of IR Models
Three classic models:
- Boolean: documents and queries represented as sets of index terms
- Vector: documents and queries represented as vectors in a t-dimensional space
- Probabilistic: document and query representations based on probability theory
Basic Concepts
- The classic models consider that each document is described by a set of index terms.
- An index term is a (document) word that helps in remembering the document's main themes; index terms are used to index and summarize the document content.
- In general, index terms are nouns (because nouns carry meaning by themselves), but index terms may also be taken to be all distinct words in a document collection.
- Distinct index terms have varying relevance when describing document contents; thus, numerical weights are assigned to each index term of a document.
- Let ki be an index term, dj a document, and wi,j ≥ 0 the weight for the pair (ki, dj). The weight quantifies the importance of the index term for describing the document's semantic contents.
Definition (p. 25)
- Let t be the number of index terms in the system and ki a generic index term.
- K = {k1, …, kt} is the set of all index terms.
- A weight wi,j > 0 is associated with each index term ki that appears in document dj; for an index term that does not appear in the document text, wi,j = 0.
- Document dj is associated with an index term vector dj = (w1,j, w2,j, …, wt,j).
Boolean Model
- Simple retrieval model based on set theory and Boolean algebra.
- The framework is easy to grasp by users (the concept of a set is intuitive).
- Queries are specified as Boolean expressions, which have precise semantics.

Drawbacks
- The retrieval strategy is a binary decision (a document is either relevant or non-relevant), which prevents good retrieval performance.
- It is not simple to translate an information need into a Boolean expression (difficult and awkward to express).
- Despite these drawbacks, it was the dominant model with commercial DB systems.
Boolean Model (Cont.)
- Considers that index terms are either present or absent in a document; index term weights are therefore binary, i.e. wi,j ∈ {0,1}.
- A query q is composed of index terms linked by the connectives not, and, or.
- A query is thus a Boolean expression, which can be represented in disjunctive normal form (DNF).

Boolean Model (Cont.)
- The query [q = ka ∧ (kb ∨ ¬kc)] can be written in DNF as [qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)].
- Each component is a binary weighted vector associated with the tuple (ka, kb, kc).
- These binary weighted vectors are called the conjunctive components of qdnf.
Boolean Model (Cont.)
- Index term weight variables are all binary, i.e. wi,j ∈ {0,1}.
- A query q is a conventional Boolean expression; let qdnf be the DNF of query q, and let qcc be any of the conjunctive components of qdnf.
- The similarity of document dj to query q is
  sim(dj,q) = 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc)), where gi(dj) = wi,j
  sim(dj,q) = 0 otherwise
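This matching rule can be sketched in Python. The term names and the hand-built DNF below are illustrative only, encoding the example query q = ka ∧ (kb ∨ ¬kc):

```python
# Sketch of the Boolean model's matching rule: a document matches the query
# iff its binary weight pattern over (ka, kb, kc) equals some conjunctive
# component of the query's DNF. Terms and documents here are invented.
TERMS = ("ka", "kb", "kc")

# DNF of q = ka AND (kb OR NOT kc): (1,1,1) OR (1,1,0) OR (1,0,0)
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_terms, q_dnf=Q_DNF):
    """Return 1 if the document's binary term vector equals a conjunctive
    component of the query DNF, else 0 (no partial matching)."""
    g = tuple(1 if t in doc_terms else 0 for t in TERMS)
    return 1 if g in q_dnf else 0

print(sim({"ka", "kb"}))  # vector (1,1,0) is a component -> 1
print(sim({"kb", "kc"}))  # vector (0,1,1) is not -> 0
```

Note that the answer is strictly 0 or 1: a document containing two of the three query terms in a non-matching pattern scores exactly the same as one containing none.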
Boolean Model (Cont.)
- If sim(dj,q) = 1, the Boolean model predicts that document dj is relevant to query q (it might not be).
- Otherwise, the prediction is that the document is not relevant.
- The Boolean model predicts that each document is either relevant or non-relevant; there is no notion of partial match.
- Main advantages: clean formalism, simplicity.
- Main disadvantages: exact matching may lead to retrieval of too few or too many documents; index term weighting can lead to improvement in retrieval performance, but the model does not support it.
Vector Model
- Assigns non-binary weights to index terms in queries and documents.
- Term weights are used to compute the degree of similarity between each document and the user query.
- By sorting retrieved documents in decreasing order of their degree of similarity, the vector model takes partially matched documents into account; the ranked document answer set is a lot more precise than the answer set retrieved by the Boolean model.
Vector Model (Cont.)
- The weight wi,j for the pair (ki, dj) is positive and non-binary.
- Index terms in the query are also weighted: let wi,q ≥ 0 be the weight associated with the pair [ki, q].
- The query vector is defined as q = (w1,q, w2,q, …, wt,q), where t is the total number of index terms in the system.
- The vector for document dj is represented by dj = (w1,j, w2,j, …, wt,j).
Vector Model (Cont.)
- Document dj and user query q are represented as t-dimensional vectors.
- The degree of similarity of dj with regard to q is evaluated as the correlation between the vectors dj and q.
- This correlation can be quantified by the cosine of the angle between the two vectors:
  sim(dj,q) = (dj • q) / (|dj| × |q|)
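A minimal sketch of this cosine correlation, using toy weight vectors that are not from the chapter:

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document and query weight vectors:
    sim(dj, q) = (dj . q) / (|dj| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # a vector with no terms matches nothing
    return dot / (norm_d * norm_q)

# Vectors pointing in the same direction give similarity ~1;
# vectors sharing no terms give 0.
print(cosine_sim([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ~1.0
print(cosine_sim([1.0, 0.0, 0.0], [0.0, 3.0, 0.0]))  # 0.0
```

Since the weights wi,j are non-negative, the cosine never goes below 0, which is why sim(q,dj) ranges over [0, 1] rather than [-1, 1].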
Vector Model (Cont.)
- sim(q,dj) varies from 0 to +1.
- The model ranks documents according to their degree of similarity to the query; a document may be retrieved even if it only partially matches the query.
- One can establish a threshold on sim(dj,q) and retrieve only the documents with a degree of similarity above that threshold.
Index term weights
- View the documents as a collection C of objects and the user query as a specification of a set A of objects; the IR problem is then to determine which documents are in set A and which are not (i.e. a clustering problem).
- A clustering problem has two issues: intra-cluster similarity (which features better describe the objects in set A) and inter-cluster dissimilarity (which features better distinguish the objects in set A from the remaining objects in collection C).
- In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj (the tf factor): how well the term describes the document contents.
- Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of the term ki among the documents in the collection (the idf factor): terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one.
Definition (p. 29)
- Let N be the total number of documents in the system and ni the number of documents in which the index term ki appears.
- Let freqi,j be the raw frequency of term ki in document dj, i.e. the number of times the term ki is mentioned in the text of document dj.
- The normalized frequency fi,j of term ki in dj is
  fi,j = freqi,j / maxl freql,j
  where the maximum is computed over all terms mentioned in the text of document dj; if term ki does not appear in document dj then fi,j = 0.
- Let idfi, the inverse document frequency for ki, be idfi = log(N/ni).
- The best known term-weighting scheme is wi,j = fi,j × log(N/ni).
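The tf-idf scheme just defined can be sketched as follows; the tiny corpus is invented purely for illustration:

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """For each document, compute wi,j = f_{i,j} * log(N / n_i), where
    f_{i,j} is the raw frequency of the term normalized by the frequency
    of the most frequent term in that document."""
    N = len(docs)
    n = Counter()            # n_i: number of documents containing each term
    for doc in docs:
        n.update(set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values())
        weights.append({t: (freq[t] / max_freq) * math.log(N / n[t])
                        for t in freq})
    return weights

docs = [["cat", "cat", "dog"], ["dog", "bird"], ["cat", "bird", "bird"]]
w = tf_idf_weights(docs)
# "cat" occurs in 2 of 3 docs, so its idf is log(3/2); a term occurring
# in every document would get idf = log(1) = 0, i.e. weight 0.
print(w[0])
```

The normalization by the most frequent term keeps long documents from dominating purely because of their length.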
Advantages of vector model
- The term-weighting scheme improves retrieval performance.
- It retrieves documents that only approximate the query conditions.
- It sorts documents according to their degree of similarity to the query.
Disadvantage
- Index terms are assumed to be mutually independent.
Probabilistic Model
- Given a user query, there is a set of documents containing exactly the relevant documents and no others: the ideal answer set.
- Given a description of this ideal answer set, there would be no problem in retrieving its documents.
- The querying process is thus a process of specifying the properties of the ideal answer set; these properties are not exactly known, but there are index terms whose semantics can be used to characterize them.
Probabilistic Model (Cont.)
- These properties are not known at query time, so an effort has to be made to initially guess what they are.
- This initial guess generates a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents.
- User interaction is then initiated to improve the probabilistic description: the user examines the retrieved documents and decides which ones are relevant, and this information is used to refine the description of the ideal answer set.
- By repeating this process, the description evolves and gets closer to the ideal answer set.
Fundamental Assumption
- Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find document dj relevant.
- It assumes that this probability of relevance depends on the query and document representations only.
- It assumes that there is a subset of all documents which the user prefers as the answer set for query q; this ideal answer set is labeled R, and the documents in R are predicted to be relevant to the query.
- Given a query q, the probabilistic model assigns to each document dj, as its measure of similarity to the query, the ratio P(dj relevant-to q) / P(dj non-relevant-to q): the odds of document dj being relevant to query q.
Probabilistic Model (Cont.)
- Index term weight variables are all binary, i.e. wi,j ∈ {0,1} and wi,q ∈ {0,1}.
- A query q is a subset of index terms.
- Let R be the set of documents known (or initially guessed) to be relevant, and let R̄ be the complement of R.
- Let P(R|dj) be the probability that document dj is relevant to query q, and P(R̄|dj) the probability that dj is not relevant to query q.
- The similarity sim(dj,q) of document dj to query q is the ratio
  sim(dj,q) = P(R|dj) / P(R̄|dj)
- Applying Bayes' rule and discarding factors that are constant for all documents under a given query,
  sim(dj,q) ~ P(dj|R) / P(dj|R̄)
- Assuming independence of index terms, this yields
  sim(dj,q) ~ Σ (i = 1 to t) wi,q × wi,j × ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|R̄)) / P(ki|R̄) ] )
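This retrieval status value can be sketched directly. The binary vectors and the per-term probabilities below are invented for illustration:

```python
import math

def prob_sim(d, q, p_rel, p_nonrel):
    """sim(dj,q) ~ sum over terms of wi,q * wi,j *
    (log p/(1-p) + log (1-u)/u), with p = P(ki|R) and u = P(ki|R_bar).
    d, q are binary weight lists; p_rel, p_nonrel are per-term probabilities."""
    s = 0.0
    for w_j, w_q, p, u in zip(d, q, p_rel, p_nonrel):
        if w_j and w_q:  # only terms present in both query and document count
            s += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return s

# With P(ki|R) = 0.5 the first log term vanishes, so rarer terms in the
# collection (smaller P(ki|R_bar)) contribute more to the score.
d = [1, 1, 0]
q = [1, 1, 1]
print(prob_sim(d, q, p_rel=[0.5, 0.5, 0.5], p_nonrel=[0.1, 0.4, 0.2]))
```

Note how the binary weights reduce the sum to the terms shared by document and query, mirroring the wi,q × wi,j factor in the formula.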
How to compute P(ki|R) and P(ki|R̄) initially?
- Assume P(ki|R) is constant for all index terms ki (typically 0.5): P(ki|R) = 0.5.
- Assume the distribution of index terms among the non-relevant documents is approximated by the distribution of index terms among all documents in the collection: P(ki|R̄) = ni/N, where ni is the number of documents containing index term ki and N is the total number of documents.
- Let V be the subset of documents initially retrieved and ranked by the model, and let Vi be the subset of V composed of the documents in V which contain the index term ki.
- P(ki|R) is approximated by the distribution of the index term ki among the documents retrieved so far: P(ki|R) = Vi / V.
- P(ki|R̄) is approximated by assuming that all non-retrieved documents are not relevant: P(ki|R̄) = (ni − Vi) / (N − V).
- (Here V and Vi also denote the number of documents in the respective sets.)
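The re-estimation step can be sketched as follows; the document IDs and counts are invented, and V and Vi are used as set sizes as in the slide:

```python
def reestimate(N, n_i, retrieved_docs, docs_with_term):
    """Update the term probabilities after one retrieval round:
      P(ki|R)     ~ Vi / V              (ki among retrieved docs)
      P(ki|R_bar) ~ (ni - Vi) / (N - V) (non-retrieved docs assumed non-relevant)
    """
    V = len(retrieved_docs)
    Vi = len(retrieved_docs & docs_with_term)
    p_rel = Vi / V
    p_nonrel = (n_i - Vi) / (N - V)
    return p_rel, p_nonrel

# Toy collection: 10 docs, term ki occurs in 4 of them; 3 docs retrieved,
# 2 of which contain ki.
retrieved = {1, 2, 3}
with_term = {2, 3, 7, 9}
print(reestimate(N=10, n_i=4, retrieved_docs=retrieved, docs_with_term=with_term))
```

Feeding these updated probabilities back into the similarity formula and re-ranking is exactly the iterative refinement of the ideal answer set described above.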
Advantages
- Documents are ranked in decreasing order of their probability of being relevant.
Disadvantages
- The initial separation of relevant and non-relevant sets must be guessed.
- All index term weights are binary.
- Index terms are assumed to be mutually independent.