Retrieval Models and Ranking Systems
CSC 575
Intelligent Information Retrieval
Intelligent Information Retrieval 2
Retrieval Models
• A model is an idealization or abstraction of an actual process
  - in this case, the process is the matching of documents with queries, i.e., retrieval
• Mathematical models are used to study the properties of the process, draw conclusions, and make predictions
  - conclusions derived from a model depend on whether the model is a good approximation of the actual situation
• Retrieval models can describe the computational process
  - e.g., how documents are ranked
  - note that an inverted file is an implementation, not a model
• Retrieval variables: queries, documents, terms, relevance judgements, users, information needs
• Retrieval models have an explicit or implicit definition of relevance
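
[Figure: overview of the retrieval process. Labeled components: Information need, Query (text input), Collections, Parse, Pre-process (lexical analysis and stop words), Index, Rank, Result sets. Callouts: "How is the index constructed?" and "How is the matching and scoring done?"]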

Retrieval Models
• Customary to distinguish between exact-match and best-match retrieval
• Exact-match
  - query specifies precise retrieval criteria
  - every document either matches or fails to match the query
  - result is a set of documents
• Best-match
  - query describes a good or “best” matching document
  - result is a ranked list of documents
  - result may include an estimate of quality
• Best-match models: better retrieval effectiveness
  - good documents appear at the top of the ranking
  - but efficiency is better in exact match (e.g., Boolean)

Ranking Algorithms
• Assign weights to the terms in the query
• Assign weights to the terms in the documents
• Compare the weighted query terms to the weighted document terms
  - Boolean matching (exact match)
  - simple (coordinate level) matching
  - cosine similarity
  - other similarity measures (Dice, Jaccard, overlap, etc.)
  - extended Boolean models
  - probabilistic models
• Rank order the results
  - pure Boolean has no ordering

Boolean Retrieval
• Boolean retrieval is the most common exact-match model
  - queries are logic expressions with document features as operands
  - retrieved documents are generally not ranked
  - query formulation is difficult for novice users
• “Pure” Boolean operators: AND, OR, NOT
• Most systems have proximity operators
• Most systems support simple regular expressions as search terms to match spelling variants

Boolean Logic
• AND and OR in a Boolean query represent the intersection and union of the corresponding document sets, respectively
• NOT represents the complement of the corresponding set
• De Morgan's Law:
$$\overline{A \cap B} = \overline{A} \cup \overline{B}, \qquad \overline{A \cup B} = \overline{A} \cap \overline{B}$$
[Figure: Venn diagrams of sets A and B and the resulting set C]

Boolean Queries
• Boolean queries are Boolean combinations of terms
  - Cat
  - Cat OR Dog
  - Cat AND Dog
  - (Cat AND Dog) OR Collar
  - (Cat AND Dog) OR (Collar AND Leash)
  - (Cat OR Dog) AND (Collar OR Leash)
• (Cat OR Dog) AND (Collar OR Leash)
  - Each of the following combinations works:

Cat     x x x x x x
Dog     x x x x x
Collar  x x x
Leash   x x x x x x

Boolean Matching
[Figure: Venn diagram of terms t1, t2, and t3 with documents D1-D11 placed in its regions. The regions are labeled m1-m8; each m_i is a conjunction of t1, t2, and t3 or their negations.]
Hit list for the query t1 AND t2
{D1, D3, D5, D9, D10, D11} ∩ {D1, D2, D4, D5, D6} = {D1, D5}
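
The hit list above is plain set intersection, so it can be checked with a few lines of Python. This is a minimal sketch: the `postings` dictionary and the variable names are illustrative assumptions, and only the two document sets shown on the slide are used.

```python
# Document sets for t1 and t2, copied from the hit-list example above.
# The dictionary name "postings" is an illustrative choice, not from the slides.
postings = {
    "t1": {"D1", "D3", "D5", "D9", "D10", "D11"},
    "t2": {"D1", "D2", "D4", "D5", "D6"},
}
# The figure shows documents D1 through D11.
all_docs = {f"D{i}" for i in range(1, 12)}

# AND is intersection, OR is union, NOT is complement within the collection.
hits_and = postings["t1"] & postings["t2"]
hits_or = postings["t1"] | postings["t2"]
hits_not_t1 = all_docs - postings["t1"]

print(sorted(hits_and))  # ['D1', 'D5']
```

Because the result is an unordered set, nothing in this computation says which of D1 and D5 should be shown first, which is the "no ordering" property of pure Boolean retrieval.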

Pseudo-Boolean Queries
• A new notation, from web search
  - +cat dog +collar leash
• Does not mean the same thing!
• Need a way to group combinations
• Phrases:
  - “stray cat” AND “frayed collar”
  - +“stray cat” +“frayed collar”

Faceted Boolean Query
• Strategy: break query into facets
  - conjunction of disjunctions (conjunctive normal form)
      (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4)
  - each facet expresses a topic or concept
      (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (research OR development)

Faceted Boolean Query
• Query still fails if one facet is missing
• Alternative: a form of coordination-level ranking
  - order results in terms of how many facets (disjuncts) are satisfied
  - also called quorum ranking
• Problem: facets are still undifferentiated
  - alternative: assign weights to facets
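
Quorum ranking can be sketched as counting, for each document, how many facets have at least one matching term. The facets below are the ones from the faceted query example; the four documents and all function names are hypothetical and exist only to illustrate the idea.

```python
# Facets from the faceted query example; each facet is a disjunction of terms.
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"research", "development"},
]

# Hypothetical documents, each represented as a set of index terms or phrases.
docs = {
    "doc_a": {"jungle", "cure", "research"},
    "doc_b": {"amazon", "remedy"},
    "doc_c": {"development"},
    "doc_d": {"football"},
}

def facets_satisfied(doc_terms, facets):
    """Count the facets that share at least one term with the document."""
    return sum(1 for facet in facets if facet & doc_terms)

def quorum_rank(docs, facets):
    """Return (count, name) pairs, most satisfied facets first; drop zero matches."""
    scored = [(facets_satisfied(terms, facets), name) for name, terms in docs.items()]
    return sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))

ranking = quorum_rank(docs, facets)
# Strict faceted Boolean (AND of all facets) keeps only documents matching every facet.
strict_hits = [name for count, name in ranking if count == len(facets)]
print(ranking)
```

In this sketch the strict Boolean query returns only `doc_a`, while quorum ranking also surfaces `doc_b` and `doc_c` lower in the list. Weighting facets would amount to replacing the count with a weighted sum.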

Boolean Model
• Advantages
  - simple queries are easy to understand
  - relatively easy to implement
  - structured queries
  - queries can be automatically translated into CNF or DNF
• Disadvantages
  - difficult to specify what is wanted
  - too much returned, or too little (acceptable precision generally means unacceptable recall)
  - ordering not well determined
  - query formulation difficult for novice users
• Dominant language in commercial systems until the WWW

Vector Space Model (revisited)
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
  - a vector is an array of floating-point values (or binary values in the case of bit maps)
  - has direction and magnitude
  - each vector has a place for every term in the collection (most are sparse)
nova galaxy heat actor film role
A 1.0 0.5 0.3
B 0.5 1.0
C 1.0 0.8 0.7
D 0.9 1.0 0.5
E 1.0 1.0
F 0.7
G 0.5 0.7 0.9
H 0.6 1.0 0.3 0.2
I 0.7 0.5 0.3
Rows A-I are document IDs; each row is a document vector.

$$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$$
$$Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t})$$
$$w = 0 \text{ if a term is absent}$$

Documents & Query in n-dimensional Space
• Documents are represented as vectors in term space
  - terms are usually stems
  - documents represented by binary vectors of terms
• Queries are represented the same as documents
• Query and document weights are based on the length and direction of their vectors
• A vector distance measure between the query and documents is used to rank retrieved documents

The Notion of “Similarity” in IR
• The notion of similarity is central to many aspects of information retrieval and filtering:
  - measuring similarity of the query to documents is the primary factor in determining what is returned (and how the results are ranked)
  - similarity measures can also be used in clustering documents (i.e., grouping together documents with similar content)
  - the same similarity measures can also be used to group together related terms (based on their occurrence patterns across documents in the collection)

Vector-Based Similarity Measures
• Simple Matching and Cosine Similarity, for two vectors
$$X = \langle x_1, x_2, \ldots, x_n \rangle, \qquad Y = \langle y_1, y_2, \ldots, y_n \rangle$$
  - Simple matching = dot product of two vectors
$$sim(X, Y) = X \cdot Y = \sum_i x_i y_i$$
  - Cosine similarity = normalized dot product
  - the norm of a vector X is:
$$\|X\| = \sqrt{\sum_i x_i^2}$$
  - the cosine similarity of vectors X and Y is:
$$sim(X, Y) = \frac{X \cdot Y}{\|X\| \times \|Y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \times \sqrt{\sum_i y_i^2}}$$
In other words, divide the dot product by the norms of the two vectors

Vector-Based Similarity Measures
• Why divide by the norm? Recall that for $X = \langle x_1, x_2, \ldots, x_n \rangle$ the norm is $\|X\| = \sqrt{\sum_i x_i^2}$
  - Example:
    · X = <2, 0, 3, 2, 1, 4>
    · ||X|| = SQRT(4+0+9+4+1+16) = 5.83
    · X* = X / ||X|| = <0.343, 0, 0.514, 0.343, 0.171, 0.686>
  - Now, note that ||X*|| = 1
  - So, dividing a vector by its norm turns it into a unit-length vector
  - Cosine similarity measures the angle between two unit-length vectors (i.e., the magnitudes of the vectors are ignored)
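
The normalization example can be reproduced in a few lines. This is a minimal sketch using only the standard library; the function names `norm`, `unit`, and `cosine` are illustrative, and zero vectors (which would cause a division by zero) are not handled.

```python
import math

def norm(x):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in x))

def unit(x):
    """Scale a non-zero vector to unit length by dividing by its norm."""
    n = norm(x)
    return [v / n for v in x]

def cosine(x, y):
    """Dot product divided by the product of the two norms."""
    return sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))

X = [2, 0, 3, 2, 1, 4]
X_unit = unit(X)

print(round(norm(X), 2))       # 5.83
print(round(norm(X_unit), 6))  # 1.0
```

Scaling a vector does not change its cosine with another vector: `cosine(X, [10 * v for v in X])` is 1 up to floating-point error, which is the sense in which magnitude is ignored.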

Computing a similarity score: 2D Example
Say we have query vector $Q = (0.4, 0.8)$
Also, document $D_2 = (0.2, 0.7)$
What does their similarity comparison yield?
$$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} \approx 0.98$$
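
Computing Similarity Scores
[Figure: 2D plot, both axes running from 0 to 1.0, showing the vectors Q, D_1, and D_2 and the angles between Q and each document]
$$Q = (0.4, 0.8), \qquad D_1 = (0.8, 0.3), \qquad D_2 = (0.2, 0.7)$$
Cosine of the angle between Q and D_1: 0.73
Cosine of the angle between Q and D_2: 0.98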

Other Vector Space Similarity Measures

Simple Matching:
$$sim(Q, D) = \sum_{j=1}^{t} w_{q_j} w_{d_j}$$

Cosine Coefficient:
$$sim(Q, D) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_j}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \times \sum_{j=1}^{t} (w_{d_j})^2}}$$

Dice’s Coefficient:
$$sim(Q, D) = \frac{2 \sum_{j=1}^{t} w_{q_j} w_{d_j}}{\sum_{j=1}^{t} (w_{q_j})^2 + \sum_{j=1}^{t} (w_{d_j})^2}$$

Jaccard’s Coefficient:
$$sim(Q, D) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_j}}{\sum_{j=1}^{t} (w_{q_j})^2 + \sum_{j=1}^{t} (w_{d_j})^2 - \sum_{j=1}^{t} w_{q_j} w_{d_j}}$$

Vector Space Similarity Measures
• Again consider the following two document vectors and the query vector:
    D1 = (0.8, 0.3)
    D2 = (0.2, 0.7)
    Q = (0.4, 0.8)
• Computing similarity using Jaccard’s Coefficient:
$$sim(Q, D_1) = \frac{(0.4 \times 0.8) + (0.8 \times 0.3)}{[(0.4)^2 + (0.8)^2] + [(0.8)^2 + (0.3)^2] - 0.56} = \frac{0.56}{0.97} = 0.58$$
$$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{[(0.4)^2 + (0.8)^2] + [(0.2)^2 + (0.7)^2] - 0.64} = \frac{0.64}{0.69} = 0.93$$
• Computing similarity using Dice’s Coefficient:
    sim(Q, D1) = 0.73    sim(Q, D2) = 0.96
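
The Dice and Jaccard values above can be checked directly from the coefficient definitions. The sketch below is a straightforward transcription of those formulas into Python; the function names are illustrative, and vectors are assumed to be non-zero and of equal length.

```python
import math

def dot(q, d):
    """Simple matching: sum of products of corresponding weights."""
    return sum(a * b for a, b in zip(q, d))

def cosine(q, d):
    return dot(q, d) / math.sqrt(dot(q, q) * dot(d, d))

def dice(q, d):
    return 2 * dot(q, d) / (dot(q, q) + dot(d, d))

def jaccard(q, d):
    return dot(q, d) / (dot(q, q) + dot(d, d) - dot(q, d))

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

for name, d in (("D1", D1), ("D2", D2)):
    print(name, round(jaccard(Q, d), 2), round(dice(Q, d), 2), round(cosine(Q, d), 2))
```

All three measures rank D2 above D1 for this query; they differ only in how the dot product is normalized.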

Vector Space Similarity Measures: Example

docs  t1  t2  t3  Q·Di  |Di|^2  Dice  Jaccard  Cosine
D1     2   0   3    2     13    0.22   0.13     0.25
D2     1   0   0    1      1    0.33   0.20     0.45
D3     0   2   1    4      5    0.80   0.67     0.80
D4     4   0   0    4     16    0.38   0.24     0.45
D5     1   2   3    5     14    0.53   0.36     0.60
D6     2   1   0    4      5    0.80   0.67     0.80
D7     0   3   1    6     10    0.80   0.67     0.85
D8     0   1   0    2      1    0.67   0.50     0.89
D9     2   0   1    2      5    0.40   0.25     0.40
D10    0   4   2    8     20    0.64   0.47     0.80
D11    6   1   0    8     37    0.38   0.24     0.59
Q      1   2   0    5      5

Vector Space Similarity Measures: Example

Ranking Using Jaccard    Ranking Using Cosine
D3   0.67                D8   0.89
D6   0.67                D7   0.85
D7   0.67                D10  0.80
D8   0.50                D3   0.80
D10  0.47                D6   0.80
D5   0.36                D5   0.60
D9   0.25                D11  0.59
D11  0.24                D2   0.45
D4   0.24                D4   0.45
D2   0.20                D9   0.40
D1   0.13                D1   0.25
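
The two rankings can be regenerated from the term-weight table. This is a sketch under one stated assumption: the t3 weight of D11 is taken to be 0, which is the value consistent with the Q·Di and squared-norm entries (8 and 37) in the table. Variable and function names are illustrative.

```python
import math

# Term weights (t1, t2, t3) from the example table.
# Assumption: D11's t3 weight is 0, consistent with Q.D11 = 8 and |D11|^2 = 37.
docs = {
    "D1": (2, 0, 3), "D2": (1, 0, 0), "D3": (0, 2, 1), "D4": (4, 0, 0),
    "D5": (1, 2, 3), "D6": (2, 1, 0), "D7": (0, 3, 1), "D8": (0, 1, 0),
    "D9": (2, 0, 1), "D10": (0, 4, 2), "D11": (6, 1, 0),
}
Q = (1, 2, 0)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def jaccard(q, d):
    return dot(q, d) / (dot(q, q) + dot(d, d) - dot(q, d))

def cosine(q, d):
    return dot(q, d) / math.sqrt(dot(q, q) * dot(d, d))

def rank(measure):
    """Document names sorted by decreasing similarity to Q (ties keep table order)."""
    return sorted(docs, key=lambda name: -measure(Q, docs[name]))

jaccard_rank = rank(jaccard)
cosine_rank = rank(cosine)

for name in cosine_rank:
    print(name, round(cosine(Q, docs[name]), 2))
```

The two measures disagree noticeably: D8, a short document containing only the heavier query term, is first under cosine but fourth under Jaccard, while D11 sits just below D5 under cosine.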

Probabilistic Models
• Attempt to be more theoretically sound than the vector space model
  - try to predict the probability of a document’s being relevant, given the query
  - there are many variations
  - usually more complicated to compute than the vector space model
  - usually many approximations are required
• Relevance information is required from a random sample of documents and queries (training examples)
• Work about as well as (sometimes better than) vector space approaches

Basic Probabilistic Retrieval
• Retrieval is modeled as a classification process
• Two classes for each query: the relevant and non-relevant documents (with respect to a given query)
  - could easily be extended to three classes (i.e., add a "don’t care" class)
• Given a particular document D, calculate the probability of belonging to the relevant class
  - retrieve if greater than the probability of belonging to the non-relevant class
  - i.e., retrieve if P(R|D) > P(NR|D)
• Equivalently, rank by a discriminant value (also called likelihood ratio) P(R|D) / P(NR|D)
• Different ways of estimating these probabilities lead to different models

Basic Probabilistic Retrieval
• A given query divides the document collection into two sets: relevant and non-relevant

[Figure: a document D belongs to the set of relevant documents with probability P(R|D) and to the set of non-relevant documents with probability P(NR|D)]

  - If a document D has been selected in response to a query, retrieve the document if
      dis(D) > 1
    where
      dis(D) = P(R|D) / P(NR|D)
    is the discriminant of D
  - This criterion can be modified by weighting the two probabilities

Estimating Probabilities
• Bayes’ Rule can be used to “invert” conditional probabilities:
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
• Applying that to the discriminant function:
$$dis(D) = \frac{P(R \mid D)}{P(NR \mid D)} = \frac{P(D \mid R) \, P(R)}{P(D \mid NR) \, P(NR)}$$
• Note that P(R) is the probability that a random document is relevant to the query, and P(NR) = 1 - P(R)
    P(R) = n / N and P(NR) = 1 - P(R) = (N - n) / N
  where n = number of relevant documents, and
  N = total number of documents in the collection

Estimating Probabilities
$$dis(D) = \frac{P(R \mid D)}{P(NR \mid D)} = \frac{P(D \mid R) \, P(R)}{P(D \mid NR) \, P(NR)}$$
• Now we need to estimate P(D|R) and P(D|NR)
  - If we assume that a document is represented by terms t1, . . ., tn, and that these terms are statistically independent, then
$$P(D \mid R) = P(t_1 \mid R) \times P(t_2 \mid R) \times \cdots \times P(t_n \mid R)$$
  - and similarly we can compute P(D|NR)
  - Note that P(ti|R) is the probability that a term ti occurs in a relevant document, and it can be estimated from a previously available sample (e.g., through relevance feedback)
  - So, based on the distribution of terms in relevant and non-relevant documents, we can estimate whether the document should be retrieved (i.e., if dis(D) > 1)
  - Note that documents that are retrieved can be ranked based on the value of the discriminant

Probabilistic Retrieval - Example

      t1  t2  t3  t4  t5   Relevance to Q
D1     1   1   0   1   0   R
D2     0   1   1   0   0   R
D3     1   0   1   0   1   NR
D4     1   1   1   1   0   NR
D5     0   1   0   1   0   NR
D6     0   0   0   1   1   R
D7     0   1   0   0   0   NR
D8     1   1   0   1   0   NR
D9     0   0   1   1   1   R
D10    1   0   1   0   1   NR
D      1   0   0   1   1   (new document)

Term  P(t|R)  P(t|NR)
t1    1/4     4/6
t2    2/4     4/6
t3    2/4     3/6
t4    3/4     3/6
t5    2/4     2/6

With 4 of the 10 sample documents relevant, P(R) = 0.40 and P(NR) = 0.60. D contains t1, t4, and t5, so
$$dis(D) = \frac{P(D \mid R) \, P(R)}{P(D \mid NR) \, P(NR)} = \frac{P(t_1 \mid R) \, P(t_4 \mid R) \, P(t_5 \mid R) \, P(R)}{P(t_1 \mid NR) \, P(t_4 \mid NR) \, P(t_5 \mid NR) \, P(NR)}$$
$$dis(D) = \frac{0.25 \times 0.75 \times 0.50 \times 0.40}{0.67 \times 0.50 \times 0.33 \times 0.60} \approx 0.56$$
Since the discriminant is less than one, document D should not be retrieved
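
The term probabilities and the discriminant in this example can be derived mechanically from the training table. The sketch below uses exact fractions to avoid rounding; the data is copied from the table, while the names `training`, `term_probs`, and `discriminant` are illustrative. As on the slide, only the terms present in D contribute to the product.

```python
from fractions import Fraction

# (t1..t5 incidence, relevance label) for D1..D10, copied from the example table.
training = [
    ((1, 1, 0, 1, 0), "R"),  ((0, 1, 1, 0, 0), "R"),
    ((1, 0, 1, 0, 1), "NR"), ((1, 1, 1, 1, 0), "NR"),
    ((0, 1, 0, 1, 0), "NR"), ((0, 0, 0, 1, 1), "R"),
    ((0, 1, 0, 0, 0), "NR"), ((1, 1, 0, 1, 0), "NR"),
    ((0, 0, 1, 1, 1), "R"),  ((1, 0, 1, 0, 1), "NR"),
]

def term_probs(label):
    """P(t_j | label): fraction of documents with that label containing term j."""
    rows = [vec for vec, lab in training if lab == label]
    return [Fraction(sum(r[j] for r in rows), len(rows)) for j in range(5)]

p_rel = term_probs("R")
p_nonrel = term_probs("NR")
prior_rel = Fraction(sum(1 for _, lab in training if lab == "R"), len(training))
prior_nonrel = 1 - prior_rel

def discriminant(doc):
    """dis(D) using only the terms present in D, under term independence."""
    ratio = prior_rel / prior_nonrel
    for j, present in enumerate(doc):
        if present:
            ratio *= p_rel[j] / p_nonrel[j]
    return ratio

D = (1, 0, 0, 1, 1)
dis_D = discriminant(D)
print(dis_D, float(dis_D))  # 9/16 0.5625
```

With exact fractions the discriminant is 9/16, which is below 1, so D would not be retrieved. A term that never occurs in the non-relevant sample would make `p_nonrel[j]` zero; real systems smooth these estimates, which this sketch does not attempt.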

Probabilistic Retrieval (cont.)

Term  P(t|R)  P(t|NR)
t1    1/4     4/6
t2    2/4     4/6
t3    2/4     3/6
t4    3/4     3/6
t5    2/4     2/6

• In practice, can’t build a model for each query
  - Instead a general model is built based on query-document pairs in the historical (training) data
  - Then for a given query Q, the discriminant is computed based only on the conditional probabilities of the query terms
  - If query term t occurs in D, take P(t|R) and P(t|NR)
  - If query term t does not appear in D, take 1 - P(t|R) and 1 - P(t|NR)

Q = t1, t3, t4        D = t1, t4, t5

$$dis(D, Q) = \frac{P(t_1 \mid R) \times (1 - P(t_3 \mid R)) \times P(t_4 \mid R)}{P(t_1 \mid NR) \times (1 - P(t_3 \mid NR)) \times P(t_4 \mid NR)} \times \frac{P(R)}{P(NR)}$$
$$dis(D, Q) = \frac{0.25 \times (1 - 0.5) \times 0.75}{0.67 \times (1 - 0.5) \times 0.5} \times \frac{0.4}{0.6} = 0.373$$
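
The query-restricted discriminant follows the two rules stated above: use P(t|.) for query terms that occur in D and 1 - P(t|.) for query terms that do not. This is a minimal sketch with the probabilities taken from the table as exact fractions; the dictionary and function names are illustrative.

```python
from fractions import Fraction

# Term probabilities and priors from the example table.
p_rel = {"t1": Fraction(1, 4), "t2": Fraction(2, 4), "t3": Fraction(2, 4),
         "t4": Fraction(3, 4), "t5": Fraction(2, 4)}
p_nonrel = {"t1": Fraction(4, 6), "t2": Fraction(4, 6), "t3": Fraction(3, 6),
            "t4": Fraction(3, 6), "t5": Fraction(2, 6)}
prior_rel, prior_nonrel = Fraction(4, 10), Fraction(6, 10)

def discriminant(doc_terms, query_terms):
    """dis(D, Q): product over query terms only, times the prior ratio."""
    ratio = prior_rel / prior_nonrel
    for t in query_terms:
        if t in doc_terms:
            ratio *= p_rel[t] / p_nonrel[t]
        else:
            ratio *= (1 - p_rel[t]) / (1 - p_nonrel[t])
    return ratio

Q = ["t1", "t3", "t4"]
D = {"t1", "t4", "t5"}
dis_DQ = discriminant(D, Q)
print(dis_DQ, float(dis_DQ))  # 3/8 0.375
```

The exact value is 3/8 = 0.375; the slide's 0.373 comes from rounding 4/6 to 0.67 before dividing. Note that t5, which occurs in D but not in Q, plays no part in the score.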

Probabilistic Models
• Advantages
  - strong theoretical basis
  - in principle, should supply the best predictions of relevance given available information
  - can be implemented similarly to the vector model
• Disadvantages
  - relevance information is required, or is “guestimated”
  - important indicators of relevance may not be terms, though usually only terms are used
  - optimally requires ongoing collection of relevance information

Vector and Probabilistic Models
• Support “natural language” queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
  - Vector assumes relevance
  - Probabilistic relies on relevance judgments or estimates

Extended Boolean Models
• Weighted Boolean Queries
  - Weights are assigned to the operands in a Boolean query
      A_{0.6} AND B_{0.75}        A_{1.0} OR B_{0.3}
  - The weighting operation depends on the distance between the document sets for A and B
    · a weight of 1.0 says that all of the corresponding document set is considered in the operation
    · a weight of 0 < w < 1 says that only a portion of the document set is considered
    · the documents added or deleted are those that are “closest” to the current set of documents

Weighted Boolean Queries

A_{1.0} AND B_{1.0} = A ∩ B        A_{1.0} OR B_{1.0} = A ∪ B
A_{1.0} AND B_{0.0} = A            A_{1.0} OR B_{0.0} = A
A_{1.0} AND B_{0.75} = (A ∩ B) ∪ (25% of A - B)
A_{1.0} OR B_{0.75} = A ∪ (75% of B - A)

[Figure: Venn diagrams of sets A and B illustrating the two weighted cases]

Weighted Boolean Queries
• Matching Algorithm
  1. Find the initial matching set (non-weighted Boolean document set)
  2. Find the invariant document set (set of documents that are present both when the operand weight is 1.0 and when the weight is 0.0); the optional set is the remaining items
  3. Compute the centroid of the invariant set
  4. Find the number of documents, say k, from the optional set that will potentially be added to the invariant set (determined by the weight of the query term)
  5. Compute the similarity between documents in the optional set and the centroid (of the invariant set)
  6. Items to be added or deleted are the top k documents in the optional set with the highest similarity scores
Demo of Extended Boolean Query*
http://ir.exp.sis.pitt.edu/res2/data/66/
* Thanks to Michael Bombyk for discovering this applet!

Weighted Boolean Queries - Example

Query: Q1 = A_{1.0} OR B_{0.333}

      A  B  C  D
D1    0  4  0  8
D2    0  2  0  0
D3    4  0  2  4
D4    0  6  4  6
D5    0  4  6  4
D6    6  0  4  0
D7    0  0  0  0
D8    4  2  0  2

Q1(initial) = (D1, D2, D3, D4, D5, D6, D8)
Q1(invariant) = (D3, D6, D8)
Q1(optional) = (D1, D2, D4, D5) => 4 items
No. selected docs. = 0.333 × (4 items) = 1.332 => 2 items
Centroid(Q1) = (1/3) × (4 + 6 + 4, 0 + 0 + 2, 2 + 4 + 0, 4 + 0 + 2) = (4.7, 0.7, 2.0, 2.0)

Computing Similarity (using simple matching):
SIM(Centroid, D1) = (4.7, 0.7, 2.0, 2.0) · (0, 4, 0, 8) = 18.8
SIM(Centroid, D2) = (4.7, 0.7, 2.0, 2.0) · (0, 2, 0, 0) = 1.4
SIM(Centroid, D4) = (4.7, 0.7, 2.0, 2.0) · (0, 6, 4, 6) = 24.2
SIM(Centroid, D5) = (4.7, 0.7, 2.0, 2.0) · (0, 4, 6, 4) = 22.8

So the final hit list is: (D3, D6, D8) ∪ (D4, D5)

Weighted Boolean Queries - Example

Query: Q2 = C_{0.75} AND D_{1.0}

      A  B  C  D
D1    0  4  0  8
D2    0  2  0  0
D3    4  0  2  4
D4    0  6  4  6
D5    0  4  6  4
D6    6  0  4  0
D7    0  0  0  0
D8    4  2  0  2

Q2(initial) = (D3, D4, D5)
Q2(invariant) = (D3, D4, D5)
Q2(optional) = (D1, D8) => 2 items
No. selected docs. = 0.25 × (2 items) = 0.5 => 1 item
Centroid(Q2) = (1/3) × (4 + 0 + 0, 0 + 6 + 4, 2 + 4 + 6, 4 + 6 + 4) = (1.3, 3.3, 4.0, 4.7)

Computing Similarity (using simple matching):
SIM(Centroid, D1) = (1.3, 3.3, 4.0, 4.7) · (0, 4, 0, 8) = 50.8
SIM(Centroid, D8) = (1.3, 3.3, 4.0, 4.7) · (4, 2, 0, 2) = 21.2

Final hit list is: (D3, D4, D5) ∪ (D1)
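
The six-step matching algorithm can be sketched for the two query shapes used in the examples (a fully weighted operand combined with one partially weighted operand). Several choices here are assumptions rather than part of the slides: a document "contains" a term when its weight is greater than zero, k is rounded up (which matches 1.332 => 2 and 0.5 => 1 in the examples), and all function names are illustrative. The invariant set is assumed to be non-empty.

```python
import math

TERMS = ("A", "B", "C", "D")
# Term weights copied from the example table.
docs = {
    "D1": (0, 4, 0, 8), "D2": (0, 2, 0, 0), "D3": (4, 0, 2, 4), "D4": (0, 6, 4, 6),
    "D5": (0, 4, 6, 4), "D6": (6, 0, 4, 0), "D7": (0, 0, 0, 0), "D8": (4, 2, 0, 2),
}

def docs_with(term):
    """Assumption: a document matches a term when its weight is non-zero."""
    j = TERMS.index(term)
    return {name for name, vec in docs.items() if vec[j] > 0}

def centroid(names):
    """Component-wise mean of the vectors of a non-empty set of documents."""
    return tuple(sum(docs[n][j] for n in names) / len(names) for j in range(len(TERMS)))

def closest(invariant, optional, k):
    """Top-k optional documents by simple matching against the invariant centroid."""
    c = centroid(invariant)
    score = lambda name: sum(a * b for a, b in zip(c, docs[name]))
    return set(sorted(sorted(optional), key=lambda n: -score(n))[:k])

def pick_count(fraction, optional):
    # Assumption: round up; round() first guards against float noise such as 3.0000000004.
    return math.ceil(round(fraction * len(optional), 9))

def weighted_or(full_term, weighted_term, w):
    """full_term has weight 1.0; a fraction w of the extra documents is added."""
    invariant = docs_with(full_term)
    optional = docs_with(weighted_term) - invariant
    return invariant | closest(invariant, optional, pick_count(w, optional))

def weighted_and(weighted_term, full_term, w):
    """full_term has weight 1.0; a fraction 1 - w of the excluded documents is added back."""
    invariant = docs_with(weighted_term) & docs_with(full_term)
    optional = docs_with(full_term) - invariant
    return invariant | closest(invariant, optional, pick_count(1 - w, optional))

q1_hits = weighted_or("A", "B", 0.333)   # Q1 = A_{1.0} OR B_{0.333}
q2_hits = weighted_and("C", "D", 0.75)   # Q2 = C_{0.75} AND D_{1.0}
print(sorted(q1_hits), sorted(q2_hits))
```

The sketch uses the unrounded centroid, so its similarity scores differ slightly from the slide values (for example 24.0 rather than 24.2 for D4 in Q1), but the selected documents are the same.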