Chapter 5: Query Operations
Baeza-Yates, 1999, Modern Information Retrieval
Query Modification
Improving the initial query formulation
Relevance feedback
• approaches based on feedback information from users
Local analysis
• approaches based on information derived from the set of documents initially retrieved (called the local set of documents)
Global analysis
• approaches based on global information derived from the document collection
Relevance Feedback
The relevance feedback process
• shields the user from the details of the query reformulation process
• breaks the whole searching task down into a sequence of small steps which are easier to grasp
• provides a controlled process designed to emphasize some terms and de-emphasize others
Two basic techniques
Query expansion
• addition of new terms from relevant documents
Term reweighting
• modification of term weights based on the user's relevance judgements
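Vector Space Model
Definition
• $w_{i,j}$: the weight of the ith term in the vector for document $d_j$
• $w_{i,k}$: the weight of the ith term in the vector for query $q_k$
• $t$: the number of unique terms in the data set

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}) \qquad q_k = (w_{1,k}, w_{2,k}, \ldots, w_{t,k})$$

$$\mathrm{similarity}(d_j, q_k) = \sum_{i=1}^{t} w_{i,j}\, w_{i,k}$$

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{tf_{i,j}}{\max_{l}\{tf_{l,j}\}}\right) idf_i}{\sqrt{\sum_{k=1}^{t} \left(0.5 + 0.5\,\frac{tf_{k,j}}{\max_{l}\{tf_{l,j}\}}\right)^2 idf_k^2}}$$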
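A minimal Python sketch of the vector space weighting and similarity defined on this slide. The function names and the list-based representation are assumptions made for illustration. Note that, following the formula as written, a term with zero frequency still receives the base weight 0.5 × idf before normalization.

```python
import math

def tfidf_vector(tf, idf):
    """Return the normalized weight vector (w_1j ... w_tj) for one document.

    tf  -- raw term frequencies tf[i] for the document (hypothetical input)
    idf -- inverse document frequencies idf[i], same length as tf
    """
    max_tf = max(tf)
    if max_tf == 0:
        return [0.0] * len(tf)
    # Augmented term frequency times idf, before length normalization.
    raw = [(0.5 + 0.5 * f / max_tf) * w for f, w in zip(tf, idf)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm > 0 else [0.0] * len(tf)

def similarity(d, q):
    """Inner product of a document vector and a query vector."""
    return sum(wd * wq for wd, wq in zip(d, q))
```

Because each vector is normalized to unit length, `similarity(v, v)` is 1 for any non-zero vector `v`.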
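Query Expansion and Term Reweighting for the Vector Model
Ideal situation
• $C_R$: set of relevant documents among all $N$ documents in the collection

$$\vec{q}_{opt} = \frac{1}{|C_R|} \sum_{\vec{d}_j \in C_R} \vec{d}_j - \frac{1}{N - |C_R|} \sum_{\vec{d}_j \notin C_R} \vec{d}_j$$

Rocchio (1965, 1971)
• $R$: set of relevant documents, as identified by the user among the retrieved documents
• $S$: set of non-relevant documents among the retrieved documents

$$\vec{q}_m = \alpha\,\vec{q} + \frac{\beta}{|R|} \sum_{\vec{d}_j \in R} \vec{d}_j - \frac{\gamma}{|S|} \sum_{\vec{d}_j \in S} \vec{d}_j$$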
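Rocchio's Algorithm
Ide_Regular (1971)

$$\vec{q}_m = \alpha\,\vec{q} + \beta \sum_{\vec{d}_j \in R} \vec{d}_j - \gamma \sum_{\vec{d}_j \in S} \vec{d}_j$$

Ide_Dec_Hi

$$\vec{q}_m = \alpha\,\vec{q} + \beta \sum_{\vec{d}_j \in R} \vec{d}_j - \gamma \max\{\vec{d}_j \mid \vec{d}_j \in S\}$$

where the max term denotes the highest-ranked non-relevant document
Parameters
• $\alpha = \beta = \gamma = 1$
• $\beta > \gamma$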
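The three query modification formulas (Rocchio, Ide_Regular, Ide_Dec_Hi) can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, vectors are plain lists of equal length, the defaults use α = β = γ = 1, and `nonrelevant` is assumed to be ordered by rank so that its first element is the highest-ranked non-relevant document.

```python
def _vector_sum(docs, dim):
    """Component-wise sum of a list of document vectors."""
    total = [0.0] * dim
    for d in docs:
        for i, w in enumerate(d):
            total[i] += w
    return total

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio: sums are divided by |R| and |S|."""
    dim = len(q)
    pos = _vector_sum(relevant, dim)
    neg = _vector_sum(nonrelevant, dim)
    r = len(relevant) or 1      # guard against an empty set (assumption)
    s = len(nonrelevant) or 1
    return [alpha * q[i] + beta * pos[i] / r - gamma * neg[i] / s
            for i in range(dim)]

def ide_regular(q, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide_Regular: same as Rocchio but without the size normalization."""
    dim = len(q)
    pos = _vector_sum(relevant, dim)
    neg = _vector_sum(nonrelevant, dim)
    return [alpha * q[i] + beta * pos[i] - gamma * neg[i] for i in range(dim)]

def ide_dec_hi(q, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide_Dec_Hi: subtract only the highest-ranked non-relevant document."""
    dim = len(q)
    pos = _vector_sum(relevant, dim)
    top = nonrelevant[0] if nonrelevant else [0.0] * dim
    return [alpha * q[i] + beta * pos[i] - gamma * top[i] for i in range(dim)]
```

Negative weights are left in the modified query here; whether to clip them to zero is a design choice the slide does not address.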
Probabilistic Model
Definition
• $p_i$: the probability of observing term $t_i$ in the set of relevant documents
• $q_i$: the probability of observing term $t_i$ in the set of non-relevant documents

$$sim(d_j, q) = \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$$

Initial search assumption
• $p_i$ is constant for all terms $t_i$ (typically 0.5)
• $q_i$ can be approximated by the distribution of $t_i$ in the whole collection ($q_i = df_i / N$)

$$\log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = \log \frac{N - df_i}{df_i} \approx \log \frac{N}{df_i} = idf_i$$
Term Reweighting for the Probabilistic Model
Robertson and Sparck Jones (1976)
With relevance feedback from the user
• $N$: the number of documents in the collection
• $R$: the number of relevant documents for query $q$
• $n_i$: the number of documents having term $t_i$
• $r_i$: the number of relevant documents having term $t_i$

Document indexing vs. document relevance

                      Relevant (+)    Non-relevant (-)       Total
Term present (+)      r_i             n_i - r_i              n_i
Term absent (-)       R - r_i         N - n_i - R + r_i      N - n_i
Total                 R               N - R                  N
Initial search assumption
• $p_i$ is constant for all terms $t_i$ (typically 0.5)
• $q_i$ can be approximated by the distribution of $t_i$ in the whole collection

$$sim(d_j, q) = \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log \frac{N - n_i}{n_i}$$

With relevance feedback from users
• $p_i$ and $q_i$ can be approximated by

$$p_i = \frac{r_i}{R} \qquad q_i = \frac{n_i - r_i}{N - R}$$

• hence the term weight is updated by

$$sim(d_j, q) = \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log \frac{r_i (N - n_i - R + r_i)}{(R - r_i)(n_i - r_i)}$$
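Term Reweighting for the Probabilistic Model (Cont.)
However, the last formula poses problems for certain small values of $R$ and $r_i$ (e.g. $R = 1$, $r_i = 0$), so an adjustment factor of 0.5 is added:

$$p_i = \frac{r_i + 0.5}{R + 1} \qquad q_i = \frac{n_i - r_i + 0.5}{N - R + 1}$$

Instead of 0.5, alternative adjustments have been proposed:

$$p_i = \frac{r_i + \frac{n_i}{N}}{R + 1} \qquad q_i = \frac{n_i - r_i + \frac{n_i}{N}}{N - R + 1}$$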
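A short Python sketch of the adjusted term weight, to show why the adjustment matters: with $R = 1$ and $r_i = 0$ the unadjusted estimate $p_i = r_i / R$ is zero and the logarithm is undefined, whereas the adjusted estimates stay strictly between 0 and 1. The function names are hypothetical, and the sketch assumes $n_i \ge 1$, $r_i \le R$ and $n_i - r_i \le N - R$.

```python
import math

def rsj_weight(N, R, n_i, r_i, adjust=0.5):
    """Term weight log(p(1-q) / (q(1-p))) using the adjusted estimates.

    adjust -- the adjustment factor added to the numerators (0.5 by default)
    """
    p = (r_i + adjust) / (R + 1)
    q = (n_i - r_i + adjust) / (N - R + 1)
    return math.log(p * (1 - q) / (q * (1 - p)))

def rsj_weight_ni_over_n(N, R, n_i, r_i):
    """Alternative adjustment: use n_i / N in place of 0.5."""
    return rsj_weight(N, R, n_i, r_i, adjust=n_i / N)
```

In both variants a term that occurs in the relevant documents receives a higher weight than the same term with $r_i = 0$.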
Term Reweighting for the Probabilistic Model (Cont.)
Characteristics
Advantage
• the term reweighting is optimal under the assumptions of
  • term independence
  • binary document indexing ($w_{i,q} \in \{0,1\}$ and $w_{i,j} \in \{0,1\}$)
Disadvantages
• no query expansion is used
• weights of terms in the previous query formulations are also disregarded
• document term weights are not taken into account during the feedback loop
Term Reweighting for the Probabilistic Model (Cont.)
Evaluation of relevance feedback
The standard evaluation method (i.e., recall-precision over the whole collection) is not suitable, because the relevant documents used to reweight the query terms are moved to higher ranks.
The residual collection method
• the residual collection is the set of all documents minus the set of feedback documents provided by the user
• because highly ranked documents are removed from the collection, the recall-precision figures for the modified query $q_m$ tend to be lower than the figures for the original query $q$
• as a basic rule of thumb, any experimentation involving relevance feedback strategies should always evaluate recall-precision figures relative to the residual collection
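The residual collection method can be illustrated with a small Python sketch. The helper names and the use of precision at rank k as the measure are assumptions chosen for brevity; the same filtering step applies to any recall-precision computation.

```python
def residual_ranking(ranking, feedback_docs):
    """Drop the documents the user already judged from a ranked list."""
    seen = set(feedback_docs)
    return [d for d in ranking if d not in seen]

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)

def residual_precision_at_k(ranking, relevant, feedback_docs, k):
    """Precision at k measured on the residual collection only."""
    residual_relevant = set(relevant) - set(feedback_docs)
    return precision_at_k(residual_ranking(ranking, feedback_docs),
                          residual_relevant, k)
```

Both the ranking and the relevant set are filtered, so a feedback document can neither help nor hurt the measured figure.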
Automatic Local Analysis
Definition
• local document set $D_l$: the set of documents retrieved by a query
• local vocabulary $V_l$: the set of all distinct words in $D_l$
• stemmed vocabulary $S_l$: the set of all distinct stems derived from $V_l$
Building local clusters
• association clusters
• metric clusters
• scalar clusters
Association Clusters
Idea
• co-occurrence of stems (or terms) inside documents
• $f_{u,j}$: the frequency of a stem $k_u$ in a document $d_j$

$$c(k_u, k_v) = \sum_{j=1}^{|D_l|} f_{u,j} \times f_{v,j}$$

Local association cluster for a stem $k_u$
• the set of the $k$ largest values $c(k_u, k_v)$
• given a query $q$, find clusters for the $|q|$ query terms
Normalized form

$$s(k_u, k_v) = \frac{c(k_u, k_v)}{c(k_u, k_u) + c(k_v, k_v) - c(k_u, k_v)}$$
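As a rough illustration of association clusters, the following Python sketch builds the correlation matrix from a stem-by-document frequency table, normalizes it, and picks the k most correlated stems for a given stem. The function names and the nested-list layout (`freq[u][j]` holds $f_{u,j}$) are assumptions.

```python
def association_matrix(freq):
    """c[u][v] = sum over local documents j of f_uj * f_vj."""
    n = len(freq)
    return [[sum(a * b for a, b in zip(freq[u], freq[v])) for v in range(n)]
            for u in range(n)]

def normalized_association(c):
    """s[u][v] = c_uv / (c_uu + c_vv - c_uv)."""
    n = len(c)
    s = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            denom = c[u][u] + c[v][v] - c[u][v]
            s[u][v] = c[u][v] / denom if denom else 0.0
    return s

def local_cluster(scores, u, k):
    """Indices of the k stems with the largest correlation to stem u."""
    others = [v for v in range(len(scores)) if v != u]
    others.sort(key=lambda v: scores[u][v], reverse=True)
    return others[:k]
```

For query expansion, `local_cluster` would be called once per query stem and the returned stems added to the query.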
Metric Clusters
Idea
• consider the distance between two terms in the same document
Definition
• $V(k_u)$: the set of keywords which have the same stem form as $k_u$
• distance $r(k_i, k_j)$: the number of words between keywords $k_i$ and $k_j$

$$c(k_u, k_v) = \sum_{k_i \in V(k_u)} \sum_{k_j \in V(k_v)} \frac{1}{r(k_i, k_j)}$$

Normalized form

$$s(k_u, k_v) = \frac{c(k_u, k_v)}{|V(k_u)| \times |V(k_v)|}$$
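A hedged Python sketch of the metric correlation follows. Several choices here are assumptions, not part of the slide: documents are token lists, `stem_of` is a hypothetical word-to-stem dictionary (unlisted words stem to themselves), every pair of occurrences within the same document contributes, and the distance is taken as the difference in token positions so that adjacent words have distance 1 and the reciprocal is always defined.

```python
def metric_correlation(docs, stem_of, stem_u, stem_v):
    """c(k_u, k_v): sum of 1 / r over occurrence pairs in the same document."""
    total = 0.0
    for tokens in docs:
        pos_u = [p for p, w in enumerate(tokens) if stem_of.get(w, w) == stem_u]
        pos_v = [p for p, w in enumerate(tokens) if stem_of.get(w, w) == stem_v]
        for pu in pos_u:
            for pv in pos_v:
                if pu != pv:
                    total += 1.0 / abs(pu - pv)
    return total

def normalized_metric_correlation(docs, stem_of, stem_u, stem_v):
    """s(k_u, k_v) = c(k_u, k_v) / (|V(k_u)| * |V(k_v)|)."""
    vocabulary = {w for tokens in docs for w in tokens}
    size_u = sum(1 for w in vocabulary if stem_of.get(w, w) == stem_u)
    size_v = sum(1 for w in vocabulary if stem_of.get(w, w) == stem_v)
    if size_u == 0 or size_v == 0:
        return 0.0
    return metric_correlation(docs, stem_of, stem_u, stem_v) / (size_u * size_v)
```

Pairs of keywords in different documents contribute nothing, which corresponds to treating their distance as infinite.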
Scalar Clusters
Idea
• two stems with similar neighborhoods have some synonymity relationship
Definition
• $c_{u,v} = c(k_u, k_v)$
• vectors of correlation values for stems $k_u$ and $k_v$

$$\vec{s}_u = (c_{u,1}, c_{u,2}, \ldots, c_{u,t}) \qquad \vec{s}_v = (c_{v,1}, c_{v,2}, \ldots, c_{v,t})$$

• scalar association matrix

$$S_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$$

Scalar clusters
• the set of the $k$ largest values of scalar association
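The scalar association matrix is the cosine between rows of a correlation matrix, which the following Python sketch computes. The function name is hypothetical, and the input `c` is assumed to be a square nested list of correlation values $c_{u,v}$ (for example, an association or metric correlation matrix).

```python
import math

def scalar_association(c):
    """S[u][v] = cosine of the angle between rows u and v of c."""
    n = len(c)
    norms = [math.sqrt(sum(x * x for x in row)) for row in c]
    s = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if norms[u] and norms[v]:
                dot = sum(a * b for a, b in zip(c[u], c[v]))
                s[u][v] = dot / (norms[u] * norms[v])
    return s
```

Two stems get a high scalar association when they correlate with the same other stems, even if they rarely co-occur themselves.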
Automatic Global Analysis
A thesaurus-like structure
Short history
Until the beginning of the 1990s, global analysis was considered to be a technique which failed to yield consistent improvements in retrieval performance with general collections
This perception has changed with the appearance of modern procedures for global analysis
Query Expansion based on a Similarity Thesaurus
Idea by Qiu and Frei [1993]
• the similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence
• terms for expansion are selected based on their similarity to the whole query rather than on their similarities to individual query terms
Definition
• $N$: total number of documents in the collection
• $t$: total number of terms in the collection
• $tf_{i,j}$: occurrence frequency of term $k_i$ in the document $d_j$
• $t_j$: the number of distinct index terms in the document $d_j$
• $itf_j$: the inverse term frequency for document $d_j$

$$itf_j = \log \frac{t}{t_j}$$
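Similarity Thesaurus
Each term $k_i$ is associated with a vector

$$\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$$

where $w_{i,j}$ is a weight associated with the index-document pair $[k_i, d_j]$

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{tf_{i,j}}{\max_{l}\{tf_{i,l}\}}\right) itf_j}{\sqrt{\sum_{k=1}^{N} \left(0.5 + 0.5\,\frac{tf_{i,k}}{\max_{l}\{tf_{i,l}\}}\right)^2 itf_k^2}}$$

The relationship between two terms $k_u$ and $k_v$ is

$$c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{j=1}^{N} w_{u,j} \times w_{v,j}$$

Note that this is a variation of the correlation measure used for computing scalar association matrices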
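Term weighting vs. Term concept space
[Figure: two panels, each relating term $k_i$ and document $d_j$ through $tf_{i,j}$, one per weighting scheme]
Term weighting (document vectors, normalized over the $t$ terms)

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{tf_{i,j}}{\max_{l}\{tf_{l,j}\}}\right) idf_i}{\sqrt{\sum_{k=1}^{t} \left(0.5 + 0.5\,\frac{tf_{k,j}}{\max_{l}\{tf_{l,j}\}}\right)^2 idf_k^2}}$$

Term concept space (term vectors, normalized over the $N$ documents)

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{tf_{i,j}}{\max_{l}\{tf_{i,l}\}}\right) itf_j}{\sqrt{\sum_{k=1}^{N} \left(0.5 + 0.5\,\frac{tf_{i,k}}{\max_{l}\{tf_{i,l}\}}\right)^2 itf_k^2}}$$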
Query Expansion Procedure with Similarity Thesaurus
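1. Represent the query in the concept space by using the representation of the index terms

$$\vec{q} = \sum_{k_u \in q} w_{u,q}\, \vec{k}_u$$

2. Compute the similarity $sim(q, k_v)$ between each term $k_v$ and the whole query

$$sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q}\, \vec{k}_u \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v}$$

3. Expand the query with the top $r$ ranked terms according to $sim(q, k_v)$; each added term $k_v$ receives the weight

$$w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}$$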
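The three-step expansion procedure can be sketched in Python as below. The function name and the dictionary-based data layout are assumptions, as is the decision to consider only terms not already in the query as expansion candidates.

```python
def expand_query(query_weights, term_vectors, r):
    """Expand a query with the r terms most similar to the whole query.

    query_weights -- dict: query term -> weight w_uq
    term_vectors  -- dict: term -> list of N weights (its concept-space vector)
    """
    n_docs = len(next(iter(term_vectors.values())))
    # Step 1: the query as a weighted sum of its term vectors.
    q_vec = [0.0] * n_docs
    for term, w in query_weights.items():
        for j, x in enumerate(term_vectors[term]):
            q_vec[j] += w * x
    # Step 2: similarity of every candidate term to the whole query.
    sims = {term: sum(a * b for a, b in zip(q_vec, vec))
            for term, vec in term_vectors.items()
            if term not in query_weights}
    # Step 3: keep the top r terms and weight them by normalized similarity.
    top = sorted(sims, key=sims.get, reverse=True)[:r]
    total = sum(query_weights.values())
    expanded = dict(query_weights)
    for term in top:
        expanded[term] = sims[term] / total
    return expanded
```

Because similarity is computed against the summed query vector, a candidate term is ranked by its closeness to the query as a whole rather than to any single query term.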
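Example of Similarity Thesaurus
The distance of a given term $k_v$ to the query centroid $Q_C$ might be quite distinct from the distances of $k_v$ to the individual query terms
[Figure: terms $k_a$, $k_b$, $k_i$, $k_j$ and $k_v$ plotted around the query centroid $Q_C$, where $Q_C = \{k_a, k_b\}$]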
Query Expansion based on a Similarity Thesaurus
A document $d_j$ is represented in the term-concept space by

$$\vec{d}_j = \sum_{k_v \in d_j} w_{v,j}\, \vec{k}_v$$

If the original query $q$ is expanded to include all the $t$ index terms, then the similarity $sim(q, d_j)$ between the document $d_j$ and the query $q$ can be computed as

$$sim(q, d_j) = \sum_{k_u \in q} w_{u,q}\, \vec{k}_u \cdot \sum_{k_v \in d_j} w_{v,j}\, \vec{k}_v = \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v}$$

• which is similar to the generalized vector space model
Query Expansion based on a Statistical Thesaurus
Idea by Crouch and Yang (1992)
• use the complete link algorithm to produce small and tight clusters
• use the term discrimination value to select terms for entry into a particular thesaurus class
Term discrimination value
• a measure of the change in space separation which occurs when a given term is assigned to the document collection
Term Discrimination Value
Terms
• good discriminators (terms with positive discrimination values): index terms
• indifferent discriminators (near-zero discrimination values): thesaurus class
• poor discriminators (negative discrimination values): term phrases
Document frequency $df_k$
• $df_k > n/10$: high frequency term (poor discriminator)
• $df_k < n/100$: low frequency term (indifferent discriminator)
• $n/100 \le df_k \le n/10$: good discriminator
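The document frequency thresholds amount to a three-way classification, sketched here in Python. The function name and the returned labels are assumptions; `n` is taken to be the number of documents in the collection.

```python
def classify_by_document_frequency(df_k, n):
    """Classify a term as a poor, indifferent, or good discriminator."""
    if df_k > n / 10:
        return "poor"         # high frequency term
    if df_k < n / 100:
        return "indifferent"  # low frequency term
    return "good"             # n/100 <= df_k <= n/10
```

Under this scheme only the "indifferent" terms are candidates for entry into a thesaurus class.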
Statistical Thesaurus
Term discrimination value theory
• the terms which make up a thesaurus class must be indifferent discriminators
The proposed approach
• cluster the document collection into small, tight clusters
• a thesaurus class is defined as the intersection of all the low frequency terms in that cluster
• documents are indexed by the thesaurus classes
• the thesaurus classes are weighted by

$$w_{tC} = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}$$

where $|C|$ is the number of terms in the thesaurus class $C$ and $w_{i,C}$ is the weight of term $k_i$ in class $C$
Discussion
Query expansion
• a useful but little-explored technique
Trends and research issues
• the combination of local analysis, global analysis, visual displays, and interactive interfaces is also a current and important research problem