Post on 19-Dec-2015
Query Operations: Automatic Local Analysis
Introduction
Difficulty of formulating user queries
– Insufficient knowledge of the collection
– Insufficient knowledge of the retrieval environment
Query reformulation – two basic steps
• Query expansion – expanding the original query with new terms
• Term reweighting – reweighting the terms in the expanded query
Automatic Relevance Feedback
Basic idea
– Clustering: known relevant documents contain terms which can be used to describe a larger cluster of relevant documents.
– Obtain a description for a larger cluster of relevant documents automatically:
• identifying terms which are related to the query terms
• synonyms, stemming variations, terms which are close to the query terms in the text, etc.
– Global analysis vs. local analysis
Global vs. Local Analysis
Global analysis
– All documents in the collection are used to determine a global thesaurus-like structure which defines term relationships.
– This structure can be shown to the user, who selects clusters or terms for query expansion.
Local analysis
– The documents retrieved for a given query q are examined at query time to determine terms for query expansion.
– Local clustering and local context analysis: without assistance from the user.
Local Clustering
Operates solely on the documents retrieved for the current query
Valuable because term distributions are not uniform across topic areas
– Distinguishing terms are different for different topics
– Global techniques cannot take these differences into account
Requires significant run-time computation
– Not for Web search engines due to cost
– Useful in intranet environments and for specialized document collections
Local Clustering
Initially use stemming to group terms
– For stem s = polish, V(s) = {polish, polishing, polished}
Definitions
– q: query
– Dl: local document set (retrieved documents)
– Vl: vocabulary of Dl
– Sl: set of distinct stems for Vl
Three types of clusters
– association clusters
– metric clusters
– scalar clusters
Association Clusters
Idea: terms which co-occur frequently inside documents are likely related to the same concept.
Simple computation based on the frequency of co-occurrence of terms inside documents
– correlation between the stems
Association Clusters
Definitions
– Matrix m has |Sl| rows and |Dl| columns
• mij = fsi,j (frequency of stem si in document dj)
– Correlation cu,v is computed as:
  cu,v = Σdj∈Dl fsu,j × fsv,j
– Matrix s = m·mᵗ
• Unnormalized: su,v = cu,v
• Normalized: su,v = cu,v / (cu,u + cv,v − cu,v)
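The matrix computation above is compact enough to sketch directly; the stems, documents, and frequency counts below are made up for illustration.

```python
import numpy as np

# Sketch of association clusters; the stem/document frequencies below
# are illustrative, not taken from any real collection.
stems = ["polish", "wax", "car"]          # S_l (one row per stem)
# m[i][j] = f_{s_i, j}: frequency of stem s_i in document d_j
m = np.array([
    [2, 0, 1, 3],
    [1, 0, 2, 1],
    [0, 4, 0, 1],
])

s = m @ m.T                               # s = m·m^t, so s[u][v] = c_{u,v}
diag = np.diag(s)
# normalized form: c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})
s_norm = s / (diag[:, None] + diag[None, :] - s)

print(s[0, 1])        # unnormalized correlation between polish and wax
print(s_norm[0, 1])   # same pair, normalized into [0, 1]
```

Note that the normalized diagonal is always 1 (each stem is perfectly correlated with itself), which makes the normalized values directly comparable across stem pairs.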
Association Clusters
Selecting Clusters
– Normally want a cluster of stems for each query term
– Need clusters to be small in order to retain focus
– Select a fixed cluster size, n
To expand query q
– Construct a cluster for each query term:
• Identify sq, the stem for the query term
• For stem sq, select the n stems sv with the largest values sq,v
– The union of all query-term clusters is the expanded query
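The selection step can be sketched as follows, assuming a precomputed normalized correlation matrix; the stems and values are illustrative.

```python
# Sketch of query expansion from a (precomputed) normalized correlation
# matrix; stems and correlation values are illustrative.
stems = ["polish", "wax", "car"]
s_norm = [
    [1.00, 0.54, 0.10],
    [0.54, 1.00, 0.30],
    [0.10, 0.30, 1.00],
]

def expand_query(query_stems, n=1):
    expanded = list(query_stems)
    for q in query_stems:
        u = stems.index(q)
        # rank the other stems by correlation with the query stem, keep top n
        ranked = sorted((v for v in range(len(stems)) if v != u),
                        key=lambda v: s_norm[u][v], reverse=True)
        expanded += [stems[v] for v in ranked[:n] if stems[v] not in expanded]
    return expanded

print(expand_query(["polish"], n=1))
```

Keeping n small (often a handful of stems per query term) is what preserves the focus the slides call for.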
Metric Clusters
Two terms which are near one another are more likely to be correlated than two terms which occur far apart
– factor in the distance between two terms in the computation of their correlation factor
Same as association clusters except for the computation of cu,v
Metric Clusters
Same as Association Clusters except for the computation of cu,v
Correlation between the stems su and sv:
  cu,v = Σki∈V(su) Σkj∈V(sv) 1 / r(ki, kj)
where r(ki, kj) = distance between the two keywords in the same document (infinite if they occur in different documents)
This is the unnormalized form; it can be normalized, e.g. by dividing by |V(su)| × |V(sv)|.
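A minimal sketch of the metric correlation, assuming word offsets within one document as the distance r(ki, kj); the positions and stem variants are made up.

```python
# Sketch of the metric correlation between two stems, computed from
# keyword positions inside a single document (positions are made up).
# r(ki, kj) is taken as the absolute word distance; pairs occurring in
# different documents would contribute nothing (distance = infinity).
positions = {
    "polish":   [3, 40],
    "polished": [17],
    "wax":      [5, 21],
}
V = {"polish": ["polish", "polished"], "wax": ["wax"]}   # stem -> variants

def metric_correlation(su, sv):
    c = 0.0
    for ki in V[su]:
        for kj in V[sv]:
            for pi in positions[ki]:
                for pj in positions[kj]:
                    c += 1.0 / abs(pi - pj)   # nearby occurrences weigh more
    return c

c_uv = metric_correlation("polish", "wax")
```

Because each pair contributes 1/r, occurrences that are adjacent dominate the sum, which is exactly how metric clusters reward proximity over raw co-occurrence counts.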
Scalar Clusters
Idea: two stems with similar neighborhoods have some synonymy relationship.
The relationship is indirect, induced by the neighborhood.
Quantifying such neighborhood relationships
– Arrange all correlation values su,i in a vector
– Arrange all correlation values sv,i in another vector
– Compare these vectors through a scalar measure
– The cosine of the angle between the two vectors is a popular scalar similarity measure
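The cosine comparison of two stems' correlation rows can be sketched as below; the correlation matrix is illustrative.

```python
import math

# Sketch of scalar clusters: compare two stems' correlation-vector
# "neighborhoods" with the cosine measure (matrix values illustrative).
s = [
    [14.0, 7.0, 4.0],    # s[u][i]: correlation between stems u and i
    [7.0, 6.0, 5.0],
    [4.0, 5.0, 9.0],
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(s[u], s[v]))
    nu = math.sqrt(sum(a * a for a in s[u]))
    nv = math.sqrt(sum(b * b for b in s[v]))
    return dot / (nu * nv)

sim_uv = cosine(0, 1)    # close to 1 when the neighborhoods are similar
```

Two stems can score highly here even if they rarely co-occur with each other, as long as they co-occur with the same third stems; that is the "induced" relationship the slide describes.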
Clustering Approaches
In practice:
– Metric clusters outperform association clusters
– Using a combination of normalized and unnormalized correlation factors can be beneficial
• Unnormalized factors tend to group stems due to large frequencies
• Normalized factors tend to group stems which are more rare
Clustering Approaches
Local approaches use the frequencies and correlations of terms and stems within the set of documents retrieved
– These frequencies and correlations may not be representative of the overall collection
– How is this good? How is this bad?
Query Operations: Automatic Global Analysis
Motivation
Methods of local analysis extract information from the local set of documents retrieved in order to expand the query
An alternative is to expand the query using information from the whole set of documents
Until the beginning of the 1990s these techniques failed to yield consistent improvements in retrieval performance
Now, with modern variants, sometimes based on a thesaurus, this perception has changed
Automatic Global Analysis
There are two modern variants based on a thesaurus-like structure built using all documents in the collection
– Query expansion based on a similarity thesaurus
– Query expansion based on a statistical thesaurus
Similarity Thesaurus
The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence.
– These relationships are not derived directly from co-occurrence of terms inside documents.
– They are obtained by considering that the terms are concepts in a concept space.
– In this concept space, each term is indexed by the documents in which it appears.
Terms assume the original role of documents, while documents are interpreted as indexing elements.
Similarity Thesaurus vs. Vector Model
The frequency factor:
– In the vector model:
  f(i,j) = freq(term ki in doc dj) / freq(most common term in dj)
– In the similarity thesaurus:
  f(i,j) = freq(term ki in doc dj) / freq(doc where term ki appears most)
  Normalized by the document where the term appears most often.
The inverse frequency factor:
– In the vector model:
  idf(i) = log(# of docs in collection / # of docs with term ki)
– In the similarity thesaurus:
  itf(j) = log(# of terms in collection / # of terms in doc dj)
  This measures how good a discriminator the document dj is.
Similarity Thesaurus
Definitions:
– t: number of terms in the collection
– N: number of documents in the collection
– fi,j: frequency of occurrence of the term ki in the document dj
– tj: vocabulary of document dj (number of distinct terms in dj)
– itfj: inverse term frequency for document dj
Inverse term frequency for document dj:
  itfj = log(t / tj)
For each term ki:
  ki = (wi,1, wi,2, ..., wi,N)
where wi,j is a weight associated between the term and the documents:
  wi,j = (0.5 + 0.5 · fi,j / maxj(fi,j)) · itfj
         / sqrt( Σl=1..N (0.5 + 0.5 · fi,l / maxl(fi,l))² · itfl² )
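A small sketch of these term vectors on a toy collection. One assumption is made explicit here: wi,j is taken to be zero for documents where ki does not occur (a common reading of the formula), and the document frequencies are invented for illustration.

```python
import math

# Sketch of similarity-thesaurus term vectors on a toy collection.
# Assumption (one common reading of the formula): w_{i,j} = 0 for
# documents where the term does not occur.
docs = [                       # f_{i,j}: term frequencies per document
    {"a": 3, "b": 1},
    {"a": 1, "c": 2},
    {"b": 2, "c": 2, "d": 1},
]
t = len({k for d in docs for k in d})          # terms in the collection
itf = [math.log(t / len(d)) for d in docs]     # itf_j = log(t / t_j)

def term_vector(ki):
    freqs = [d.get(ki, 0) for d in docs]
    fmax = max(freqs)                          # max_j f_{i,j} for this term
    raw = [(0.5 + 0.5 * f / fmax) * itf[j] if f > 0 else 0.0
           for j, f in enumerate(freqs)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]             # w_{i,j}, unit-length vector

k_a = term_vector("a")
```

The denominator normalizes each term vector to unit length, so the correlation factor on the next slide reduces to a dot product of unit vectors.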
Similarity Thesaurus
The relationship between two terms ku and kv is computed as a correlation factor cu,v given by:
  cu,v = Σdj wu,j × wv,j
The global similarity thesaurus is built through the computation of the correlation factor cu,v for each pair of indexing terms [ku, kv] in the collection.
The computation is expensive, but it only has to be done once and can be updated incrementally.
Query Expansion based on a Similarity Thesaurus
Query expansion is done in three steps as follows:
1. Represent the query in the concept space used for representation of the index terms.
2. Based on the global similarity thesaurus, compute a similarity sim(q, kv) between each term kv correlated to the query terms and the whole query q.
3. Expand the query with the top r ranked terms according to sim(q, kv).
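The three steps can be sketched with a tiny, made-up thesaurus. Here the term vectors stand in for the wu,j weights, cu,v is their dot product, and sim(q, kv) is taken as the query-weighted sum of correlations, which is one straightforward reading of step 2.

```python
# Sketch of the three expansion steps with a tiny, made-up thesaurus.
# Term vectors play the role of the w_{u,j} weights; c_{u,v} is their
# dot product, and sim(q, kv) sums c over the weighted query terms.
term_vectors = {
    "polish": [0.8, 0.0, 0.6],
    "wax":    [0.6, 0.2, 0.5],
    "car":    [0.1, 0.9, 0.0],
}

def c(ku, kv):                 # correlation factor c_{u,v}
    return sum(a * b for a, b in zip(term_vectors[ku], term_vectors[kv]))

def expand(query_weights, r=1):
    # Step 1: the query is represented in the concept space by its
    #         term weights (query_weights).
    # Step 2: score every non-query term against the whole query.
    scores = {kv: sum(wq * c(ku, kv) for ku, wq in query_weights.items())
              for kv in term_vectors if kv not in query_weights}
    # Step 3: add the top r ranked terms.
    top = sorted(scores, key=scores.get, reverse=True)[:r]
    return list(query_weights) + top

expanded = expand({"polish": 1.0}, r=1)
```

Because kv is scored against the whole query rather than against single terms, a term only weakly related to each query term individually can still rank highly if it is related to all of them.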
Statistical Thesaurus
Global thesaurus is composed of classes which group correlated terms in the context of the whole collection
– Such correlated terms can then be used to expand the original user query
– These terms must be low frequency terms
– However, it is difficult to cluster low frequency terms
– To circumvent this problem, we cluster documents into classes instead and use the low frequency terms in these documents to define our thesaurus classes
– This algorithm must produce small and tight clusters
Complete Link Algorithm
Document clustering algorithm:
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu, Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents
– Use of the minimum ensures small, focussed clusters
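The loop above can be sketched on a toy 4-document similarity matrix (values invented); for brevity this version returns the flat clustering at the stopping point rather than the full merge hierarchy.

```python
from itertools import combinations

# Sketch of the complete-link algorithm on a toy 4-document
# similarity matrix (values are illustrative).
sim = {
    frozenset({0, 1}): 0.9, frozenset({0, 2}): 0.2,
    frozenset({0, 3}): 0.1, frozenset({1, 2}): 0.3,
    frozenset({1, 3}): 0.2, frozenset({2, 3}): 0.8,
}

def cluster_sim(cu, cv):
    # Complete link: cluster similarity is the MINIMUM similarity over
    # all inter-cluster document pairs -> small, focussed clusters.
    return min(sim[frozenset({a, b})] for a in cu for b in cv)

def complete_link(doc_ids, threshold):
    clusters = [frozenset({d}) for d in doc_ids]            # step 1
    while len(clusters) > 1:
        pairs = list(combinations(clusters, 2))             # step 2
        cu, cv = max(pairs, key=lambda p: cluster_sim(*p))  # step 3
        if cluster_sim(cu, cv) < threshold:                 # stop criterion
            break
        clusters.remove(cu); clusters.remove(cv)
        clusters.append(cu | cv)                            # step 4: merge
    return clusters

clusters = complete_link([0, 1, 2, 3], threshold=0.5)
```

With the minimum as the cluster similarity, one distant document pair is enough to block a merge, which is precisely why complete link yields the small, tight clusters the statistical thesaurus needs.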
Generating the Thesaurus
Given the document cluster hierarchy for the whole collection
– Which clusters become classes?
– Which terms represent classes?
Answers are based on three parameters, specified by the operator based on characteristics of the collection
– TC: threshold class
– NDC: number of documents in a class
– MIDF: minimum inverse document frequency
Selecting Thesaurus Classes
TC is the minimum similarity between two subclusters for the parent to be considered a class
– A high value makes classes smaller and more focussed
NDC is an upper limit on the size of clusters
– A low value of NDC restricts the selection to smaller, more focussed clusters
Picking Terms for Each Class
Consider the set of documents in each class selected above
Only the lower frequency terms are used for the thesaurus classes
The parameter MIDF defines the minimum inverse document frequency for a term to represent the thesaurus class
Initializing TC, NDC, and MIDF
TC depends on the collection
– Inspection of the cluster hierarchy is almost always necessary for assisting with the setting of TC
– A high value of TC might yield classes with too few terms
– A low value of TC yields too few classes
NDC is easier to set once TC is set
MIDF can be difficult to set
Query Expansion with Statistical Thesaurus
Adding terms:
– Use the terms in the same class as the terms in the query
Weights of new terms can be based on both
– the original query term weights (if any), and
– the degree to which a term represents the class of the query term
Conclusions
An automatically generated thesaurus is a method to expand queries
Thesaurus generation is expensive, but it is executed only once
Query expansion based on a similarity thesaurus uses term frequencies to expand the query
Query expansion based on a statistical thesaurus uses document clustering and needs well-defined parameters