TRANSCRIPT
Under The Hood [Part I]
Web-Based Information Architectures
MSEC 20-760 – Mini II
28-October-2003
Jaime Carbonell
Topics Covered
• The Vector Space Model for IR (VSM)
• Evaluation Metrics for IR
• Query Expansion (the Rocchio Method)
• Inverted Indexing for Efficiency
• A Glimpse into Harder Problems
The Vector Space Model
• Definitions of document and query vectors, where wj = jth word, and c(wj,di) = count of occurrences of wj in document di:

  Vocabulary = {w1, w2, ..., wn}
  di = [c(w1,di), c(w2,di), ..., c(wn,di)]
  qi = [c(w1,qi), c(w2,qi), ..., c(wn,qi)]
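The definitions above can be sketched in code: a document becomes a vector of term counts over a fixed vocabulary. This is a minimal sketch; the function name, tokenization by whitespace, and the toy vocabulary are illustrative, not from the lecture.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Map a text to its vector of term counts c(w_j, d) over a fixed vocabulary."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[w] for w in vocabulary]

# Toy vocabulary {w1, ..., wn} (illustrative)
vocab = ["heart", "attack", "medicine", "disease"]
d = term_vector("heart disease medicine heart", vocab)  # -> [2, 0, 1, 1]
q = term_vector("heart attack medicine", vocab)         # -> [1, 1, 1, 0]
```

Both documents and queries map into the same n-dimensional space, which is what lets the VSM compare them directly.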
Computing the Similarity
• Dot-product similarity:
  sim(q,di) = q · di
• Cosine similarity:
  simcos(q,di) = (q · di) / (|q| |di|)
Computing Norms and Products
• Dot product:
  q · di = Σ (j=1..n) c(wj,q) · c(wj,di)
• Euclidean vector norm (aka "2-norm"):
  |d| = ( Σ (j=1..n) c(wj,d)^2 )^(1/2)
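The dot product, Euclidean norm, and cosine similarity above translate directly into code. A minimal sketch over plain count vectors (the example vectors are illustrative):

```python
import math

def dot(q, d):
    # q . d = sum over j of c(w_j, q) * c(w_j, d)
    return sum(qj * dj for qj, dj in zip(q, d))

def norm(v):
    # Euclidean 2-norm: sqrt(sum of squared components)
    return math.sqrt(sum(vj * vj for vj in v))

def cosine(q, d):
    # sim_cos(q, d) = q . d / (|q| |d|)
    return dot(q, d) / (norm(q) * norm(d))

q = [1, 1, 0]
d = [2, 0, 2]
# dot(q, d) == 2, norm(d) == sqrt(8), cosine(q, d) == 0.5
```

Dividing by the norms makes cosine similarity length-invariant, so long documents do not win simply by repeating terms.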
Similarity in Retrieval
• Similarity ranking:
If sim(q,di) > sim(q,dj), di ranks higher
• Retrieving top k documents:
  Search(q, C, k) = Argmax[k] over dj in C of simcos(q, dj)
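The Argmax-k retrieval above can be sketched as a sort over the collection by cosine similarity. The collection layout (a dict of doc-id to vector) and the toy data are illustrative assumptions:

```python
import math

def cosine(q, d):
    dp = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dp / (nq * nd) if nq and nd else 0.0

def search(q, C, k):
    """Return the k doc ids in collection C ranked by simcos(q, d), best first."""
    ranked = sorted(C, key=lambda doc_id: cosine(q, C[doc_id]), reverse=True)
    return ranked[:k]

C = {"d1": [1, 0, 2], "d2": [1, 1, 1], "d3": [0, 3, 0]}  # toy collection
q = [1, 0, 1]
# search(q, C, 2) ranks d1 (cos ~ 0.95) above d2 (cos ~ 0.82)
```

A full sort is O(|C| log |C|); a heap of size k would suffice for large collections, but the sort keeps the sketch close to the Argmax formulation.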
Refinements to VSM (1)
Word Normalization
• Words in morphological root form
  countries => country
  interesting => interest
• Stemming as a fast approximation
  countries, country => countr
  moped => mop
• Reduces vocabulary (always good)
• Generalizes matching (usually good)
• More useful for non-English IR
  (Arabic has > 100 variants per verb)
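The slide's point that stemming is a fast but crude approximation can be shown with a deliberately naive suffix-stripper. This is a toy sketch, not a real stemmer (Porter's algorithm is the standard); like real stemmers it over-stems, reproducing the slide's moped => mop example:

```python
def crude_stem(word):
    """Strip one common English suffix; crude on purpose, so it over-stems."""
    for suffix in ("ies", "ing", "ed", "es", "s", "y"):
        # keep at least 3 characters of stem to avoid mangling short words
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# countries -> countr, country -> countr   (conflated, as the slide shows)
# interesting -> interest
# moped -> mop                             (over-stemming error)
```

Conflating countries and country is the desired generalization; mop from moped is the price of speed over morphological analysis.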
Refinements to VSM (2)
Stop-Word Elimination
• Discard articles, auxiliaries, prepositions, ... typically 100-300 most frequent small words
• Reduce document “length” by 30-40%
• Retrieval accuracy improves slightly (5-10%)
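Stop-word elimination is a simple filter against a fixed list of frequent function words. A minimal sketch; the stop-word list here is a tiny illustrative subset of the 100-300 words a real system would use:

```python
# Tiny illustrative stop list; real systems use the 100-300 most frequent small words
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "is", "and", "for"}

def remove_stopwords(tokens):
    """Drop high-frequency function words before indexing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the factors in minimizing stroke and cardiac arrest".split()
# -> ['factors', 'minimizing', 'stroke', 'cardiac', 'arrest']
```

Dropping these words is what shrinks effective document "length" by the 30-40% the slide cites, since function words dominate token counts.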
Refinements to VSM (3)
Proximity Phrases
• E.g.: "air force" => airforce
• Found by high mutual information:
  p(w1 w2) >> p(w1) p(w2)
  p(w1 & w2 in k-window) >>
  p(w1 in k-window) p(w2 in same k-window)
• Retrieval accuracy improves slightly (5-10%)
• Too many phrases => inefficiency
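The adjacent-pair case of the mutual-information test above can be sketched by scoring each bigram with the ratio p(w1 w2) / (p(w1) p(w2)); pairs scoring far above 1 are phrase candidates. The function name and toy corpus are illustrative:

```python
from collections import Counter

def bigram_scores(tokens):
    """Score adjacent word pairs by p(w1 w2) / (p(w1) p(w2)).
    Ratios >> 1 mark candidate proximity phrases."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c in bigrams.items():
        p12 = c / (n - 1)                      # bigram probability
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = p12 / (p1 * p2)
    return scores

tokens = "air force base air force pilot new base".split()
scores = bigram_scores(tokens)
# ("air", "force") scores highest: it co-occurs every time either word appears
```

On a real collection one would also apply a frequency cutoff, since rare pairs produce noisy ratios; that cutoff is what keeps the phrase list from causing the inefficiency the slide warns about.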
Refinements to VSM (4)
Words => Terms
• term = word | stemmed word | phrase
• Use exactly the same VSM method on terms (vs words)
Evaluating Information Retrieval (1)
Contingency table:
                relevant    not-relevant
retrieved          a             b
not retrieved      c             d
Recall = a/(a+c) = fraction of relevant retrieved
Precision = a/(a+b) = fraction of retrieved that is relevant
Evaluating Information Retrieval (2)
P = a/(a+b) R = a/(a+c)
Accuracy = (a+d)/(a+b+c+d)
F1 = 2PR/(P+R)
Miss = c/(a+c) = 1 - R  (false negatives)
F/A = b/(a+b+c+d)  (false positives)
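All of these metrics fall out of the four contingency-table counts. A minimal sketch (the function name and the example counts are illustrative):

```python
def ir_metrics(a, b, c, d):
    """Metrics from the contingency table:
    a = retrieved & relevant, b = retrieved & not relevant,
    c = missed relevant,      d = correctly not retrieved."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (a + b and (a + d) / (a + b + c + d))
    return precision, recall, f1, accuracy

# e.g. a=30, b=10, c=20, d=940:
# P = 0.75, R = 0.60, F1 ~ 0.667, Accuracy = 0.97
```

Note how accuracy (0.97) looks excellent while recall (0.60) is mediocre: with many irrelevant documents, d dominates, which is why IR evaluation prefers P, R, and F1 over accuracy.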
Evaluating Information Retrieval (3)
11-point precision curves
• IR system generates total ranking
• Plot precision at 10%, 20%, 30%, ..., 100% recall
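The curve is usually computed with interpolated precision: at each recall level, take the best precision achieved at that recall or higher. A sketch under that assumption (the input format, a ranked list of relevance judgments, is illustrative):

```python
def eleven_point_precision(ranking, total_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 from a ranked
    list of booleans (True = the document at that rank is relevant)."""
    # (recall, precision) after each relevant document in the ranking
    pr = []
    hits = 0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            pr.append((hits / total_relevant, hits / rank))
    # interpolate: best precision at any recall >= the level
    points = []
    for level in (x / 10 for x in range(11)):
        candidates = [p for r, p in pr if r >= level]
        points.append(max(candidates) if candidates else 0.0)
    return points

# ranking [rel, not-rel, rel] with 2 relevant docs total:
# precision 1.0 up through 50% recall, then 2/3 out to 100%
```

Interpolation makes the curve monotonically non-increasing, so systems can be compared by averaging the eleven points.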
Query Expansion (1)
Observations:
• Longer queries often yield better results
• User's vocabulary may differ from document vocabulary
  Q: how to avoid heart disease
  D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"
• Maybe longer queries have more chances to help recall.
Query Expansion (2)
Bridging the Gap
• Human query expansion (user or expert)
• Thesaurus-based expansion
  Seldom works in practice (unfocused)
• Relevance feedback
  – Widens a thin bridge over the vocabulary gap
  – Adds words from document space to query
• Pseudo-relevance feedback
• Local context analysis
Relevance Feedback:
Rocchio's Method
• Idea: update the query via user feedback:
  q_new = f(q, d_ret, {user feedback})
• Exact method (vector sums):
  q_new = α·q_old + β·Σ d_relevant − γ·Σ d_irrelevant
Relevance Feedback (2)
For example, if:
Q = (heart attack medicine)
W(heart,Q) = W(attack,Q) = W(medicine,Q) = 1
Drel = (cardiac arrest prevention medicine nitroglycerine heart disease ...)
W(nitroglycerine,D) = 2, W(medicine,D) = 1
Dirr = (terrorist attack explosive semtex attack nitroglycerine proximity fuse ...)
W(attack,D) = 1, W(nitroglycerine,D) = 2, W(explosive,D) = 1
AND α = 1, β = 2, γ = 0.5
Relevance Feedback (3)
Then:
W(attack,Q') = 1*1 - 0.5*1 = 0.5
W(nitroglycerine,Q') = 2*2 - 0.5*2 = 3
W(medicine,Q') = 1*1 + 2*1 = 3
W(explosive,Q') = -0.5*1 = -0.5
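The Rocchio update can be sketched over sparse term-weight dictionaries; unmentioned weights in the example documents are assumed to be 1 (one occurrence each). The function name is illustrative, but the defaults match the slide's α=1, β=2, γ=0.5:

```python
def rocchio(q_old, d_rel, d_irr, alpha=1.0, beta=2.0, gamma=0.5):
    """q_new = alpha*q_old + beta*sum(relevant) - gamma*sum(irrelevant),
    computed term by term over sparse {term: weight} vectors."""
    terms = set(q_old)
    for d in list(d_rel) + list(d_irr):
        terms |= set(d)
    q_new = {}
    for t in terms:
        w = alpha * q_old.get(t, 0)
        w += beta * sum(d.get(t, 0) for d in d_rel)
        w -= gamma * sum(d.get(t, 0) for d in d_irr)
        q_new[t] = w
    return q_new

# The slide's example (weights not stated on the slide assumed to be 1):
Q = {"heart": 1, "attack": 1, "medicine": 1}
Drel = {"cardiac": 1, "arrest": 1, "prevention": 1, "medicine": 1,
        "nitroglycerine": 2, "heart": 1, "disease": 1}
Dirr = {"terrorist": 1, "attack": 1, "explosive": 1, "semtex": 1,
        "nitroglycerine": 2, "proximity": 1, "fuse": 1}
Qn = rocchio(Q, [Drel], [Dirr])
# Qn["attack"] == 0.5, Qn["nitroglycerine"] == 3, Qn["medicine"] == 3,
# Qn["explosive"] == -0.5
```

Negative weights like explosive's -0.5 actively push away documents resembling the irrelevant set, which is how feedback narrows the vocabulary gap in both directions.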
Term Weighting Methods (1)
Salton's Tf*IDf
Tf = term frequency in a document
Df = document frequency of term
   = # documents in collection with this term
IDf = Df^(-1) = 1/Df (inverse document frequency)
Term Weighting Methods (2)
Salton's Tf*IDf
TfIDf = f1(Tf) * f2(IDf)
E.g. f1(Tf) = Tf * ave(|Dj|) / |D|
E.g. f2(IDf) = log2(IDf)
f1 and f2 can differ for Q and D
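One common instantiation of the f1/f2 scheme uses raw Tf and a log of the inverse document frequency scaled by collection size, i.e. log2(N/Df); this specific choice of f1 and f2 is an illustrative assumption, not the lecture's definition:

```python
import math

def tf_idf(tf, df, n_docs):
    """One common TfIdf variant: f1(Tf) = Tf, f2 = log2(N / Df).
    Terms appearing in fewer documents get larger weights."""
    return tf * math.log2(n_docs / df)

# Same Tf=3, collection of 1000 docs:
# rare term   (Df = 10):  3 * log2(100) ~ 19.9
# common term (Df = 500): 3 * log2(2)   =  3.0
```

The log damps the effect of Df so a term in 10 documents is not literally 50 times more important than one in 500, only a few times more.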
Efficient Implementations of VSM (1)
Exploit sparseness
• Only compute non-zero multiplies in dot-products
• Do not even look at zero elements (how?)
• => Use non-stop terms to index documents
Efficient Implementations of VSM (2)
Inverted Indexing
• Find all unique [stemmed] terms in document collection
• Remove stopwords from word list
• If collection is large (over 100,000 documents), [optionally] remove singletons
  (usually spelling errors or obscure names)
• Alphabetize or use hash table to store list
• For each term create data structure like:
Efficient Implementations of VSM (3)

[term_i, IDF(term_i),
  <doc_i, freq(term, doc_i),
   doc_j, freq(term, doc_j), ...>]

or:

[term_i, IDF(term_i),
  <doc_i, freq(term, doc_i), [pos_1,i, pos_2,i, ...],
   doc_j, freq(term, doc_j), [pos_1,j, pos_2,j, ...], ...>]

pos_1,j indicates the first position of the term in doc_j, and so on.
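The second (positional) variant can be sketched with a hash table mapping each term to its postings of <doc id, frequency, positions>, which is what lets retrieval touch only non-zero entries. The data layout and toy documents are illustrative:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to postings {doc_id: [freq, [positions]]},
    the positional variant of the posting structure above."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            entry = index[term].setdefault(doc_id, [0, []])
            entry[0] += 1           # freq(term, doc)
            entry[1].append(pos)    # positions of term in doc
    return index

docs = {
    "d1": ["heart", "disease", "heart"],
    "d2": ["heart", "medicine"],
}
idx = build_inverted_index(docs)
# idx["heart"] -> {"d1": [2, [0, 2]], "d2": [1, [0]]}
```

At query time, only the postings of the query's terms are read, so documents sharing no terms with the query contribute zero work, exactly the sparseness exploitation the previous slide calls for. The position lists additionally support the k-window proximity-phrase checks described earlier.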
Open Research Problems in IR (1)
Beyond VSM
• Vectors in different Spaces:
Generalized VSM, Latent Semantic Indexing...
• Probabilistic IR (Language Modeling):
P(D|Q) = P(Q|D)P(D)/P(Q)
Open Research Problems in IR (2)
Beyond Relevance
• Appropriateness of doc to user (comprehension level, etc.)
• Novelty of information in doc to user
  (anti-redundancy as an approximation to novelty)