TRANSCRIPT
Under The Hood [Part I]
Web-Based Information Architectures
MSEC 20-760 – Mini II
28-October-2003
Jaime Carbonell
Topics Covered
• The Vector Space Model for IR (VSM)
• Evaluation Metrics for IR
• Query Expansion (the Rocchio Method)
• Inverted Indexing for Efficiency
• A Glimpse into Harder Problems
The Vector Space Model
• Definitions of document and query vectors, where wj = jth word, and c(wj,di) = count of occurrences of wj in document di:

  Vocabulary = {w1, w2, ..., wn}
  di = [c(w1,di), c(w2,di), ..., c(wn,di)]
  qi = [c(w1,qi), c(w2,qi), ..., c(wn,qi)]
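The definitions above can be sketched in code: a document becomes a vector of term counts over a fixed vocabulary. This is a minimal sketch; the function name, tokenization by whitespace, and the toy vocabulary are illustrative, not from the lecture.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Map a text to its vector of term counts c(w_j, d) over a fixed vocabulary."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[w] for w in vocabulary]

# Toy vocabulary {w1, ..., wn} (illustrative)
vocab = ["heart", "attack", "medicine", "disease"]
d = term_vector("heart disease medicine heart", vocab)  # -> [2, 0, 1, 1]
q = term_vector("heart attack medicine", vocab)         # -> [1, 1, 1, 0]
```

Both documents and queries map into the same n-dimensional space, which is what lets the VSM compare them directly.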
Computing the Similarity
• Dot-product similarity:
  sim(q,di) = q · di
• Cosine similarity:
  simcos(q,di) = (q · di) / (|q| |di|)
Computing Norms and Products
• Dot product:
  q · di = Σ (j=1..n) c(wj,q) · c(wj,di)
• Euclidean vector norm (aka "2-norm"):
  |d| = ( Σ (j=1..n) c(wj,d)^2 )^(1/2)
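The dot product, Euclidean norm, and cosine similarity above translate directly into code. A minimal sketch over plain count vectors (the example vectors are illustrative):

```python
import math

def dot(q, d):
    # q . d = sum over j of c(w_j, q) * c(w_j, d)
    return sum(qj * dj for qj, dj in zip(q, d))

def norm(v):
    # Euclidean 2-norm: sqrt(sum of squared components)
    return math.sqrt(sum(vj * vj for vj in v))

def cosine(q, d):
    # sim_cos(q, d) = q . d / (|q| |d|)
    return dot(q, d) / (norm(q) * norm(d))

q = [1, 1, 0]
d = [2, 0, 2]
# dot(q, d) == 2, norm(d) == sqrt(8), cosine(q, d) == 0.5
```

Dividing by the norms makes cosine similarity length-invariant, so long documents do not win simply by repeating terms.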
Similarity in Retrieval
• Similarity ranking:
If sim(q,di) > sim(q,dj), di ranks higher
• Retrieving top k documents:
  Search(q, C, k) = Argmax[k] over dj in C of simcos(q, dj)
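The Argmax-k retrieval above can be sketched as a sort over the collection by cosine similarity. The collection layout (a dict of doc-id to vector) and the toy data are illustrative assumptions:

```python
import math

def cosine(q, d):
    dp = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dp / (nq * nd) if nq and nd else 0.0

def search(q, C, k):
    """Return the k doc ids in collection C ranked by simcos(q, d), best first."""
    ranked = sorted(C, key=lambda doc_id: cosine(q, C[doc_id]), reverse=True)
    return ranked[:k]

C = {"d1": [1, 0, 2], "d2": [1, 1, 1], "d3": [0, 3, 0]}  # toy collection
q = [1, 0, 1]
# search(q, C, 2) ranks d1 (cos ~ 0.95) above d2 (cos ~ 0.82)
```

A full sort is O(|C| log |C|); a heap of size k would suffice for large collections, but the sort keeps the sketch close to the Argmax formulation.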
Refinements to VSM (1)
Word Normalization
• Words in morphological root form
  countries => country
  interesting => interest
• Stemming as a fast approximation
  countries, country => countr
  moped => mop
• Reduces vocabulary (always good)
• Generalizes matching (usually good)
• More useful for non-English IR
  (Arabic has > 100 variants per verb)
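The slide's point that stemming is a fast but crude approximation can be shown with a deliberately naive suffix-stripper. This is a toy sketch, not a real stemmer (Porter's algorithm is the standard); like real stemmers it over-stems, reproducing the slide's moped => mop example:

```python
def crude_stem(word):
    """Strip one common English suffix; crude on purpose, so it over-stems."""
    for suffix in ("ies", "ing", "ed", "es", "s", "y"):
        # keep at least 3 characters of stem to avoid mangling short words
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# countries -> countr, country -> countr   (conflated, as the slide shows)
# interesting -> interest
# moped -> mop                             (over-stemming error)
```

Conflating countries and country is the desired generalization; mop from moped is the price of speed over morphological analysis.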
Refinements to VSM (2)
Stop-Word Elimination
• Discard articles, auxiliaries, prepositions, ... typically 100-300 most frequent small words
• Reduce document “length” by 30-40%
• Retrieval accuracy improves slightly (5-10%)
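Stop-word elimination is a simple filter against a fixed list of frequent function words. A minimal sketch; the stop-word list here is a tiny illustrative subset of the 100-300 words a real system would use:

```python
# Tiny illustrative stop list; real systems use the 100-300 most frequent small words
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "is", "and", "for"}

def remove_stopwords(tokens):
    """Drop high-frequency function words before indexing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the factors in minimizing stroke and cardiac arrest".split()
# -> ['factors', 'minimizing', 'stroke', 'cardiac', 'arrest']
```

Dropping these words is what shrinks effective document "length" by the 30-40% the slide cites, since function words dominate token counts.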
Refinements to VSM (3)
Proximity Phrases
• E.g.: "air force" => airforce
• Found by high mutual information:
  p(w1 w2) >> p(w1) p(w2)
  p(w1 & w2 in k-window) >>
  p(w1 in k-window) p(w2 in same k-window)
• Retrieval accuracy improves slightly (5-10%)
• Too many phrases => inefficiency
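The adjacent-pair case of the mutual-information test above can be sketched by scoring each bigram with the ratio p(w1 w2) / (p(w1) p(w2)); pairs scoring far above 1 are phrase candidates. The function name and toy corpus are illustrative:

```python
from collections import Counter

def bigram_scores(tokens):
    """Score adjacent word pairs by p(w1 w2) / (p(w1) p(w2)).
    Ratios >> 1 mark candidate proximity phrases."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c in bigrams.items():
        p12 = c / (n - 1)                      # bigram probability
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = p12 / (p1 * p2)
    return scores

tokens = "air force base air force pilot new base".split()
scores = bigram_scores(tokens)
# ("air", "force") scores highest: it co-occurs every time either word appears
```

On a real collection one would also apply a frequency cutoff, since rare pairs produce noisy ratios; that cutoff is what keeps the phrase list from causing the inefficiency the slide warns about.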
Refinements to VSM (4)
Words => Terms
• term = word | stemmed word | phrase
• Use exactly the same VSM method on terms (vs words)
Evaluating Information Retrieval (1)
Contingency table:
                relevant    not-relevant
retrieved          a             b
not retrieved      c             d
Recall = a/(a+c) = fraction of relevant retrieved
Precision = a/(a+b) = fraction of retrieved that is relevant
Evaluating Information Retrieval (2)
P = a/(a+b) R = a/(a+c)
Accuracy = (a+d)/(a+b+c+d)
F1 = 2PR/(P+R)
Miss = c/(a+c) = 1 - R  (false negatives)
F/A = b/(a+b+c+d)  (false positives)
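All of these metrics fall out of the four contingency-table counts. A minimal sketch (the function name and the example counts are illustrative):

```python
def ir_metrics(a, b, c, d):
    """Metrics from the contingency table:
    a = retrieved & relevant, b = retrieved & not relevant,
    c = missed relevant,      d = correctly not retrieved."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (a + b and (a + d) / (a + b + c + d))
    return precision, recall, f1, accuracy

# e.g. a=30, b=10, c=20, d=940:
# P = 0.75, R = 0.60, F1 ~ 0.667, Accuracy = 0.97
```

Note how accuracy (0.97) looks excellent while recall (0.60) is mediocre: with many irrelevant documents, d dominates, which is why IR evaluation prefers P, R, and F1 over accuracy.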
Evaluating Information Retrieval (3)
11-point precision curves
• IR system generates total ranking
• Plot precision at 10%, 20%, 30%, ..., 100% recall
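The curve is usually computed with interpolated precision: at each recall level, take the best precision achieved at that recall or higher. A sketch under that assumption (the input format, a ranked list of relevance judgments, is illustrative):

```python
def eleven_point_precision(ranking, total_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 from a ranked
    list of booleans (True = the document at that rank is relevant)."""
    # (recall, precision) after each relevant document in the ranking
    pr = []
    hits = 0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            pr.append((hits / total_relevant, hits / rank))
    # interpolate: best precision at any recall >= the level
    points = []
    for level in (x / 10 for x in range(11)):
        candidates = [p for r, p in pr if r >= level]
        points.append(max(candidates) if candidates else 0.0)
    return points

# ranking [rel, not-rel, rel] with 2 relevant docs total:
# precision 1.0 up through 50% recall, then 2/3 out to 100%
```

Interpolation makes the curve monotonically non-increasing, so systems can be compared by averaging the eleven points.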
Query Expansion (1)
Observations:
• Longer queries often yield better results
• User's vocabulary may differ from document vocabulary
  Q: how to avoid heart disease
  D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"
• Maybe longer queries have more chances to help recall.
Query Expansion (2)
Bridging the Gap
• Human query expansion (user or expert)
• Thesaurus-based expansion
  Seldom works in practice (unfocused)
• Relevance feedback
  – Widens a thin bridge over the vocabulary gap
  – Adds words from document space to query
• Pseudo-relevance feedback
• Local context analysis
Relevance Feedback:
Rocchio's Method
• Idea: update the query via user feedback:
  q_new = f(q, d_ret, {user feedback})
• Exact method (vector sums):
  q_new = α·q_old + β·Σ d_relevant − γ·Σ d_irrelevant
Relevance Feedback (2)
For example, if:
Q = (heart attack medicine)
W(heart,Q) = W(attack,Q) = W(medicine,Q) = 1
Drel = (cardiac arrest prevention medicine nitroglycerine heart disease ...)
W(nitroglycerine,D) = 2, W(medicine,D) = 1
Dirr = (terrorist attack explosive semtex attack nitroglycerine proximity fuse ...)
W(attack,D) = 1, W(nitroglycerine,D) = 2, W(explosive,D) = 1
AND α = 1, β = 2, γ = 0.5
Relevance Feedback (3)
Then:
W(attack,Q') = 1*1 - 0.5*1 = 0.5
W(nitroglycerine,Q') = 2*2 - 0.5*2 = 3
W(medicine,Q') = 1*1 + 2*1 = 3
W(explosive,Q') = -0.5*1 = -0.5
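The Rocchio update can be sketched over sparse term-weight dictionaries; unmentioned weights in the example documents are assumed to be 1 (one occurrence each). The function name is illustrative, but the defaults match the slide's α=1, β=2, γ=0.5:

```python
def rocchio(q_old, d_rel, d_irr, alpha=1.0, beta=2.0, gamma=0.5):
    """q_new = alpha*q_old + beta*sum(relevant) - gamma*sum(irrelevant),
    computed term by term over sparse {term: weight} vectors."""
    terms = set(q_old)
    for d in list(d_rel) + list(d_irr):
        terms |= set(d)
    q_new = {}
    for t in terms:
        w = alpha * q_old.get(t, 0)
        w += beta * sum(d.get(t, 0) for d in d_rel)
        w -= gamma * sum(d.get(t, 0) for d in d_irr)
        q_new[t] = w
    return q_new

# The slide's example (weights not stated on the slide assumed to be 1):
Q = {"heart": 1, "attack": 1, "medicine": 1}
Drel = {"cardiac": 1, "arrest": 1, "prevention": 1, "medicine": 1,
        "nitroglycerine": 2, "heart": 1, "disease": 1}
Dirr = {"terrorist": 1, "attack": 1, "explosive": 1, "semtex": 1,
        "nitroglycerine": 2, "proximity": 1, "fuse": 1}
Qn = rocchio(Q, [Drel], [Dirr])
# Qn["attack"] == 0.5, Qn["nitroglycerine"] == 3, Qn["medicine"] == 3,
# Qn["explosive"] == -0.5
```

Negative weights like explosive's -0.5 actively push away documents resembling the irrelevant set, which is how feedback narrows the vocabulary gap in both directions.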
Term Weighting Methods (1)
Salton's Tf*IDf
Tf = term frequency in a document
Df = document frequency of term
   = # documents in collection with this term
IDf = Df^(-1) = 1/Df (inverse document frequency)
Term Weighting Methods (2)
Salton's Tf*IDf
TfIDf = f1(Tf) * f2(IDf)
E.g. f1(Tf) = Tf * ave(|Dj|) / |D|
E.g. f2(IDf) = log2(IDf)
f1 and f2 can differ for Q and D
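One common instantiation of the f1/f2 scheme uses raw Tf and a log of the inverse document frequency scaled by collection size, i.e. log2(N/Df); this specific choice of f1 and f2 is an illustrative assumption, not the lecture's definition:

```python
import math

def tf_idf(tf, df, n_docs):
    """One common TfIdf variant: f1(Tf) = Tf, f2 = log2(N / Df).
    Terms appearing in fewer documents get larger weights."""
    return tf * math.log2(n_docs / df)

# Same Tf=3, collection of 1000 docs:
# rare term   (Df = 10):  3 * log2(100) ~ 19.9
# common term (Df = 500): 3 * log2(2)   =  3.0
```

The log damps the effect of Df so a term in 10 documents is not literally 50 times more important than one in 500, only a few times more.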
Efficient Implementations of VSM (1)
Exploit sparseness
• Only compute non-zero multiplies in dot-products
• Do not even look at zero elements (how?)
• => Use non-stop terms to index documents
Efficient Implementations of VSM (2)
Inverted Indexing
• Find all unique [stemmed] terms in document collection
• Remove stopwords from word list
• If collection is large (over 100,000 documents), [optionally] remove singletons
  (usually spelling errors or obscure names)
• Alphabetize or use hash table to store list
• For each term create data structure like:
Efficient Implementations of VSM (3)

[term_i, IDF(term_i),
  <doc_i, freq(term, doc_i),
   doc_j, freq(term, doc_j), ...>]

or:

[term_i, IDF(term_i),
  <doc_i, freq(term, doc_i), [pos_1,i, pos_2,i, ...],
   doc_j, freq(term, doc_j), [pos_1,j, pos_2,j, ...], ...>]

pos_1,j indicates the first position of the term in doc_j, and so on.
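The second (positional) variant can be sketched with a hash table mapping each term to its postings of <doc id, frequency, positions>, which is what lets retrieval touch only non-zero entries. The data layout and toy documents are illustrative:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to postings {doc_id: [freq, [positions]]},
    the positional variant of the posting structure above."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            entry = index[term].setdefault(doc_id, [0, []])
            entry[0] += 1           # freq(term, doc)
            entry[1].append(pos)    # positions of term in doc
    return index

docs = {
    "d1": ["heart", "disease", "heart"],
    "d2": ["heart", "medicine"],
}
idx = build_inverted_index(docs)
# idx["heart"] -> {"d1": [2, [0, 2]], "d2": [1, [0]]}
```

At query time, only the postings of the query's terms are read, so documents sharing no terms with the query contribute zero work, exactly the sparseness exploitation the previous slide calls for. The position lists additionally support the k-window proximity-phrase checks described earlier.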
Open Research Problems in IR (1)
Beyond VSM
• Vectors in different Spaces:
Generalized VSM, Latent Semantic Indexing...
• Probabilistic IR (Language Modeling):
P(D|Q) = P(Q|D)P(D)/P(Q)
Open Research Problems in IR (2)
Beyond Relevance
• Appropriateness of doc to user (comprehension level, etc.)
• Novelty of information in doc to user
  (anti-redundancy as an approximation to novelty)