Efficient Processing of Complex Features for Information Retrieval

Dissertation by Trevor Strohman
Presented by Laura C. Vandivier
For ITCS6050, UNCC, Fall 2008

Overview

• Indexing
• Ranking
• Query Expansion
• Query Evaluation
• Tupleflow

Topics Not Covered

• Binned Probabilities
• Score-Sorted Index Optimization
• Document-Sorted Index Optimization
• Navigational Search with Complex Features

Document Indexing

• Inverted List
  A mapping from a single word to a set of documents that contain the word

• Inverted Index
  A set of inverted lists

Inverted Index

• Contains one inverted list for each term in the document collection

• Often omits frequently occurring words such as “a,” “and” and “the.”

Inverted Index Example

Sample Documents
1. Cats, dogs, dogs.
2. Dogs, cats, sheep.
3. Whales, sheep, goats.
4. Fish, whales, whales.

Inverted Index
cats     1, 2
dogs     1, 2
fish     4
goats    3
sheep    2, 3
whales   3, 4

Query          Answer
cats           1, 2
sheep + dogs   2
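
The example can be reproduced with a minimal Python sketch. It assumes lowercase alphabetic tokenization and uses the document numbers as identifiers; the names `tokenize`, `build_inverted_index` and `search_all` are illustrative and do not come from the dissertation.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Cats, dogs, dogs.",
    2: "Dogs, cats, sheep.",
    3: "Whales, sheep, goats.",
    4: "Fish, whales, whales.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each word to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search_all(index, *words):
    """Return the documents containing every query word (list intersection)."""
    lists = [set(index.get(w, ())) for w in words]
    return sorted(set.intersection(*lists)) if lists else []

index = build_inverted_index(DOCS)
print(search_all(index, "cats"))           # [1, 2]
print(search_all(index, "sheep", "dogs"))  # [2]
```

A multi-word query here is answered by intersecting the inverted lists of its words, which is why "sheep + dogs" returns only document 2.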

Expanding Inverted Indexes

• Include term frequency
  More occurrences of a term imply the document is more “about” that term

Each entry is (document, count):
cats     (1,1) (2,1)
dogs     (1,2) (2,1)
fish     (4,1)
goats    (3,1)
sheep    (2,1) (3,1)
whales   (3,1) (4,2)

Expanding Inverted Indexes (cont.)

• Add word position information
  Facilitates phrase searching

Each entry is (document, count): positions in the document
cats     (1,1): 1     (2,1): 2
dogs     (1,2): 2,3   (2,1): 1
fish     (4,1): 1
goats    (3,1): 3
sheep    (2,1): 3     (3,1): 2
whales   (3,1): 1     (4,2): 2,3
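
To show why positions facilitate phrase searching, here is a small Python sketch over the same four sample documents. The layout (word to document to list of 1-based positions) and the function names are assumptions made for illustration, and the phrase search handles only two-word phrases.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Cats, dogs, dogs.",
    2: "Dogs, cats, sheep.",
    3: "Whales, sheep, goats.",
    4: "Fish, whales, whales.",
}

def build_positional_index(docs):
    """Map word -> {doc_id: [1-based positions]}; the count is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        for pos, word in enumerate(words, start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, first, second):
    """Return documents where `second` appears immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

index = build_positional_index(DOCS)
print(index["dogs"])                         # {1: [2, 3], 2: [1]}
print(phrase_search(index, "cats", "dogs"))  # [1]
print(phrase_search(index, "dogs", "cats"))  # [2]
```

Both documents 1 and 2 contain "cats" and "dogs", so an index without positions cannot tell the two phrases apart.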

Inverted Index Statistics

• Compressed inverted indexes containing only word counts
  – 5% of the document collection in size
  – Built and queried faster

• Compressed inverted indexes containing word counts and positions
  – 20% of the document collection in size
  – Essential for high effectiveness, even in queries not using phrases

Document Ranking

• Documents returned in order of relevance

• Perfect ranking is impossible

• Retrieval systems calculate the probability that a document is relevant

Computing Relevance

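• Assume “bag of words” with term independence

• Simple estimation
  $P(w \mid D) = \frac{c(w;D)}{|D|}$
  where $c(w;D)$ is the # of occurrences of word $w$ in document $D$ and $|D|$ is the document length

• Problems
  1. If a document does not contain all words of a multi-word query it will not be retrieved. Under term independence the per-word estimates are multiplied, so one missing word makes the whole score zero:
     document containing 0 words = document containing some words
  2. All words are treated equally.
     Query = Maltese falcon
     document(maltese:2, falcon:1) = document(maltese:1, falcon:2)
     for documents of similar length

• Smoothing can help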
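
The first problem, and the effect of smoothing, can be seen in a short Python sketch. The smoothing shown is linear interpolation with a collection-wide estimate, which is one common choice and is assumed here; the slides do not say which smoothing method is meant. The documents, the weight `lam` and all function names are made up for illustration.

```python
from collections import Counter

def simple_estimate(word, tokens):
    """P(word | text) = # occurrences / length."""
    return Counter(tokens)[word] / len(tokens)

def smoothed_estimate(word, doc_tokens, collection_tokens, lam=0.5):
    """Mix the document estimate with the collection estimate (assumed method)."""
    return ((1 - lam) * simple_estimate(word, doc_tokens)
            + lam * simple_estimate(word, collection_tokens))

def query_score(query, estimate):
    """Bag of words with term independence: multiply the per-word estimates."""
    score = 1.0
    for word in query:
        score *= estimate(word)
    return score

doc_a = ["maltese", "dog", "show", "winner"]   # contains one query word
doc_b = ["weather", "report", "for", "today"]  # contains no query words
collection = doc_a + doc_b + ["the", "maltese", "falcon", "film"]
query = ["maltese", "falcon"]

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    simple = query_score(query, lambda w: simple_estimate(w, doc))
    smooth = query_score(query, lambda w: smoothed_estimate(w, doc, collection))
    print(name, simple, smooth)
```

With the simple estimate both documents score zero. With smoothing, `doc_a` scores higher than `doc_b` because it matches one of the two query words.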

Computing Relevance (cont.)

• Add additional features
  – Position/field in document, ex. title
  – Proximity of query terms
  – Combinations

Computing Relevance (cont.)

Add query-independent information
• # links from other documents
• URL depth
  shorter → general, longer → specific
• User clicks
  May match expectations but not relevance
• Dwell time
• Document quality models
  Unusual term distribution implies poor grammar, so the document is not a good retrieval candidate

Query Expansion

Stemming
Groups words that mean the same concept based on natural language rules. ex: run, runs, running, ran

• Aggressive Stemmer
  May group words that are not related. ex. marine, marinate

• Conservative Stemmer
  May fail to group words that are related. ex. run, ran

• Statistical Stemmer
  Uses word co-occurrence data to determine if they are related.
  Would probably avoid the marine, marinate mistake.
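
The two rule-based failure modes can be reproduced with toy stemmers. The suffix lists below are invented for this illustration and are far cruder than any real stemmer; they only show how stripping more suffixes groups more words, correctly or not.

```python
def conservative_stem(word):
    """Strip only a plural 's' (hypothetical rule; misses irregular forms)."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

# Illustrative suffixes only, not the rules of any real stemmer.
AGGRESSIVE_SUFFIXES = ("ning", "ate", "ing", "s", "e")

def aggressive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in AGGRESSIVE_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

WORDS = ("run", "runs", "running", "ran", "marine", "marinate")
for stem in (conservative_stem, aggressive_stem):
    print(stem.__name__, {w: stem(w) for w in WORDS})
```

The aggressive rules map both "marine" and "marinate" to "marin", while neither stemmer groups "ran" with "run".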

Query Expansion (cont.)

Synonyms
Group terms that mean the same concept

• Problem
  May be different depending on context
  US: president = head of state = commander in chief
  UK: prime minister = head of government
  Corporation: president = chief executive (maybe)

• Solutions
  – Include synonyms in query but prefer term matches
  – Use context from the whole query
    “president of canada” → “prime minister”

Query Expansion (cont.)

Relevance Feedback
User selects relevant documents and they are used to find similar documents.

Pseudo Relevance Feedback
System assumes the first few documents retrieved are relevant and uses them to search for more.
No user involvement, so not as precise.
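
One simple way to picture pseudo relevance feedback is to add the most frequent new terms from the top-ranked documents to the query. The Python sketch below does that; the selection rule, the parameters `top_k` and `new_terms`, and the sample documents are all assumptions for illustration, not the method used in the dissertation.

```python
from collections import Counter

def expand_query(query, ranked_docs, top_k=2, new_terms=2):
    """Add the most frequent non-query terms from the top-ranked documents.

    ranked_docs: list of token lists, best match first (from an initial search).
    """
    counts = Counter()
    for tokens in ranked_docs[:top_k]:
        counts.update(t for t in tokens if t not in query)
    return list(query) + [term for term, _ in counts.most_common(new_terms)]

ranked = [
    ["maltese", "falcon", "bogart", "film"],
    ["falcon", "bogart", "detective", "film"],
    ["falcon", "bird", "prey"],
]
print(expand_query(["maltese", "falcon"], ranked))
```

Only the first two documents are assumed relevant, so terms from the third ("bird", "prey") never enter the expanded query.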

Evaluation

• Effectiveness

• Efficiency

Effectiveness

• Precision
  # of relevant results / # results

• Success
  Whether the first document was relevant

• Recall
  # relevant docs found / # relevant docs that exist

• Mean Average Precision (MAP)
  Average precision over all relevant documents

• Normalized Discounted Cumulative Gain (NDCG)
  Calculates using sum over result ranks

Calculating MAP

Assume a retrieval set of 10 documents in which the documents at ranks 1, 5, 7, 8 and 10 are relevant.

Rank   Precision
1      1/1 = 1
5      2/5 = .4
7      3/7 = .43
8      4/8 = .5
10     5/10 = .5

If there were only 5 relevant documents, then
(1 + .4 + .43 + .5 + .5) / 5 = .57

If we retrieved only 5 of 6 relevant documents, then
(1 + .4 + .43 + .5 + .5) / 6 = .47
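
The calculation for a single query can be written as a short Python function. This is a sketch with an illustrative name and signature; MAP itself is the mean of this value over a set of queries.

```python
def average_precision(relevant_ranks, total_relevant):
    """Average the precision at each rank where a relevant document was retrieved.

    Relevant documents that were never retrieved contribute zero, so the sum
    is divided by the total number of relevant documents that exist.
    """
    precisions = [
        found / rank
        for found, rank in enumerate(sorted(relevant_ranks), start=1)
    ]
    return sum(precisions) / total_relevant

ranks = [1, 5, 7, 8, 10]
print(round(average_precision(ranks, 5), 2))  # all 5 relevant documents retrieved
print(round(average_precision(ranks, 6), 2))  # 5 of 6 relevant documents retrieved
```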

NDCG

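• Uses graded relevance values, not just is/is not, with 0 being not relevant and 4 being most relevant.

• Calculated as
  $\mathrm{NDCG} = N \sum_{i} \frac{2^{r(i)} - 1}{\log(1 + i)}$

Where N is a normalization constant chosen so that a perfect ranking scores 1, i is the rank and r(i) is the relevance value at that rank.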

Example: MAP and NDCG for three result lists over ranks i = 1 to 20 (figure marking which ranks are relevant not shown)

MAP    NDCG
1.00   1.00
.51    .79
.33    .55
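
A minimal Python sketch of the NDCG formula follows. It assumes the normalization constant N is one over the score of the ideal ordering of the same relevance values, so a perfect ranking scores 1; the relevance lists are made up and are not the ones from the example figure.

```python
import math

def dcg(relevances):
    """Sum (2^r(i) - 1) / log(1 + i) over ranks i = 1, 2, ..."""
    return sum((2 ** r - 1) / math.log(1 + i)
               for i, r in enumerate(relevances, start=1))

def ndcg(relevances):
    """Scale by N = 1 / (score of the ideal ordering); assumed normalization."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([4, 3, 2, 0, 0]))  # ideal ordering
print(ndcg([0, 2, 3, 4, 0]))  # same documents, worse ordering
```

Because the result is a ratio of two sums, the base of the logarithm does not affect it.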

Efficiency

• Throughput
  # of queries processed per second
  Only comparable when measured on identical systems.

• Latency
  Time between when the user issues a query and the system delivers a response.
  < 150ms considered “instantaneous”

• Generally, improving one implies worsening the other

Measuring Efficiency

• Direct
  Attempt to create a real world system and measure statistics.
  Straightforward but limited to experimenter access.

• Simulation
  System operation is simulated in software.
  Repeatable but is only as good as its model.

Query Evaluation

• Document-at-a-time
  Evaluate each term for a document before moving to the next document.

• Term-at-a-time
  Evaluate each document for a term before moving to the next term.

Document-at-a-Time

• Produces complete document scores early so can quickly display partial results.

• Can incrementally fetch the inverted list data so uses less memory.

Document-at-a-Time Algorithm

procedure DocumentAtATimeRetrieval(Q)
    L ← Array()
    R ← PriorityQueue()
    for all terms wi in Q do
        li ← InvertedList(wi)
        L.add( li )
    end for
    for all documents D in the collection do
        for all inverted lists li in L do
            sD ← sD + f(Q,C,wi)(c(wi;D))    # Update the document score
        end for
        sD ← sD · d(Q,C)(|D|)    # Multiply by a document-dependent factor
        R.add( sD, D )
    end for
    return the top n results from R
end procedure
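
The procedure can be sketched in Python as follows. The scoring functions are placeholders chosen for illustration: f is assumed to be the raw term count and d is assumed to be division by document length. The index holds term counts for a few words from the sample documents used earlier.

```python
import heapq

# Term counts from the sample documents: word -> {doc_id: count}
INDEX = {
    "cats": {1: 1, 2: 1},
    "dogs": {1: 2, 2: 1},
    "sheep": {2: 1, 3: 1},
}
DOC_LENGTHS = {1: 3, 2: 3, 3: 3, 4: 3}

def document_at_a_time(query, index, doc_lengths, n=10):
    """Score one document completely before moving to the next."""
    lists = [index.get(word, {}) for word in query]  # one inverted list per term
    results = []                                     # min-heap of (score, doc_id)
    for doc_id, length in doc_lengths.items():       # outer loop: documents
        score = 0.0
        for postings in lists:                       # inner loop: inverted lists
            score += postings.get(doc_id, 0)         # assumed f: raw term count
        score /= length                              # assumed d: length normalization
        heapq.heappush(results, (score, doc_id))
        if len(results) > n:
            heapq.heappop(results)                   # keep only the top n so far
    return sorted(results, reverse=True)

print(document_at_a_time(["dogs", "cats"], INDEX, DOC_LENGTHS, n=2))
```

Each document's score is final as soon as the inner loop ends, which is what allows partial results to be shown early.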

Term-at-a-Time

• Does not jump between inverted lists so saves branching.

• Inner loop iterates over documents so is executed for a long time, thus is easier to optimize.

• Efficient query processing strategies have been developed for term-at-a-time.

• Preferred for efficient system implementation.

Term-at-a-Time Algorithm

procedure TermAtATimeRetrieval(Q)
    A ← HashTable()
    for all terms wi in Q do
        li ← InvertedList(wi)
        for all documents D in li do
            A[D] ← A[D] + f(Q,C,wi)(c(wi;D))    # Update the accumulator
        end for
    end for
    R ← PriorityQueue()
    for all accumulators A[D] in A do
        sD ← A[D] · d(Q,C)(|D|)    # Normalize the accumulator value
        R.add( sD, D )
    end for
    return the top n results from R
end procedure
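
A Python sketch of the term-at-a-time procedure, under the same illustrative assumptions as before: f is taken to be the raw term count, d is taken to be division by document length, and the index holds term counts for a few words from the sample documents.

```python
import heapq
from collections import defaultdict

# Term counts from the sample documents: word -> {doc_id: count}
INDEX = {
    "cats": {1: 1, 2: 1},
    "dogs": {1: 2, 2: 1},
    "sheep": {2: 1, 3: 1},
}
DOC_LENGTHS = {1: 3, 2: 3, 3: 3, 4: 3}

def term_at_a_time(query, index, doc_lengths, n=10):
    """Process one whole inverted list before moving to the next term."""
    accumulators = defaultdict(float)                # A: doc_id -> partial score
    for word in query:                               # outer loop: terms
        for doc_id, count in index.get(word, {}).items():  # inner loop: one list
            accumulators[doc_id] += count            # assumed f: raw term count
    scored = ((total / doc_lengths[doc_id], doc_id)  # assumed d: length normalization
              for doc_id, total in accumulators.items())
    return heapq.nlargest(n, scored)

print(term_at_a_time(["dogs", "cats"], INDEX, DOC_LENGTHS, n=2))
```

Scores stay partial in the accumulator table until every term has been processed, so memory grows with the number of documents touched.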

Optimization Types

• Unoptimized
• Unsafe
• Set Safe
• Rank Safe
• Score Safe

Unoptimized

• Compare the query to each document and calculate the score.

• Sort the documents. Documents with the same score may appear in any order.

• Return results in ranked order. “Top k documents” could be different.

Optimized

• Unsafe
  Documents returned have no guaranteed set of properties.

• Set Safe
  Documents are guaranteed to be in the result set but may not be in the same order as the unoptimized results.

• Rank Safe
  Documents are guaranteed to be in the result set and in the correct order, but document scores may not be the same as the unoptimized results.

• Score Safe
  Documents are guaranteed to be in the result set and have the same scores as the unoptimized results.
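
The four levels can be told apart by comparing an optimized top-k list against the unoptimized one. The Python sketch below is illustrative only: the function name and the (doc_id, score) list format are assumptions, tied scores are ignored, and a single comparison only shows which level one pair of outputs is consistent with, not what an optimization guarantees in general.

```python
def safety_level(unoptimized, optimized):
    """Compare two top-k result lists of (doc_id, score) pairs, best first."""
    base_docs = [doc for doc, _ in unoptimized]
    opt_docs = [doc for doc, _ in optimized]
    if set(base_docs) != set(opt_docs):
        return "unsafe"      # different documents returned
    if base_docs != opt_docs:
        return "set safe"    # same documents, different order
    if unoptimized != optimized:
        return "rank safe"   # same order, different scores
    return "score safe"      # same order and same scores

base = [("d1", 0.9), ("d2", 0.7), ("d3", 0.4)]
print(safety_level(base, [("d1", 0.9), ("d2", 0.7), ("d9", 0.4)]))
print(safety_level(base, [("d2", 0.7), ("d1", 0.9), ("d3", 0.4)]))
print(safety_level(base, [("d1", 3.0), ("d2", 2.0), ("d3", 1.0)]))
print(safety_level(base, list(base)))
```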

Tupleflow

Distributed computing framework for indexing.

• Flexibility
  Settings made in parameter files, no code changes required

• Scalability
  Independent tasks spread across processors

• Disk abstraction
  Streaming data model

• Low abstraction penalty
  Code handles custom hashing, sorting and serialization

Traditional Indexing Approach

Create a word occurrence model by counting the unique terms in each document.

• Serial processing
  Parse one document, move to the next

• Large memory requirements for unique word hash over large document set
  Words, misspellings, numbers, urls, etc.

• Different code required for each document type
  Documents, web pages, databases, etc.

Tupleflow Approach

Break processing into steps
• Count terms (countsMaker)
• Sort terms
• Combine counts (countsReducer)

Tupleflow Example

Input: The cat in the hat.

countsMaker       sort              countsReducer
Word   Count      Word   Count      Word   Count
the    1          cat    1          cat    1
cat    1          hat    1          hat    1
in     1          in     1          in     1
the    1          the    1          the    2
hat    1          the    1
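
The three stages can be imitated in a few lines of Python. This is only a sketch of the data flow: the stage names follow the slide, but the functions are illustrative and are not Tupleflow's actual API.

```python
import re
from itertools import groupby

def counts_maker(text):
    """Emit one (word, 1) tuple per word occurrence."""
    return [(word, 1) for word in re.findall(r"[a-z]+", text.lower())]

def counts_reducer(sorted_tuples):
    """Sum the counts of adjacent tuples that share a word (input must be sorted)."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(sorted_tuples, key=lambda t: t[0])]

tuples = counts_maker("The cat in the hat.")
print(tuples)
print(sorted(tuples))
print(counts_reducer(sorted(tuples)))
# [('cat', 1), ('hat', 1), ('in', 1), ('the', 2)]
```

Because the reducer only ever looks at adjacent tuples, it needs no hash table of all unique words, which is the memory cost the traditional approach pays.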

Tupleflow Execution Graph

• Single Processor
  filenames → read text → parse text → count words

• Multiple Processors
  filenames → 3 parallel pipelines of (read text → parse text → count words) → combine counts
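
The multiple-processor graph can be approximated in Python with a worker pool. This is a loose sketch: threads stand in for separate processors, in-memory strings stand in for the files that "read text" would load, and none of the names come from Tupleflow.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(text):
    """One pipeline: parse the text and count its words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def combine_counts(partial_counts):
    """Merge the per-pipeline counts into one total."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

def parallel_word_count(texts, workers=3):
    """Run one counting pipeline per text, then combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return combine_counts(pool.map(count_words, texts))

texts = ["The cat in the hat.", "The hat.", "A cat."]
print(parallel_word_count(texts))
```

The pipelines share no state, so only the final combine step has to see all of the counts.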

Summary

Document indexing and querying are time and resource intensive tasks. Optimizing and parallelizing wherever possible is essential to minimize resources and maximize efficiency. Tupleflow is one example of efficient indexing by parallelization.

Questions?