Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC, Fall 2008

Post on 21-Dec-2015

Efficient Processing of Complex Features for Information Retrieval

Dissertation by Trevor Strohman
Presented by Laura C. Vandivier
For ITCS6050, UNCC, Fall 2008

Overview

• Indexing
• Ranking
• Query Expansion
• Query Evaluation
• Tupleflow

Topics Not Covered

• Binned Probabilities
• Score-Sorted Index Optimization
• Document-Sorted Index Optimization
• Navigational Search with Complex Features

Document Indexing

• Inverted List
  A mapping from a single word to the set of documents that contain the word

• Inverted Index
  A set of inverted lists

Inverted Index

• Contain one inverted list for each term in the document collection

• Often omit frequently occurring words such as “a,” “and” and “the.”

Inverted Index Example

Sample Documents
1. Cats, dogs, dogs.
2. Dogs, cats, sheep.
3. Whales, sheep, goats.
4. Fish, whales, whales.

Inverted Index
cats    1, 2
dogs    1, 2
fish    4
goats   3
sheep   2, 3
whales  3, 4

Query          Answer
cats           1, 2
sheep + dogs   2
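The example above can be reproduced with a minimal sketch. The names `tokenize`, `build_index` and `query_all` are illustrative assumptions, not from the dissertation, and "sheep + dogs" is read here as "documents containing both terms".

```python
import re
from collections import defaultdict

# The four sample documents from the slide, keyed by document number.
DOCS = {
    1: "Cats, dogs, dogs.",
    2: "Dogs, cats, sheep.",
    3: "Whales, sheep, goats.",
    4: "Fish, whales, whales.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query_all(index, terms):
    """Return the documents that contain every query term."""
    lists = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []

INDEX = build_index(DOCS)
print(INDEX["cats"])                        # [1, 2]
print(query_all(INDEX, ["sheep", "dogs"]))  # [2]
```

A multi-term query is answered by intersecting inverted lists, so only the lists for the query terms are ever read.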

Expanding Inverted Indexes

• Include term frequency
  More occurrences of a term imply the document is more “about” that term.
  Each posting is (document, count).

cats    (1,1) (2,1)
dogs    (1,2) (2,1)
fish    (4,1)
goats   (3,1)
sheep   (2,1) (3,1)
whales  (3,1) (4,2)

Expanding Inverted Indexes (cont.)

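• Add word position information
  Facilitates phrase searching
  Each posting is (document, count): positions, with positions counted from 1 in the sample documents.

cats    (1,1): 1      (2,1): 2
dogs    (1,2): 2,3    (2,1): 1
fish    (4,1): 1
goats   (3,1): 3
sheep   (2,1): 3      (3,1): 2
whales  (3,1): 1      (4,2): 2,3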
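A hedged sketch of why positions help with phrases: the helper names below are assumptions made for illustration, and positions are counted from 1 over the four sample documents. A phrase matches when its terms appear at consecutive positions in the same document. The term count is simply the length of each position list.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Cats, dogs, dogs.",
    2: "Dogs, cats, sheep.",
    3: "Whales, sheep, goats.",
    4: "Fish, whales, whales.",
}

def build_positional_index(docs):
    """Map term -> {document: [positions]}, positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        for pos, term in enumerate(words, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return documents where the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Term k of the phrase must sit k positions after the first term.
            if all(start + k in index[term].get(doc_id, [])
                   for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

INDEX = build_positional_index(DOCS)
print(INDEX["dogs"])                      # {1: [2, 3], 2: [1]}
print(phrase_search(INDEX, "cats dogs"))  # [1]
print(phrase_search(INDEX, "dogs cats"))  # [2]
```

Without positions, both phrase queries would return documents 1 and 2, since each contains both words.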

Inverted Index Statistics

• Compressed inverted indexes containing only word counts
  – 5% of the document collection in size
  – Built and queried faster

• Compressed inverted indexes containing word counts and positions
  – 20% of the document collection in size
  – Essential for high effectiveness, even in queries not using phrases

Document Ranking

• Documents returned in order of relevance

• Perfect ranking impossible

• Retrieval systems calculate the probability that a document is relevant

Computing Relevance

• Assume “bag of words” with term independence

• Simple estimation
  # occurrences / document length

• Problems
  1. If a document does not contain all words of a multi-word query, it will not be retrieved.
     A document containing none of the query words scores the same as a document containing some of them.
  2. All words are treated equally.
     Query = Maltese falcon
     document(maltese:2, falcon:1) = document(maltese:1, falcon:2) for documents of similar length

• Smoothing can help
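The estimate and its first problem can be sketched as follows. This is an illustration under stated assumptions: the slide does not name a smoothing method, so linear interpolation with collection frequencies (Jelinek-Mercer smoothing) is swapped in here, and the function name, the `lam` parameter and the toy documents are all hypothetical.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.0):
    """Multiply per-term estimates, assuming term independence.

    lam = 0 gives the simple estimate (# occurrences / document length).
    lam > 0 mixes in the term's collection-wide frequency (assumed
    Jelinek-Mercer smoothing), so a missing term no longer zeroes the score.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(term for d in collection for term in d)
    coll_len = sum(coll_counts.values())
    score = 1.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / coll_len
        score *= (1 - lam) * p_doc + lam * p_coll
    return score

# Toy collection: each document is a list of words.
docs = [
    ["maltese", "falcon", "film"],   # contains both query words
    ["maltese", "dog", "breed"],     # contains one query word
    ["sheep", "goat", "farm"],       # contains neither
]
query = ["maltese", "falcon"]

unsmoothed = [query_likelihood(query, d, docs) for d in docs]
smoothed = [query_likelihood(query, d, docs, lam=0.5) for d in docs]
print(unsmoothed)  # the second and third documents both score 0.0
print(smoothed)    # all scores positive, ordered by how many words match
```

With smoothing, the document that matches one query word ranks above the document that matches none, which the unsmoothed estimate cannot express.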

Computing Relevance (cont.)

• Add additional features
  – Position/field in document, e.g. title
  – Proximity of query terms
  – Combinations

Computing Relevance (cont.)

Add query-independent information
• # links from other documents
• URL depth
  Shorter = general, longer = specific
• User clicks
  May match expectations but not relevance
• Dwell time
• Document quality models
  Unusual term distribution implies poor grammar, so the document is not a good retrieval candidate

Query Expansion

Stemming
Groups words that express the same concept based on natural language rules. Ex.: run, runs, running, ran

• Aggressive Stemmer
  May group words that are not related. Ex.: marine, marinate

• Conservative Stemmer
  May fail to group words that are related. Ex.: run, ran

• Statistical Stemmer
  Uses word co-occurrence data to determine whether words are related.
  Would probably avoid the marine, marinate mistake.

Query Expansion (cont.)

Synonyms
Group terms that express the same concept

• Problem
  May be different depending on context
    US: President = head of state = commander in chief
    UK: prime minister = head of state
    Corporation: president = chief executive (maybe)

• Solutions
  – Include synonyms in query but prefer term matches
  – Use context from the whole query
    “president of canada” → “prime minister”

Query Expansion (cont.)

Relevance Feedback
User selects relevant documents and they are used to find similar documents.

Pseudo Relevance Feedback
System assumes the first few documents retrieved are relevant and uses them to search for more.
No user involvement, so not as precise.

Evaluation

• Effectiveness

• Efficiency

Effectiveness

• Precision
  # of relevant results / # results

• Success
  Whether the first document was relevant

• Recall
  # relevant docs found / # relevant docs that exist

• Mean Average Precision (MAP)
  Average of the precision values at the rank of each relevant document, then averaged over queries

• Normalized Discounted Cumulative Gain (NDCG)
  Computed as a normalized sum over result ranks
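The three simplest measures above can be written directly from their definitions. This is a small sketch; the function names and the toy result list are assumptions for illustration.

```python
def precision(results, relevant):
    """Fraction of returned documents that are relevant."""
    return sum(1 for doc in results if doc in relevant) / len(results)

def recall(results, relevant):
    """Fraction of all relevant documents that were returned."""
    return sum(1 for doc in results if doc in relevant) / len(relevant)

def success(results, relevant):
    """True when the first returned document is relevant."""
    return bool(results) and results[0] in relevant

results = [1, 2, 3, 4]   # ranked document ids returned by the system
relevant = {1, 4, 9}     # document ids judged relevant

print(precision(results, relevant))  # 0.5
print(success(results, relevant))    # True
```

Recall for this toy example is 2/3, because document 9 is relevant but was never retrieved.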

Calculating MAP

Assume a retrieval set of 10 documents, with the documents at ranks 1, 5, 7, 8 and 10 relevant.

Rank   Precision
1      1/1 = 1
5      2/5 = .4
7      3/7 = .43
8      4/8 = .5
10     5/10 = .5

If there were only 5 relevant documents, then
(1 + .4 + .43 + .5 + .5) / 5 = .57

If we retrieved only 5 of 6 relevant documents, then
(1 + .4 + .43 + .5 + .5) / 6 = .47
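The calculation can be checked with a short sketch; `average_precision` is a hypothetical helper name. Note that the precision at rank 5 is 2/5 = 0.4, so the two averages come out to about .57 and .47.

```python
def average_precision(relevant_ranks, total_relevant):
    """Average the precision at each rank where a relevant document appears.

    Relevant documents that were never retrieved contribute zero, which is
    why the sum is divided by the total number of relevant documents.
    """
    ranks = sorted(relevant_ranks)
    precisions = [(found + 1) / rank for found, rank in enumerate(ranks)]
    return sum(precisions) / total_relevant

print(round(average_precision([1, 5, 7, 8, 10], 5), 2))  # 0.57
print(round(average_precision([1, 5, 7, 8, 10], 6), 2))  # 0.47
```

MAP is then the mean of this per-query value over a set of queries.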

NDCG

• Uses graded relevance values rather than a binary is/is not relevant judgment, with 0 being not relevant and 4 being most relevant.

• Calculated as
  $N \sum_i \frac{2^{r(i)} - 1}{\log(1 + i)}$

  where i is the rank, r(i) is the relevance value at that rank, and N is a normalization constant.

Example (the result-list graphic marking which documents are relevant did not survive in this transcript):

i      1      10     20
MAP    1.00   .51    .33
NDCG   1.00   .79    .55
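A minimal sketch of the NDCG formula, under two assumptions that the slide does not state: N is taken to be 1 divided by the score of the ideal (best possible) ordering of the same documents, and the natural logarithm is used. The ratio does not depend on the base of the logarithm.

```python
import math

def dcg(relevances):
    """Sum of (2^r(i) - 1) / log(1 + i) over ranks i = 1, 2, ..."""
    return sum((2 ** r - 1) / math.log(1 + i)
               for i, r in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering (assumed choice of N)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([4, 3, 2, 0]))  # 1.0, the list is already in ideal order
print(ndcg([0, 0, 4]) < ndcg([4, 0, 0]))  # True, late hits are discounted
```

Because the gain term is exponential in r(i), a highly relevant document near the top of the list dominates the score.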

Efficiency

• Throughput
  # of queries processed per second
  Comparisons must use identical systems.

• Latency
  Time between when the user issues a query and the system delivers a response.
  < 150ms considered “instantaneous”

• Generally, improving one implies worsening the other

Measuring Efficiency

• Direct
  Attempt to create a real world system and measure statistics.
  Straightforward but limited to experimenter access.

• Simulation
  System operation is simulated in software.
  Repeatable but is only as good as its model.

Query Evaluation

• Document-at-a-time
  Evaluate each term for a document before moving to the next document.

• Term-at-a-time
  Evaluate each document for a term before moving to the next term.

Document-at-a-Time

• Produces complete document scores early so can quickly display partial results.

• Can incrementally fetch the inverted list data so uses less memory.

Document-at-a-Time Algorithm

procedure DocumentAtATimeRetrieval(Q)
    L ← Array()
    R ← PriorityQueue()
    for all terms wi in Q do
        li ← InvertedList(wi)
        L.add( li )
    end for
    for all documents D in the collection do
        sD ← 0
        for all inverted lists li in L do
            sD ← sD + f(Q,C,wi)(c(wi;D))    # Update the document score
        end for
        sD ← sD · d(Q,C)(|D|)    # Multiply by a document-dependent factor
        R.add( sD, D )
    end for
    return the top n results from R
end procedure
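A runnable sketch of the procedure above, with loudly stated assumptions: the term scoring function f is replaced by the raw term count, the document-dependent factor d by 1/|D|, and the index layout (term mapped to a list of (document, count) pairs sorted by document) is hypothetical. Unlike the pseudocode, documents with a zero score are left out of the results.

```python
import heapq

# Hypothetical index for the four sample documents: term -> [(doc, count)].
INDEX = {
    "cats": [(1, 1), (2, 1)],
    "dogs": [(1, 2), (2, 1)],
    "sheep": [(2, 1), (3, 1)],
    "whales": [(3, 1), (4, 2)],
}
DOC_LENGTHS = {1: 3, 2: 3, 3: 3, 4: 3}

def document_at_a_time(query, index, doc_lengths, n=10):
    """Finish scoring one document across all lists before the next one."""
    lists = [index.get(term, []) for term in query]
    cursors = [0] * len(lists)  # one read position per inverted list
    results = []
    for doc_id in sorted(doc_lengths):
        score = 0.0
        for i, postings in enumerate(lists):
            # Move this list's cursor forward to the current document.
            while cursors[i] < len(postings) and postings[cursors[i]][0] < doc_id:
                cursors[i] += 1
            if cursors[i] < len(postings) and postings[cursors[i]][0] == doc_id:
                score += postings[cursors[i]][1]  # f: raw count (assumed)
        score /= doc_lengths[doc_id]              # d: 1/|D| (assumed)
        if score > 0:
            results.append((score, doc_id))
    return heapq.nlargest(n, results)

top = document_at_a_time(["cats", "dogs"], INDEX, DOC_LENGTHS)
print([doc_id for _, doc_id in top])  # [1, 2]
```

Each list is read front to back through its cursor, which is what allows the inverted list data to be fetched incrementally.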

Term-at-a-Time

• Does not jump between inverted lists so saves branching.

• Inner loop iterates over documents so is executed for a long time, thus is easier to optimize.

• Efficient query processing strategies have been developed for term-at-a-time.

• Preferred for efficient system implementation.

Term-at-a-Time Algorithm

procedure TermAtATimeRetrieval(Q)
    A ← HashTable()
    for all terms wi in Q do
        li ← InvertedList(wi)
        for all documents D in li do
            A[D] ← A[D] + f(Q,C,wi)(c(wi;D))
        end for
    end for
    R ← PriorityQueue()
    for all accumulators A[D] in A do
        sD ← A[D] · d(Q,C)(|D|)    # Normalize the accumulator value
        R.add( sD, D )
    end for
    return the top n results from R
end procedure
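A runnable sketch of the term-at-a-time procedure, under the same kind of assumptions as any small illustration: f is replaced by the raw term count, d by 1/|D|, and the index layout (term mapped to a list of (document, count) pairs) is hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical index for the four sample documents: term -> [(doc, count)].
INDEX = {
    "cats": [(1, 1), (2, 1)],
    "dogs": [(1, 2), (2, 1)],
    "sheep": [(2, 1), (3, 1)],
    "whales": [(3, 1), (4, 2)],
}
DOC_LENGTHS = {1: 3, 2: 3, 3: 3, 4: 3}

def term_at_a_time(query, index, doc_lengths, n=10):
    """Read each inverted list completely before moving to the next term."""
    accumulators = defaultdict(float)  # partial score per document
    for term in query:
        for doc_id, count in index.get(term, []):
            accumulators[doc_id] += count  # f: raw count (assumed)
    # Normalize each accumulator by document length; d: 1/|D| (assumed).
    results = [(score / doc_lengths[doc_id], doc_id)
               for doc_id, score in accumulators.items()]
    return heapq.nlargest(n, results)

top = term_at_a_time(["cats", "dogs"], INDEX, DOC_LENGTHS)
print([doc_id for _, doc_id in top])  # [1, 2]
```

The inner loop touches a single list, but the accumulator table must hold a partial score for every document seen so far, and no score is final until the last term has been processed.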

Optimization Types

• Unoptimized
• Unsafe
• Set Safe
• Rank Safe
• Score Safe

Unoptimized

• Compare the query to each document and calculate the score.

• Sort the documents. Documents with the same score may appear in any order.

• Return results in ranked order. “Top k documents” could be different.

Optimized

• Unsafe
  Documents returned have no guaranteed set of properties.

• Set Safe
  Documents are guaranteed to be in the result set but may not be in the same order as the unoptimized results.

• Rank Safe
  Documents are guaranteed to be in the result set and in the correct order, but document scores may not be the same as the unoptimized results.

• Score Safe
  Documents are guaranteed to be in the result set and have the same scores as the unoptimized results.
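One way to make the levels concrete is to compare an optimized result list with the unoptimized one for a single query. This sketch is an illustration only: `safety_level` is a hypothetical helper, it assumes the levels nest (score safe implies rank safe implies set safe), and a check on one query can show that a guarantee was violated but cannot prove that it always holds.

```python
def safety_level(baseline, optimized, tol=1e-9):
    """Name the strongest property the optimized top-k list satisfies.

    Both arguments are ranked lists of (doc_id, score) pairs.
    """
    base_ids = [doc_id for doc_id, _ in baseline]
    opt_ids = [doc_id for doc_id, _ in optimized]
    if set(base_ids) != set(opt_ids):
        return "unsafe"      # different documents were returned
    if base_ids != opt_ids:
        return "set safe"    # same documents, different order
    same_scores = all(abs(b - o) <= tol
                      for (_, b), (_, o) in zip(baseline, optimized))
    return "score safe" if same_scores else "rank safe"

baseline = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
print(safety_level(baseline, [("a", 0.9), ("b", 0.5), ("c", 0.1)]))  # rank safe
print(safety_level(baseline, [("b", 2.0), ("a", 3.0), ("c", 1.0)]))  # set safe
```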

Tupleflow

Distributed computing framework for indexing.

• Flexibility
  Settings made in parameter files, no code changes required

• Scalability
  Independent tasks spread across processors

• Disk abstraction
  Streaming data model

• Low abstraction penalty
  Code handles custom hashing, sorting and serialization

Traditional Indexing Approach

Create a word occurrence model by counting the unique terms in each document.

• Serial processing
  Parse one document, move to the next

• Large memory requirements for unique word hash over large document set
  Words, misspellings, numbers, URLs, etc.

• Different code required for each document type
  Documents, web pages, databases, etc.

Tupleflow Approach

Break processing into steps
• Count terms (countsMaker)
• Sort terms
• Combine counts (countsReducer)

Tupleflow Example

Input: The cat in the hat.

countsMaker       sort              countsReducer
Word   Count      Word   Count      Word   Count
the    1          cat    1          cat    1
cat    1          hat    1          hat    1
in     1          in     1          in     1
the    1          the    1          the    2
hat    1          the    1
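The three stages can be imitated with plain Python generators. This is not Tupleflow's API: the function names only echo the stage names on the slide, and everything else is an assumption made for illustration.

```python
import re
from itertools import groupby

def counts_maker(text):
    """Emit one (word, 1) tuple per word occurrence, in reading order."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield (word, 1)

def sort_tuples(tuples):
    """Order tuples by word so equal words become adjacent."""
    return sorted(tuples, key=lambda pair: pair[0])

def counts_reducer(sorted_tuples):
    """Sum the counts of adjacent tuples that share the same word."""
    for word, group in groupby(sorted_tuples, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))

made = list(counts_maker("The cat in the hat."))
ordered = sort_tuples(made)
reduced = list(counts_reducer(ordered))
print(reduced)  # [('cat', 1), ('hat', 1), ('in', 1), ('the', 2)]
```

Because the reducer only ever compares neighboring tuples, it needs no hash table of all unique words, which is the memory cost of the traditional approach.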

Tupleflow Execution Graph

• Single Processor
  filenames → read text → parse text → count words

• Multiple Processors
  filenames → three parallel branches, each (read text → parse text → count words) → combine counts

Summary

Document indexing and querying are time and resource intensive tasks. Optimizing and parallelizing wherever possible is essential to minimize resources and maximize efficiency. Tupleflow is one example of efficient indexing by parallelization.

Questions?