Engineering a Set Intersection Algorithm for Information Retrieval. Alex Lopez-Ortiz, UNB / InterNAP. Joint work with Ian Munro and Erik Demaine.


Post on 05-Jan-2016



TRANSCRIPT

Page 1: Engineering a Set Intersection Algorithm for Information Retrieval

Engineering a Set Intersection Algorithm for Information Retrieval

Alex Lopez-Ortiz

UNB / InterNAP

Joint work with Ian Munro and Erik Demaine

Page 2: Engineering a Set Intersection Algorithm for Information Retrieval

Overview

• Web Search Engine Basics

• Algorithms for set operations

• Theoretical Analysis

• Experimental Analysis

• Engineering an Improved Algorithm

• Conclusions

Page 3: Engineering a Set Intersection Algorithm for Information Retrieval

Web Search Engine Basics

• Crawl: sequential gathering process

• Document ID (DocID) for each web page

[Figure: crawl example. A "Cool sites" page links to SIGIR, SIGACT, and SIGCOMM; the URL http://acm.org/home.html labels one page, and the four pages carry DocIDs 1, 2, 3, 4.]

Page 4: Engineering a Set Intersection Algorithm for Information Retrieval

• Indexing: list of entries of type

<word, docID1, docID2, ...>

E.g.

<cool, 1> <SIGACT, 1, 3> <SIGCOMM, 1, 4> <SIG, 1, 2, 3, 4>
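The index entries above can be sketched in a few lines of Python. This is only an illustration: the page contents in `pages` are hypothetical, chosen so that the resulting entries match the slide, and `build_index` is an assumed helper name. It indexes whole words only, so the pattern entry for SIG is not reproduced.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the sorted list of docIDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.split():
            index[word].add(doc_id)
    # Sorted lists, since the set algorithms assume sorted postings.
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical page contents (assumption), mirroring the slide's example.
pages = {
    1: "cool SIGIR SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT",
    4: "SIGCOMM",
}
index = build_index(pages)
```

With these inputs, `index["SIGACT"]` is `[1, 3]` and `index["SIGCOMM"]` is `[1, 4]`, the same entries as on the slide.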

[Figure: the four crawled pages (Cool sites, SIGIR, SIGACT, SIGCOMM) labeled with DocIDs 1 to 4.]

Page 5: Engineering a Set Intersection Algorithm for Information Retrieval

• Postings set: set of docIDs containing a word or pattern.

SIGACT → {1,3}

SIGCOMM → {1,4}

[Figure: same crawl example as on the previous slide.]

Page 6: Engineering a Set Intersection Algorithm for Information Retrieval

Search Engine Basics (cont.)

Postings set stored implicitly/explicitly in a string matching data structure

• PAT tree/array

• Inverted word index

• Suffix trees

• KMP (grep) ...

Page 7: Engineering a Set Intersection Algorithm for Information Retrieval

String Matching Problem

• Different performance characteristics for each solution

• Time/Space tradeoff (empirical)

• Linear time/linear space lower bound [Demaine/L-O, SODA 2001]

Page 8: Engineering a Set Intersection Algorithm for Information Retrieval

Search Engine Basics (cont.)

A user query is of the form:

keyword1 op keyword2 op … op keywordn

where each op is one of {and, or}

E.g.

computer and science or internet

Page 9: Engineering a Set Intersection Algorithm for Information Retrieval

Evaluating a Boolean Query

The interpretation of a boolean query is the mapping:

• keyword → postings set

• and → ∩ (set intersection)

• or → ∪ (set union)

E.g.

{computer} ∩ {science} ∪ {internet}
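A minimal sketch of this mapping, assuming strict left-to-right evaluation (the slides do not state a precedence rule). The function name `evaluate` and the docIDs in `postings` are made up for illustration.

```python
def evaluate(query, postings):
    """Evaluate 'w1 op w2 op ... wn' left to right, with op in {and, or}."""
    tokens = query.split()
    result = set(postings.get(tokens[0], ()))
    # Operators sit at odd positions, keywords at even positions.
    for op, word in zip(tokens[1::2], tokens[2::2]):
        other = set(postings.get(word, ()))
        if op == "and":
            result &= other   # set intersection
        elif op == "or":
            result |= other   # set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

# Hypothetical postings sets (assumption).
postings = {"computer": [1, 2, 5], "science": [2, 5, 7], "internet": [3, 5]}
answer = evaluate("computer and science or internet", postings)
```

Here `answer` is `[2, 3, 5]`: the intersection `{2, 5}` of the first two sets, united with `{3, 5}`.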

Page 10: Engineering a Set Intersection Algorithm for Information Retrieval

Set Operations for Web Search Engines

• Average postings set size > 10 million

• Postings sets are sorted

Page 11: Engineering a Set Intersection Algorithm for Information Retrieval

Intersection Time Complexity

• Worst case linear in the size of the postings sets: Θ(n)

{1,3,5,7} ∩ {1,3,5,7}

• Linear in the size of the output?

{1,3,5,7} ∩ {2,4,6,8}
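To make the two examples concrete, here is a sketch of the textbook linear merge on two sorted lists, instrumented to count loop steps. It is a standard technique used for illustration, not the algorithm the talk proposes; `merge_intersect` is an assumed name.

```python
def merge_intersect(a, b):
    """Intersect two sorted lists by a linear merge; also return the step count."""
    i = j = 0
    out = []
    steps = 0
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out, steps

same = merge_intersect([1, 3, 5, 7], [1, 3, 5, 7])
interleaved = merge_intersect([1, 3, 5, 7], [2, 4, 6, 8])
```

The interleaved pair has an empty output yet still forces the merge through nearly every element, so the running time cannot be bounded by the output size alone.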

Page 12: Engineering a Set Intersection Algorithm for Information Retrieval

Adaptive Algorithms

• Assume the intersection is empty. What is the minimum number of comparisons needed to ascertain this fact?

Example:

{1,2,3,4} ∩ {5,6,7,8}

Here a single comparison, 4 < 5, suffices.

Page 13: Engineering a Set Intersection Algorithm for Information Retrieval

Much ado About Nothing

A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection.

E.g. A = {1,3,5,7}, B = {2,4,6,8}:

a1 < b1 < a2 < b2 < a3 < b3 < a4 < b4

Page 14: Engineering a Set Intersection Algorithm for Information Retrieval

Adaptive Algorithms

In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in:

k · | shortest proof of non-intersection |

steps.

Ideal for crawled, “bursty” data sets
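The slides give no pseudocode for the adaptive algorithm, so the following is only a rough sketch of one way such an algorithm can be organized: an "eliminator" value is cycled through the sets, and each set is searched for it from its current position. The name `adaptive_intersect`, the round-robin order, and the use of `bisect_left` in place of a galloping search are all assumptions made for brevity.

```python
from bisect import bisect_left

def adaptive_intersect(sets):
    """Intersect k sorted lists by cycling an eliminator through them."""
    k = len(sets)
    if k == 0 or any(not s for s in sets):
        return []
    pos = [0] * k              # current search position in each list
    result = []
    eliminator = sets[0][0]    # candidate value being tested
    matches = 1                # lists known to contain the eliminator
    i = 1 % k                  # next list to probe
    while True:
        s = sets[i]
        # First position in s (from pos[i] on) holding a value >= eliminator.
        pos[i] = bisect_left(s, eliminator, pos[i])
        if pos[i] == len(s):
            return result      # one list is exhausted: nothing more can match
        if s[pos[i]] == eliminator:
            matches += 1
            if matches >= k:   # found in every list
                result.append(eliminator)
                pos[i] += 1
                if pos[i] == len(s):
                    return result
                eliminator = s[pos[i]]
                matches = 1
        else:
            # s skips over the eliminator; its next value becomes the new one.
            eliminator = s[pos[i]]
            matches = 1
        i = (i + 1) % k

two = adaptive_intersect([[6, 7, 10, 11, 14], [4, 8, 10, 11, 15]])
three = adaptive_intersect([[6, 7, 10, 11, 14], [4, 8, 10, 11, 15],
                            [1, 2, 4, 5, 7, 8, 9]])
```

Each failed search corresponds to one comparison in a proof of non-intersection, which is why the work done tracks the proof length rather than the total input size.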

Page 15: Engineering a Set Intersection Algorithm for Information Retrieval

How does it work?

• <SIGACT, 1, 3, i, n>

[Figure: the postings of SIGACT (1, 3, ..., i, ..., n) marked as positions in the DocID universe set.]

Page 16: Engineering a Set Intersection Algorithm for Information Retrieval

Measuring Performance

• 100MB Web Crawl

• 5000 queries from Google

Page 17: Engineering a Set Intersection Algorithm for Information Retrieval

Baseline Standard Algorithm

• Sort sets by size

• Candidate answer set is smallest set

• For each set S in increasing order by size

  – For each element e in candidate set

    • Binary search for e in S

    • If e is not found remove from candidate set

    • Remove elements before e in S
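The baseline steps above can be sketched as follows. The name `baseline_intersect` is an assumption, and "remove elements before e in S" is modeled by a moving lower bound `lo` for the binary search rather than by deleting from the list.

```python
from bisect import bisect_left

def baseline_intersect(sets):
    """Baseline: filter the smallest set against the others by binary search."""
    if not sets:
        return []
    ordered = sorted(sets, key=len)        # sort sets by size
    candidates = list(ordered[0])          # candidate answer set = smallest set
    for s in ordered[1:]:
        lo = 0                             # elements of s before lo are discarded
        kept = []
        for e in candidates:
            lo = bisect_left(s, e, lo)     # binary search for e in s[lo:]
            if lo == len(s):
                break                      # s exhausted: no later candidate can match
            if s[lo] == e:
                kept.append(e)             # e survives this set
        candidates = kept
    return candidates

pair = baseline_intersect([[6, 7, 10, 11, 14], [4, 8, 10, 11, 15]])
```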

Page 18: Engineering a Set Intersection Algorithm for Information Retrieval

Upper Bound: Adaptive/Traditional Two-Smallest Algorithm

Page 19: Engineering a Set Intersection Algorithm for Information Retrieval

Lower Bound: Adaptive/Shortest Proof

Page 20: Engineering a Set Intersection Algorithm for Information Retrieval

Middle Bound: Adaptive/Encoding of Shortest Proof

Page 21: Engineering a Set Intersection Algorithm for Information Retrieval

Side by Side

Lower Bound

Middle Bound

Page 22: Engineering a Set Intersection Algorithm for Information Retrieval

Possible Improvements

• Adaptive performs best on queries with two or three sets

• Traditional algorithm often terminates after first pair of sets

• Galloping seems better than binary search

• Adaptive keeps a dynamic definition of “smallest set”

• Candidate elements aggressively tested
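For readers unfamiliar with galloping (also called exponential or doubling search), here is a minimal sketch with a doubling factor of 2. The function name `gallop` and its interface (return the first index at or after `lo` whose value is at least `target`) are assumptions for illustration.

```python
def gallop(s, target, lo=0):
    """Return the first index i >= lo with s[i] >= target, or len(s) if none."""
    step = 1
    hi = lo
    # Probe at doubling distances until we overshoot target or leave the list.
    while hi < len(s) and s[hi] < target:
        lo = hi + 1            # everything up to hi is smaller than target
        hi += step
        step *= 2
    hi = min(hi, len(s))
    # Finish with a binary search inside the bracketed range s[lo:hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if s[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

idx = gallop([1, 2, 4, 5, 7, 8, 9], 8)
```

The cost is logarithmic in the distance from the starting position to the answer, not in the length of the list, which helps when successive searches land close together.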

Page 23: Engineering a Set Intersection Algorithm for Information Retrieval

Example

{6, 7,10,11,14}

{4, 8,10,11,15}

{1, 2, 4, 5, 7, 8, 9}

Page 24: Engineering a Set Intersection Algorithm for Information Retrieval

Experimental Results

Test each possible improvement orthogonally

• Cyclic or Two Smallest

• Symmetric

• Update Smallest

• Advance on Common Element

• Gallop Factor/Binary Search

Page 25: Engineering a Set Intersection Algorithm for Information Retrieval

Binary Search vs. Gallop

Page 26: Engineering a Set Intersection Algorithm for Information Retrieval

Advance on Common Element

Page 27: Engineering a Set Intersection Algorithm for Information Retrieval

Small Adaptive

Combines best of Adaptive and Two-Smallest

• Two-smallest

• Symmetric

• Advance on common element

• Update on smallest

• Gallop with factor 2
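The sketch below combines several of the listed ingredients: re-ranking sets by remaining size, galloping with factor 2, and advancing past a common element. The slides give no pseudocode, so the control flow, the names `small_adaptive` and `gallop`, and the interpretation of each ingredient are assumptions; "Symmetric" is not modeled.

```python
def gallop(s, target, lo=0):
    """First index i >= lo with s[i] >= target (len(s) if none), doubling steps."""
    step, hi = 1, lo
    while hi < len(s) and s[hi] < target:
        lo, hi, step = hi + 1, hi + step, step * 2
    hi = min(hi, len(s))
    while lo < hi:
        mid = (lo + hi) // 2
        if s[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def small_adaptive(sets):
    """Sketch: test the next element of the currently smallest set in the others."""
    if not sets:
        return []
    # Each entry is [sorted list, index of its first unexamined element].
    state = [[s, 0] for s in sets]
    result = []
    while True:
        # Update on smallest: re-rank by number of remaining elements.
        state.sort(key=lambda entry: len(entry[0]) - entry[1])
        s0, p0 = state[0]
        if p0 == len(s0):
            return result              # some set is exhausted
        e = s0[p0]                     # candidate element
        state[0][1] += 1
        found = True
        for entry in state[1:]:
            s, p = entry
            p = gallop(s, e, p)
            entry[1] = p
            if p == len(s):
                return result          # nothing >= e remains in s
            if s[p] != e:
                found = False          # e is eliminated; pick a new candidate
                break
            entry[1] = p + 1           # advance on common element
        if found:
            result.append(e)

empty = small_adaptive([[6, 7, 10, 11, 14], [4, 8, 10, 11, 15],
                        [1, 2, 4, 5, 7, 8, 9]])
```

On the three sets from the Example slide this returns an empty list, since 10 and 11 are missing from the third set.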

Page 28: Engineering a Set Intersection Algorithm for Information Retrieval

Small Adaptive

Page 29: Engineering a Set Intersection Algorithm for Information Retrieval

Small Adaptive

• Small Adaptive is faster than Two-Smallest

• Aggregate speed-up: 2.9x in number of comparisons

• Faster than Adaptive

Page 30: Engineering a Set Intersection Algorithm for Information Retrieval

Conclusions

• Faster intersection algorithm for Web Search Engines

• Adaptive measure for set operations

• Information theoretic “middle bound”

• Standard speed-up techniques for other settings

THE END

Page 31: Engineering a Set Intersection Algorithm for Information Retrieval

Query Log

[Figure: number of queries for each total size, plotted against the total number of elements in a query.]

Page 32: Engineering a Set Intersection Algorithm for Information Retrieval

Example

{6, 7,10,11,14}

{4, 8,10,11,15}

{1, 2, 4, 5, 7, 8, 9, 12}