Engineering a Set Intersection Algorithm for Information Retrieval

Alex Lopez-Ortiz

UNB / InterNAP

Joint work with Ian Munro and Erik Demaine

Overview

• Web Search Engine Basics

• Algorithms for set operations

• Theoretical Analysis

• Experimental Analysis

• Engineering an Improved Algorithm

• Conclusions

Web Search Engine Basics

• Crawl: sequential gathering process

• Document ID (DocID) for each web page

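[Figure: four example web pages with DocIDs 1 to 4: a “Cool sites” page listing SIGIR, SIGACT, and SIGCOMM, together with the SIGIR, SIGCOMM, and SIGACT pages; the URL http://acm.org/home.html is shown]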

• Indexing: list of entries of type <word, docID1, docID2, ...>

E.g. <cool, 1> <SIGACT, 1, 3> <SIGCOMM, 1, 4> <SIG, 1, 2, 3, 4>

• Postings set: set of docIDs containing a word or pattern.

SIGACT: {1,3}

SIGCOMM: {1,4}
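The index entries and postings sets above can be reproduced with a small sketch. The page texts below are assumptions made up for illustration (the slides only show the resulting entries), and `build_index` and `pattern_postings` are hypothetical helper names, not part of any system described in the talk.

```python
from collections import defaultdict

# Hypothetical page texts keyed by DocID (invented for illustration).
pages = {
    1: "cool sites SIGIR SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT",
    4: "SIGCOMM",
}

def build_index(pages):
    """Map each word to the sorted list of DocIDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def pattern_postings(index, prefix):
    """Postings set of a pattern: DocIDs of all words starting with prefix."""
    ids = set()
    for word, doc_ids in index.items():
        if word.startswith(prefix):
            ids.update(doc_ids)
    return sorted(ids)

index = build_index(pages)
print(index["SIGACT"])                 # [1, 3]
print(index["SIGCOMM"])                # [1, 4]
print(pattern_postings(index, "SIG"))  # [1, 2, 3, 4]
```

Keeping each postings list sorted by DocID is what later makes merge-style and search-based intersection possible.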

Search Engine Basics (cont.)

Postings sets are stored implicitly or explicitly in a string-matching data structure:

• PAT tree/array

• Inverted word index

• Suffix trees

• KMP (grep) ...

String Matching Problem

• Different performance characteristics for each solution

• Time/Space tradeoff (empirical)

• Linear time/linear space lower bound [Demaine/L-O, SODA 2001]

Search Engine Basics (cont.)

A user query is of the form:

keyword1 op keyword2 op … op keywordn

where each op is one of {and, or}

E.g.

computer and science or internet

Evaluating a Boolean Query

The interpretation of a boolean query is the mapping:

• keyword → postings set

• and → ∩ (set intersection)

• or → ∪ (set union)

E.g.

{computer} ∩ {science} ∪ {internet}
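As a minimal sketch of this mapping, the query from the earlier slide can be evaluated with ordinary set operations. The DocIDs below are invented, `evaluate` is a hypothetical helper, and evaluating strictly left to right is an assumption, since the slides do not state operator precedence.

```python
# Hypothetical postings sets (DocIDs invented for illustration).
postings = {
    "computer": {1, 2, 5, 8},
    "science": {2, 3, 5, 9},
    "internet": {4, 5, 7},
}

def evaluate(query, postings):
    """Evaluate 'kw1 op kw2 op ...' left to right, with op in {and, or}."""
    tokens = query.split()
    result = set(postings.get(tokens[0], set()))
    # Operators sit at odd positions, keywords at even positions.
    for op, keyword in zip(tokens[1::2], tokens[2::2]):
        operand = postings.get(keyword, set())
        if op == "and":
            result &= operand   # set intersection
        elif op == "or":
            result |= operand   # set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return result

print(sorted(evaluate("computer and science or internet", postings)))
# [2, 4, 5, 7]
```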

Set Operations for Web Search Engines

• Average postings set size > 10 million

• Postings sets are sorted

Intersection Time Complexity

• Worst case is linear in the size of the postings sets: Θ(n)

{1,3,5,7} ∩ {1,3,5,7}

• Linear in the size of the output?

{1,3,5,7} ∩ {2,4,6,8} (empty output, but the sets interleave completely)
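A plain merge makes the two examples concrete. The sketch below counts one three-way comparison per loop iteration, which is an assumed cost model chosen for illustration; `merge_intersect` is a hypothetical name.

```python
def merge_intersect(a, b):
    """Intersect two sorted lists; return (result, number of comparisons)."""
    i = j = comparisons = 0
    result = []
    while i < len(a) and j < len(b):
        comparisons += 1            # one three-way comparison of a[i] and b[j]
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result, comparisons

print(merge_intersect([1, 3, 5, 7], [1, 3, 5, 7]))  # ([1, 3, 5, 7], 4)
print(merge_intersect([1, 3, 5, 7], [2, 4, 6, 8]))  # ([], 7)
```

The second call returns nothing yet still walks both lists, so the size of the output alone does not bound the work.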

Adaptive Algorithms

• Assume the intersection is empty.

What is the minimum number of comparisons needed to ascertain this fact?

Examples

{1,2,3,4} ∩ {5,6,7,8}: a single comparison (4 < 5) suffices

Much ado About Nothing

A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection.

E.g. A={1,3,5,7} B={2,4,6,8} a1 < b1 < a2 < b2 < a3 < b3 < a4 < b4
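For two disjoint sorted sets, the length of such a proof can be computed directly. The sketch below assumes the two-set case only and counts one comparison for each place where the merged order switches from one set to the other; `proof_length` is a hypothetical helper, not the authors' code.

```python
def proof_length(a, b):
    """Comparisons in a shortest proof that sorted, disjoint a and b do not intersect.

    Assumes exactly two sets: each boundary between a run of elements from
    one set and a run from the other needs one comparison.
    """
    merged = sorted([(x, "a") for x in a] + [(y, "b") for y in b])
    return sum(1 for (_, s), (_, t) in zip(merged, merged[1:]) if s != t)

print(proof_length([1, 3, 5, 7], [2, 4, 6, 8]))  # 7
print(proof_length([1, 2, 3, 4], [5, 6, 7, 8]))  # 1
```

The first result matches the chain a1 < b1 < ... < b4 above, which contains seven comparisons; the second matches the single comparison 4 < 5.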

Adaptive Algorithms

In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in k · |shortest proof of non-intersection| steps.

Ideal for crawled, “bursty” data sets

How does it work?

• <SIGACT, 1, 3, i, n>

[Figure: the entries 1, 3, ..., i, n of the SIGACT postings list laid out over the DocID universe set]

Measuring Performance

• 100MB Web Crawl

• 5000 queries from Google

Baseline Standard Algorithm

• Sort sets by size

• Candidate answer set is smallest set

• For each set S in increasing order by size

  – For each element e in candidate set

    • Binary search for e in S

    • If e is not found remove from candidate set

    • Remove elements before e in S
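The steps above can be sketched as follows. This is one reading of the slide, assuming each set is a sorted Python list; `baseline_intersect` is a hypothetical name, and "remove elements before e" is modeled by moving a lower bound rather than deleting list items.

```python
from bisect import bisect_left

def baseline_intersect(sets):
    """Intersect sorted lists following the baseline steps listed above."""
    if not sets:
        return []
    ordered = sorted(sets, key=len)      # sort sets by size
    candidates = list(ordered[0])        # smallest set is the candidate answer
    for s in ordered[1:]:                # remaining sets, in increasing size
        lo = 0                           # elements of s before lo are discarded
        survivors = []
        for e in candidates:
            lo = bisect_left(s, e, lo)   # binary search for e in s[lo:]
            if lo < len(s) and s[lo] == e:
                survivors.append(e)      # e found: it stays a candidate
            # e not found: it is dropped from the candidate set
        candidates = survivors
        if not candidates:               # nothing left to test
            break
    return candidates

print(baseline_intersect([[6, 7, 10, 11, 14], [4, 8, 10, 11, 15]]))  # [10, 11]
```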

Upper Bound: Adaptive/Traditional Two-Smallest Algorithm

Lower Bound: Adaptive/Shortest Proof

Middle Bound: Adaptive/Encoding of Shortest Proof

Side by Side: Lower Bound and Middle Bound

Possible Improvements

• Adaptive performs best on two or three sets

• Traditional algorithm often terminates after first pair of sets

• Galloping seems better than binary search

• Adaptive keeps a dynamic definition of “smallest set”

• Candidate elements aggressively tested
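To illustrate the galloping alternative to binary search, here is a minimal sketch assuming a sorted Python list and a doubling step (gallop factor 2). The name `gallop` and its signature are assumptions for illustration.

```python
from bisect import bisect_left

def gallop(s, e, lo=0):
    """Return the first index i >= lo with s[i] >= e, probing at doubling steps."""
    step = 1
    hi = lo
    while hi < len(s) and s[hi] < e:
        lo = hi + 1          # everything up to hi is smaller than e
        hi += step
        step *= 2            # gallop factor 2
    # The answer lies in [lo, hi]; finish with a binary search on that window.
    return bisect_left(s, e, lo, min(hi, len(s)))

s = [1, 2, 4, 5, 7, 8, 9, 12]
print(gallop(s, 7))    # 4
print(gallop(s, 10))   # 7
print(gallop(s, 13))   # 8
```

When the target lies close to the starting position, only a few probes are needed, which is the situation the repeated searches of an intersection tend to produce.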

Example

{6, 7,10,11,14}

{4, 8,10,11,15}

{1, 2, 4, 5, 7, 8, 9}

Experimental Results

Test each possible improvement orthogonally:

• Cyclic or Two Smallest

• Symmetric

• Update Smallest

• Advance on Common Element

• Gallop Factor/Binary Search

Binary Search vs. Gallop

Advance on Common Element

Small Adaptive

Combines best of Adaptive and Two-Smallest

• Two-smallest

• Symmetric

• Advance on common element

• Update on smallest

• Gallop with factor 2
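The following sketch shows one way the listed ingredients might fit together. It is an interpretation of the bullet points, not the authors' implementation: the function names, the use of sorted Python lists, and the exact order of operations are all assumptions.

```python
from bisect import bisect_left

def gallop(s, e, lo):
    """First index i >= lo with s[i] >= e, found with doubling steps (factor 2)."""
    step, hi = 1, lo
    while hi < len(s) and s[hi] < e:
        lo = hi + 1
        hi += step
        step *= 2
    return bisect_left(s, e, lo, min(hi, len(s)))

def small_adaptive_sketch(sets):
    """Intersect sorted lists; a hedged reading of the Small Adaptive recipe."""
    lists = [list(s) for s in sets]
    pos = [0] * len(lists)               # pos[i]: first remaining element of lists[i]
    result = []
    if not lists:
        return result

    def remaining(i):
        return len(lists[i]) - pos[i]

    while all(remaining(i) > 0 for i in range(len(lists))):
        # Update smallest: re-rank the sets by number of remaining elements.
        order = sorted(range(len(lists)), key=remaining)
        first = order[0]
        e = lists[first][pos[first]]     # candidate taken from the smallest set
        pos[first] += 1
        found = True
        for i in order[1:]:              # second smallest first, then the rest
            pos[i] = gallop(lists[i], e, pos[i])
            if pos[i] == len(lists[i]) or lists[i][pos[i]] != e:
                found = False            # e is missing here: discard it at once
                break
            pos[i] += 1                  # advance on common element
        if found:
            result.append(e)
    return result

print(small_adaptive_sketch([[6, 7, 10, 11, 14], [4, 8, 10, 11, 15]]))  # [10, 11]
```

Re-ranking on every round is the simplest way to express "update on smallest"; a tuned implementation would likely avoid the repeated sort.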

Small Adaptive

• Small Adaptive is faster than Two-Smallest

• Aggregate speed-up: 2.9x in number of comparisons

• Faster than Adaptive

Conclusions

• Faster intersection algorithm for Web Search Engines

• Adaptive measure for set operations

• Information theoretic “middle bound”

• Standard speed-up techniques for other settings

THE END

Query Log

[Figure: number of queries for each total size, plotted against the total number of elements in a query]

Example

{6, 7,10,11,14}

{4, 8,10,11,15}

{1, 2, 4, 5, 7, 8, 9, 12}
