
Page 1: Efficient Top-k Algorithms for Fuzzy Search in String Collections · Department of Computer Science University of California, Irvine First International Workshop on Keyword Search

Efficient Top-k Algorithms for Fuzzy Search in String Collections

Rares Vernica, Chen Li

Department of Computer Science
University of California, Irvine

First International Workshop on Keyword Search on Structured Data, 2009

Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17


Outline

1 Motivation

2 Efficient Top-k Algorithms
  Problem Formulation
  Algorithms Overview
  Top-k Single-Pass Search Algorithm

3 Experimental Evaluation


Need for Approximate String Queries

ID  FirstName  LastName        # Movies
10  Al         Swartzberg      1
11  Hanna      Wartenegg       1
12  Rik        Swartzwelder    30
13  Joey       Swartzentruber  1
14  Rene       Swartenbroekx   4
15  Arnold     Schwarzenegger  283
16  Luc        Swartenbroeckx  1
... ...        ...             ...

Figure: Actor names and popularities

SELECT * FROM Actors
WHERE LastName = 'Shwartzenetrugger'
ORDER BY "# Movies" DESC LIMIT 1;

0 Results


Need for Ranking

ID  FirstName  LastName        # Movies  ED
10  Al         Swartzberg      1         8
11  Hanna      Wartenegg       1         8
12  Rik        Swartzwelder    30        8
13  Joey       Swartzentruber  1         4
14  Rene       Swartenbroekx   4         9
15  Arnold     Schwarzenegger  283       5
16  Luc        Swartenbroeckx  1         9
... ...        ...             ...       ...

Figure: Actor names, popularities, and edit distances (ED) to a query string “Shwartzenetrugger”.

Which one result should the system return?
Which value is more important, # Movies or similarity?


Top-k Similar Strings

Given:
  Weighted string collection
    e.g., actors’ LastName and # Movies
  Query string
    e.g., “Shwartzenetrugger”
  Similarity function
    e.g., edit distance
  Scoring function (score of a string in terms of similarity and weight)
    e.g., linear combination of similarity and popularity
  Integer k

Return: the k best strings in terms of overall score with respect to the query string.

Advantages over Range Search:
  specify k instead of a similarity threshold
  guaranteed k results; a range search might have 0 results
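
To make the problem statement concrete, here is a minimal brute-force baseline in Python. It is an illustration only and not one of the talk's algorithms. The `score` function, its `alpha` parameter, and the way edit distance is turned into a similarity in [0, 1] are assumptions introduced for this sketch, as is the five-string toy collection.

```python
import heapq

def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic program, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def score(query, string, weight, alpha=0.5):
    """Hypothetical scoring function: a linear combination of a normalized
    edit similarity and the string's weight (alpha is an assumed knob)."""
    sim = 1.0 - edit_distance(query, string) / max(len(query), len(string), 1)
    return alpha * sim + (1.0 - alpha) * weight

def top_k_brute_force(query, collection, k, alpha=0.5):
    """Score every (string, weight) pair and keep the k best."""
    return heapq.nlargest(
        k, collection, key=lambda sw: score(query, sw[0], sw[1], alpha))

# Toy weighted collection (weights assumed to lie in [0, 1]).
collection = [("ab", 0.80), ("ccd", 0.70), ("cd", 0.60),
              ("abcd", 0.50), ("bcc", 0.40)]
best = top_k_brute_force("bcd", collection, k=2)
```

This baseline computes the similarity for every string in the collection, which is exactly the cost that index-based algorithms try to avoid.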


Algorithms Overview

Iterative Range Search
Single-Pass Search
Two-Phase Search

Figure: the three algorithms; the diagram also contains two boxes labeled Range Search.


q-grams: Overlapping substrings of fixed length

Find similar strings: e.g., “Vernica” and “Veronica”
q-gram: a substring of length q of a string, e.g., q = 2

Vernica  → {Ve, er, rn, ni, ic, ca}
Veronica → {Ve, er, ro, on, ni, ic, ca}

Similar strings share a certain number of grams
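
The decomposition above can be sketched in a few lines of Python. The function name `qgrams` is chosen for this illustration. The sketch compares gram sets and so ignores repeated grams within one string, which is a simplification.

```python
def qgrams(s, q=2):
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

a = qgrams("Vernica")     # grams of the first string
b = qgrams("Veronica")    # grams of the second string
shared = set(a) & set(b)  # grams that occur in both strings
```

For this pair the two strings share five of their 2-grams (Ve, er, ni, ic, ca), even though they differ by one inserted character.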


q-gram Inverted List Index

q = 2

ID  String  Weight
1   ab      0.80
2   ccd     0.70
3   cd      0.60
4   abcd    0.50
5   bcc     0.40

Figure: Dataset

Gram  IDs
ab    1, 4
cc    2, 5
cd    2, 3, 4
bc    4, 5

Figure: Gram inverted-list index

Query string: “bcd”
Identified strings are verified by computing the real similarity.
Verification is usually an expensive process.
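
As a rough illustration of the index above, the following Python sketch builds the gram inverted lists for the five-string dataset and counts how many query grams each string shares with “bcd”. The names `build_index` and `candidates` are hypothetical, and the verification step (computing the real similarity) is deliberately left out.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """Overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Map each gram to the ascending list of IDs of strings containing it."""
    index = defaultdict(list)
    for sid in sorted(strings):
        for gram in set(qgrams(strings[sid], q)):
            index[gram].append(sid)
    return dict(index)

def candidates(index, query, q=2):
    """Count, for each ID, how many of the query's distinct grams it contains."""
    counts = Counter()
    for gram in set(qgrams(query, q)):
        counts.update(index.get(gram, []))
    return counts

strings = {1: "ab", 2: "ccd", 3: "cd", 4: "abcd", 5: "bcc"}
index = build_index(strings)
hits = candidates(index, "bcd")  # grams of the query: bc, cd
```

Here string 4 (“abcd”) shares both query grams, while strings 2, 3, and 5 share one each; only such identified strings would then need the expensive verification.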


Top-k Single-pass Search Algorithm


Setup
  Assign IDs s.t. ascending order of IDs ≡ descending order of weights
  Sort the IDs on each list in ascending order
  Scan the lists corresponding to the grams in the query, e.g., “bcd”
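
The first setup step can be sketched as follows. This is a minimal Python illustration with a hypothetical helper name, `assign_ids`; how ties between equal weights are broken is not stated in the slides, so the stable sort used here is an assumption.

```python
def assign_ids(weighted_strings):
    """Give ID 1 to the heaviest string, ID 2 to the next, and so on,
    so that ascending ID order equals descending weight order."""
    ranked = sorted(weighted_strings, key=lambda sw: sw[1], reverse=True)
    return {sid: sw for sid, sw in enumerate(ranked, start=1)}

# The toy dataset in arbitrary input order.
data = [("cd", 0.60), ("bcc", 0.40), ("ab", 0.80), ("abcd", 0.50), ("ccd", 0.70)]
by_id = assign_ids(data)
```

With IDs assigned this way, scanning any inverted list in ascending ID order visits strings from the highest weight to the lowest.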


Naïve approach: Round-Robin
  Scan all the lists at the same time
  Maintain a list of “open” IDs (might still appear on some of the lists)
  Store the best k “closed” IDs in a top-k buffer
  Stop when the top-k buffer cannot improve
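
The stopping rule in the last bullet can be illustrated in isolation. The Python sketch below is not the round-robin scan itself: it assumes each item's similarity is already known when the item is visited, that items arrive in descending weight order, that weight and similarity both lie in [0, 1], and that the score is a linear combination controlled by a hypothetical `alpha`. Under those assumptions, no unseen item can score above `alpha + (1 - alpha) * weight` of the current item.

```python
import heapq

def top_k_with_early_stop(items, k, alpha=0.5):
    """items: (weight, similarity) pairs in descending weight order.
    Returns the k best scores and the number of items actually scored."""
    buffer = []   # min-heap holding the k best scores seen so far
    scanned = 0
    for weight, sim in items:
        # Upper bound for this and every later item: perfect similarity,
        # and a weight no larger than the current one.
        bound = alpha * 1.0 + (1.0 - alpha) * weight
        if len(buffer) == k and buffer[0] >= bound:
            break  # the top-k buffer cannot improve any more
        scanned += 1
        s = alpha * sim + (1.0 - alpha) * weight
        if len(buffer) < k:
            heapq.heappush(buffer, s)
        elif s > buffer[0]:
            heapq.heapreplace(buffer, s)
    return sorted(buffer, reverse=True), scanned

items = [(0.8, 1.0), (0.7, 0.0), (0.6, 0.5), (0.1, 1.0)]
scores, scanned = top_k_with_early_stop(items, k=1)
```

In this toy run the first item already reaches a score that no lighter item can beat, so the scan stops after scoring a single item.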


Heap-based
  Traverse the lists in sorted order using a heap on the top IDs of the lists
  Advantages:
    1 No need to maintain the list of “open” IDs
    2 Skip elements

Figure: Gram inverted lists for query “abcd” (lists for the grams ab, bc, and cd)

Figure: Top-k buffer, k = 1

Figure: Min-heap
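
The heap-based traversal can be sketched as a k-way merge. The Python code below is a minimal sketch under stated assumptions: it only merges the sorted lists with a min-heap on their current head IDs and reports how many lists contain each ID. It does not implement the skipping of elements or the top-k buffer, and the name `heap_merge_counts` is hypothetical.

```python
import heapq

def heap_merge_counts(lists):
    """Yield (string_id, count) in ascending ID order, where count is the
    number of input lists containing that ID. Each list must be sorted
    in ascending order."""
    # Heap entries: (current head ID, list number, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    while heap:
        current = heap[0][0]
        count = 0
        # Pop every list whose head equals the smallest ID, then advance it.
        while heap and heap[0][0] == current:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        yield current, count

# Lists for the grams bc and cd of the toy dataset.
merged = list(heap_merge_counts([[4, 5], [2, 3, 4]]))
```

Because each ID is complete as soon as it leaves the heap, its gram count is known at that moment, which is why no separate list of “open” IDs is needed in this style of traversal.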


Experimental Setting

Datasets:
  IMDB Actor Names ¹
    actor names and the number of movies they played in
    1.2 million actors, average name length 15
    weight is the number of movies (log normalized)
  WEB Corpus Word Grams ²
    sequences of English words and their frequency on the Web
    2.4 million sequences, average sequence length 20
    weight is the frequency (log normalized)

Jaccard similarity and normalized edit similarity, q = 3
Index and data are stored in main memory at all times
Implemented in C++ (GNU compiler) on Ubuntu Linux OS
Intel 2.40GHz PC, 2GB RAM

¹ http://www.imdb.com/interfaces
² http://www.ldc.upenn.edu/Catalog LDC2006T13
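
For readers unfamiliar with the measures named above, here is a short Python sketch of Jaccard similarity over 3-gram sets and of one possible log normalization. The slides do not give the exact normalization formula, so `log_normalize` is an assumption, and the experiments themselves were implemented in C++, not with this code.

```python
import math

def qgram_set(s, q=3):
    """Set of overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b, q=3):
    """Size of the gram-set intersection divided by the size of the union."""
    ga, gb = qgram_set(a, q), qgram_set(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0

def log_normalize(count, max_count):
    """Assumed normalization into [0, 1]; requires max_count > 0."""
    return math.log(1 + count) / math.log(1 + max_count)

sim = jaccard("Vernica", "Veronica")  # 3 shared grams out of 8 distinct ones
```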


Benefits of Skipping Elements

Figure: Average running time for top-10 queries. IMDB dataset with Jaccard similarity. Single-Pass Search (SPS) algorithm and SPS without skipping (SPS*). x-axis: Dataset Size (millions), 0.2 to 1.2; y-axis: Time (ms), 0 to 50.


Potential of the Two-Phase Algorithm

Figure: Running time for 3 top-10 queries (Q1, Q2, Q3). WEB Corpus dataset with normalized edit similarity. Two-Phase algorithm with different initial thresholds. x-axis: Queries; y-axis: Time (ms), 0 to 40.


Optimum Initial Threshold for the Two-Phase Algorithm

Figure: Average running time for top-10 queries. WEB Corpus dataset with normalized edit similarity. Single-Pass Search (SPS) algorithm, Two-Phase (2PH) algorithm, and 2PH with the optimum initial threshold (2PH Opt). x-axis: Dataset Size (millions), 0.4 to 2.4; y-axis: Time (ms), 0 to 100.

The Iterative Range Search algorithm was too expensive to be plotted. Its average running time was around 5 s.


Summary

Approximate ranking queries in string collections
Useful when there is a mismatch between query and data
We propose three approaches to solve the problem:

1 Use existing approximate range search algorithms as a “black box”
  Proves to be the most expensive
2 Use particularities of the top-k problem
  Proves to be very efficient
3 Combine (1) and (2) sequentially
  Proves to be slightly more efficient


The Flamingo Project

This work is part of The Flamingo Project at UC Irvine
http://flamingo.ics.uci.edu


A Quick Note on Related Work

Fagin et al. [1]
  similarity on multiple numerical attributes
  traverse list of IDs
  one list per attribute
  lists sorted on similarity to that attribute
  lists have different orders of IDs
  all IDs appear on all the lists

Our Setting
  similarity on one string attribute
  traverse list of IDs
  one list per q-gram
  lists sorted on global weight
  lists have the same order of IDs
  a subset of IDs appear on each list

[1] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.
