Answering Approximate Queries Efficiently

Chen Li, Department of Computer Science

Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica


2

30,000-Foot View of Info Systems

Query -> Data Repository (RDBMS, Search Engines, etc.) -> Answers matching conditions

3

Example: a movie database

Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

Find movies starring Samuel Jackson

4

How about our governor: Schwarrzenger?




The user doesn’t know the exact spelling!

5

Relaxing Conditions




Find movies with a star “similar to” Schwarrzenger.

6

In general: Gap between Queries and Facts

• Errors in the query
  – The user doesn’t remember a string exactly
  – The user unintentionally types a wrong string
• Errors in the database:
  – Data often is not clean by itself
  – Especially true in data integration and cleansing

Relation R (Star) | Relation S (Star)
Keanu Reeves | Keanu Reeves
Samuel Jackson | Samuel L. Jackson
Schwarzenegger | Schwarzenegger
Samuel Jackson | Samuel L. Jackson
… | …

7

“Did you mean…?” features in Search Engines

8

Answering Queries Approximately

What if we don’t want the user to change the query?

Query -> Data Repository (RDBMS, Search Engines, etc.) -> Answers matching conditions approximately

9

Technical Challenges

• How to relax conditions?
  – Name: “Schwarzenegger” vs “Schwarrzenger”
  – Salary: “in [50K,60K]” vs “in [49K,63K]”
• How to answer queries efficiently?
  – Index structures
  – Selectivity estimation

See our three recent VLDB papers

10

Rest of the talk

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
• Construction and maintenance of SEPIA
• Experiments
• Other works

11

Queries with Fuzzy String Predicates

• Stars: name similar to “Schwarrzenger”
• Employees: SSN similar to “430-87-7294”
• Customers: telephone number similar to “412-0964”
• Similar to:
  – a domain-specific function
  – returns a similarity value between two strings
• Examples:
  – Edit distance: ed(Schwarrzenger, Schwarzenegger)=3
  – Cosine similarity
  – Jaccard coefficient
  – Soundex
  – …

12

Example Similarity Function: Edit Distance

• A widely used metric to define string similarity
• ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
• Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2
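The definition of edit distance can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the talk; the function name `edit_distance` is an assumption.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two operations in the example are one substitution (m to n) and one deletion (the trailing s).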

13

Selectivity of Fuzzy Predicates

star SIMILARTO ’Schwarrzenger’
• Selectivity: # of records satisfying the predicate

14

Selectivity Estimation: Problem Formulation

A bag of strings

Input: fuzzy string predicate P(q, δ)

star SIMILARTO ’Schwarrzenger’

Output: # of strings s that satisfy dist(s,q) <= δ
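For reference, the exact answer to this problem can be computed by scanning the whole bag, which is the cost an estimator tries to avoid. The sketch below assumes edit distance as dist; the names `exact_selectivity` and the misspelled query string are hypothetical, and the bag reuses the Star column of the movie example.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_selectivity(strings, q, delta):
    """Number of strings s in the bag with dist(s, q) <= delta (full scan)."""
    return sum(1 for s in strings if edit_distance(s, q) <= delta)


stars = ["Keanu Reeves", "Samuel Jackson", "Schwarzenegger", "Samuel Jackson"]
print(exact_selectivity(stars, "Samuel Jakson", 1))  # 2
```

Because the input is a bag, both copies of the matching name are counted.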

15

Why Selectivity Estimation?

SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND year BETWEEN [1980,1989];

SELECT *
FROM Movies
WHERE star SIMILARTO ’…’
AND year BETWEEN [1970,1971];

The optimizer needs to know the selectivity of a predicate to decide a good plan.

16

Using traditional histograms?

• No “nice” order for strings
• Lexicographical order?
  – Similar strings could be far from each other: Kammy/Cammy
  – Adjacent strings have different selectivities: Cathy/Catherine

17

Outline

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Overview
  – Proximity between strings
  – Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works

18

Our approach: SEPIA

Selectivity Estimation of Approximate Predicates

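Intuition

[Figure: a cluster with pivot p and a string s, and a query string q, connected by edit vectors v1 and v2. A chart plots probability against edit distance (axis label: ed(p,s)) with values 10%, 28%, 44%, and 100% at distances 1, 2, 3, and 4.]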

19

Proximity between Strings

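Edit Distance? Not discriminative enough

[Figure: a cluster with pivot lucia and strings luciano and lucas, each at edit distance 2 from the pivot; the query string lukas is at edit distance 3 from the pivot.]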

20

Edit Vector from s1 to s2

• A vector <I, D, S> counting the operations in a sequence of edit operations from s1 to s2 whose length equals their edit distance
  – I: # of insertions
  – D: # of deletions
  – S: # of substitutions
• Properties
  – Easily computable
  – Not symmetric
  – Not unique, but tends to be (for ed <= 3, 91% are unique)
• Examples
  – lucia to luciano: <2,0,0>
  – lucia to lucas: <1,1,0>
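The non-uniqueness of edit vectors can be made concrete by enumerating the <I, D, S> vectors of all shortest edit scripts. The sketch below extends the usual edit-distance table with a set of vectors per cell; the function name `edit_vectors` is an assumption and this is not presented as the authors' implementation.

```python
def edit_vectors(s1: str, s2: str):
    """Return (edit distance, set of (I, D, S) vectors over all shortest edit scripts)."""
    n, m = len(s1), len(s2)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    vecs = [[set() for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                # Only insertions (i == 0) or only deletions (j == 0) are possible.
                dist[i][j] = i + j
                vecs[i][j] = {(j, i, 0)}
                continue
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            options = [
                (dist[i - 1][j] + 1, vecs[i - 1][j], (0, 1, 0)),            # deletion
                (dist[i][j - 1] + 1, vecs[i][j - 1], (1, 0, 0)),            # insertion
                (dist[i - 1][j - 1] + sub, vecs[i - 1][j - 1], (0, 0, sub)),  # match/substitution
            ]
            best = min(cost for cost, _, _ in options)
            dist[i][j] = best
            # Keep the vectors of every predecessor that achieves the minimum.
            for cost, prev, (di, dd, ds) in options:
                if cost == best:
                    vecs[i][j] |= {(a + di, b + dd, c + ds) for a, b, c in prev}
    return dist[n][m], vecs[n][m]


for target in ("luciano", "lucas"):
    d, vs = edit_vectors("lucia", target)
    print(target, d, sorted(vs))
# luciano 2 [(2, 0, 0)]
# lucas 2 [(0, 0, 2), (1, 1, 0)]
```

For lucia to lucas, two substitutions are as short as one deletion plus one insertion, so a system has to pick one vector consistently; the slide's <1,1,0> is one of the two.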

21

Why Edit Vector?

More discriminative

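[Figure: the same cluster labeled with edit vectors instead of distances: <2,0,0> between pivot lucia and luciano, <1,1,0> between lucia and lucas, and <1,1,1> between the query string lukas and the pivot.]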

22

SEPIA histograms: Overview

One frequency table per cluster (Cluster 1 with Pivot p1, Cluster 2 with Pivot p2, ..., Cluster k with Pivot pk), for example:

Edit Vector | # of Strings
<0,0,0> | 3
<0,1,0> | 40
... | ...

Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7
... | ...

Edit Vector | # of Strings
<0,0,0> | 2
<1,0,2> | 84
... | ...

Global PPD Table

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,0,1> | <1,1,0> | 1 | 30 | 9
<1,0,1> | <1,1,0> | 2 | 60 | 18
<1,0,1> | <1,1,0> | 3 | 100 | 30
<1,1,0> | <1,1,1> | 2 | 32 | 8
<1,1,0> | <1,1,1> | 3 | 76 | 19
<1,1,0> | <1,1,1> | 4 | 88 | 22
<1,1,0> | <1,1,1> | 5 | 100 | 25
… | … | … | … | …

23

Frequency table for each cluster

Cluster i, Pivot pi

Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7
... | ...

7 strings with an edit vector <0,1,0> from pi

24

Global PPD Table

Proximity Pair Distribution table

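Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,0,1> | <1,1,0> | 1 | 30 | 9
<1,0,1> | <1,1,0> | 2 | 60 | 18
<1,0,1> | <1,1,0> | 3 | 100 | 30
<1,1,0> | <1,1,1> | 2 | 32 | 8
<1,1,0> | <1,1,1> | 3 | 76 | 19
<1,1,0> | <1,1,1> | 4 | 88 | 22
<1,1,0> | <1,1,1> | 5 | 100 | 25
… | … | … | … | …

[Figure: a cluster with pivot p and string s, and a query string q, with vectors <1,0,1> and <1,1,0>. The chart (axis label: ed(p,s)) shows probability 30%, 60%, and 100% at edit distances 1, 2, and 3.]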

25

SEPIA histograms: summary

[Figure: one Frequency Table (Edit Vector, # of Strings) per cluster, for Cluster 1 with Pivot p1, Cluster 2 with Pivot p2, ..., Cluster k with Pivot pk, together with the Global PPD Table (Vector v1, Vector v2, Edit Distance, Percentage (%), Count).]

26

Selectivity Estimation: ed(lukas, 2)

[Figure: Cluster i with pivot lucia and query string lukas; the vector between them is [1,1,1]. Frequency Table i lists edit vector <0,1,0> with 40 strings. The Global PPD Table row for v1 = <1,1,1>, v2 = <0,1,0>, edit distance 2 gives 76% (count 19).]

Expected Contribution: 76% * 40

• Do it for all v2 vectors in each cluster, for all clusters
• Take the sum of these contributions

27

Selectivity Estimation for ed(q,d)

• For each cluster Ci
  • For each v2 in frequency table of Ci
    • Use (v1,v2,d) to lookup PPD
• Take the sum of these f * N
• Pruning possible (triangle inequality)

[Figure: Cluster i with its pivot and the query string q, with vectors v1 and v2. Frequency Table i maps v2 to its # of strings N; the Global PPD Table maps (v1, v2, d) to a percentage f. Expected Contribution: f * N]
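The loop above can be sketched as follows. The data layout is an assumption made for illustration: each cluster is a (pivot, frequency table) pair, the PPD is a dictionary keyed by (v1, v2, d), and a missing PPD entry contributes nothing. The function name `estimate_selectivity` and the `edit_vector` callback are hypothetical; pruning by the triangle inequality is left out.

```python
def estimate_selectivity(q, d, clusters, ppd, edit_vector):
    """Sum f * N over every v2 entry of every cluster's frequency table.

    clusters:    list of (pivot, {v2: N}) pairs
    ppd:         {(v1, v2, d): f}, with f the fraction of strings within distance d
    edit_vector: callable giving the vector v1 between q and a pivot
    """
    total = 0.0
    for pivot, freq_table in clusters:
        v1 = edit_vector(q, pivot)
        for v2, n in freq_table.items():
            f = ppd.get((v1, v2, d), 0.0)  # assumption: a PPD miss counts as 0
            total += f * n
    return total


# Numbers from the ed(lukas, 2) example: one frequency-table entry, one PPD row.
clusters = [("lucia", {(0, 1, 0): 40})]
ppd = {((1, 1, 1), (0, 1, 0), 2): 0.76}
known_vectors = {("lukas", "lucia"): (1, 1, 1)}  # stands in for a real edit-vector routine

estimate = estimate_selectivity("lukas", 2, clusters, ppd,
                                lambda q, p: known_vectors[(q, p)])
print(round(estimate, 1))  # 30.4
```

With only the excerpted entry, the estimate equals the single expected contribution 76% * 40.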

28

Outline

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
• Construction and maintenance of SEPIA
• Experiments
• Other works



29

Clustering Strings

Two example algorithms
• Lexicographic order based
• K-Medoids
  – Choose initial pivots
  – Assign each string to its closest pivot
  – Swap a pivot with another string
  – Reassign the strings
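The K-Medoids steps can be sketched as a greedy swap loop over edit distance. This is a small illustration under assumptions, not the talk's implementation: pivots start as a random sample, a swap is kept only if it lowers the total distance of strings to their pivots, and every name (`assign`, `k_medoids`, the sample strings beyond those on the slides) is hypothetical.

```python
import random


def edit_distance(s1, s2):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def assign(strings, pivots):
    """Assign each string to its closest pivot; return the clusters and total cost."""
    clusters = {p: [] for p in pivots}
    cost = 0
    for s in strings:
        p = min(pivots, key=lambda x: edit_distance(s, x))
        clusters[p].append(s)
        cost += edit_distance(s, p)
    return clusters, cost


def k_medoids(strings, k, seed=0):
    rng = random.Random(seed)
    pivots = rng.sample(sorted(set(strings)), k)  # choose initial pivots
    clusters, cost = assign(strings, pivots)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sorted(set(strings) - set(pivots)):
                trial = pivots[:i] + [cand] + pivots[i + 1:]  # swap a pivot with another string
                t_clusters, t_cost = assign(strings, trial)   # reassign the strings
                if t_cost < cost:
                    pivots, clusters, cost = trial, t_clusters, t_cost
                    improved = True
    return clusters


names = ["lucia", "lucas", "luciano", "lukas", "kammy", "cammy", "cathy", "tammy"]
for pivot, members in k_medoids(names, 2).items():
    print(pivot, members)
```

The loop stops because the total cost is a non-negative integer that strictly decreases with every accepted swap. Each pass recomputes all assignments, so this form only suits small inputs.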

30

Number of Clusters

It affects:
• Cluster quality
  – Similarity of strings within each cluster
• Costs:
  – Space
  – Estimation time

31

Constructing Frequency Tables

• For each cluster, group strings based on their edit vector from the pivot

• Count the frequency for each group

[Figure: Cluster i with Pivot pi; two strings are labeled with the edit vector [0,1,0] from the pivot.]
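Grouping and counting can be sketched with a counter keyed by edit vector. The helper `edit_vector` below returns one vector of a shortest edit script and breaks ties arbitrarily, which is an assumption; the cluster members other than lucia, luciano, and lucas are made up for the example.

```python
from collections import Counter


def edit_vector(s1, s2):
    """One (I, D, S) vector of a shortest edit script from s1 to s2 (ties broken arbitrarily)."""
    n, m = len(s1), len(s2)
    # best[i][j] = (distance, I, D, S) for turning s1[:i] into s2[:j]
    best = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                best[i][j] = (i + j, j, i, 0)  # j insertions or i deletions
                continue
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            d0, a0, b0, c0 = best[i - 1][j - 1]
            d1, a1, b1, c1 = best[i - 1][j]
            d2, a2, b2, c2 = best[i][j - 1]
            best[i][j] = min((d0 + sub, a0, b0, c0 + sub),  # match or substitution
                             (d1 + 1, a1, b1 + 1, c1),      # deletion
                             (d2 + 1, a2 + 1, b2, c2))      # insertion
    return best[n][m][1:]


def frequency_table(pivot, members):
    """Count the strings of a cluster per edit vector from the pivot."""
    return Counter(edit_vector(pivot, s) for s in members)


table = frequency_table("lucia", ["lucia", "luciano", "lucian", "lucias", "lucas"])
print(table[(0, 0, 0)], table[(1, 0, 0)], table[(2, 0, 0)])  # 1 2 1
```

The pivot itself lands in the <0,0,0> group, and the two strings that are one insertion away share the <1,0,0> group.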

32

Constructing PPD Table

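• Get enough samples of string triplets (q,p,s)
• Propose a few heuristics
  – ALL_RAND
  – CLOSE_RAND
  – CLOSE_LEX
  – CLOSE_UNIQUE

[Figure: q is taken from a collection of q strings; pivot p and string s come from a set of clusters. Each triplet yields vectors v1 and v2 and feeds a distribution over edit distance (axis label: ed(p,s)); the chart shows 10%, 28%, 44%, and 100% at distances 1, 2, 3, and 4.]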
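A PPD table with cumulative percentages can be sketched from triplets (q,p,s) as below. Several points are assumptions: the sketch enumerates every triplet of a small input instead of sampling with one of the listed heuristics, v1 is taken from q to p and v2 from p to s, and the names `edit_info` and `build_ppd` are hypothetical.

```python
from collections import defaultdict


def edit_info(s1, s2):
    """Return (edit distance, one (I, D, S) vector of a shortest edit script)."""
    n, m = len(s1), len(s2)
    best = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                best[i][j] = (i + j, j, i, 0)
                continue
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            d0, a0, b0, c0 = best[i - 1][j - 1]
            d1, a1, b1, c1 = best[i - 1][j]
            d2, a2, b2, c2 = best[i][j - 1]
            best[i][j] = min((d0 + sub, a0, b0, c0 + sub),
                             (d1 + 1, a1, b1 + 1, c1),
                             (d2 + 1, a2 + 1, b2, c2))
    return best[n][m][0], best[n][m][1:]


def build_ppd(queries, clusters):
    """clusters: {pivot: members}. Returns {(v1, v2): {d: fraction with ed(q, s) <= d}}."""
    hist = defaultdict(lambda: defaultdict(int))
    for q in queries:
        for p, members in clusters.items():
            _, v1 = edit_info(q, p)
            for s in members:
                _, v2 = edit_info(p, s)
                d, _ = edit_info(q, s)
                hist[(v1, v2)][d] += 1  # one sample triplet (q, p, s)
    ppd = {}
    for key, counts in hist.items():
        total = sum(counts.values())
        running, row = 0, {}
        for d in sorted(counts):
            running += counts[d]
            row[d] = running / total  # cumulative, like the Percentage column
        ppd[key] = row
    return ppd


ppd = build_ppd(["lukas", "lucio"], {"lucia": ["lucia", "luciano", "lucas"]})
for (v1, v2), row in sorted(ppd.items()):
    print(v1, v2, row)
```

Each row ends at 1.0 (100%) at the largest distance observed for that vector pair, matching the shape of the PPD rows shown earlier.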

33

Dynamic Maintenance: Frequency Table

Take insertion as an example

Cluster i, Pivot pi: a new string has edit vector [0,1,0] from pi, so the count for <0,1,0> goes from 7 to 8.

Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7 -> 8
... | ...

34

Dynamic Maintenance: PPD

[Figure: a new string s is added to one of the clusters (pivot p) used in the construction of the PPD, with ed(p,s)=2 and vector v2. For a q from the collection of q strings used in the construction of the PPD, with vector v1, the counts of the affected (v1, v2) rows of the PPD table are incremented (+1) and the percentages adjusted.]

35

Improving Estimation Accuracy

• Reasons for estimation errors
  – Missed hits in the PPD
  – Inaccurate percentage entries in the PPD
• Improvement: use sample fuzzy predicates to analyze their estimation errors

Sample predicates: P1(tommy,2), P2(james,3), P3(jordan,2), P4(david,2)

Real | Estimate | Relative Error
500 | 750 | +50%
500 | 300 | -40%
600 | 600 | 0%
400 | 600 | +50%

[Chart: probability (25%, 50%) of each relative error (-40%, 0%, +50%)]
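The error analysis can be sketched as computing the relative error of each sample predicate and then the share of predicates at each error level. The pairing of Real and Estimate values below is inferred from the listed relative errors (it is the only pairing that reproduces them); the variable names are hypothetical.

```python
from collections import Counter

# (real, estimate) pairs from the sample predicates; pairing inferred from the errors.
samples = [(500, 750), (500, 300), (600, 600), (400, 600)]


def relative_error(real, estimate):
    """Signed relative error: (f_est - f_real) / f_real."""
    return (estimate - real) / real


errors = [round(relative_error(r, e) * 100) for r, e in samples]
print(errors)  # [50, -40, 0, 50]

# Share of sample predicates at each relative-error level.
distribution = {err: n / len(errors) for err, n in sorted(Counter(errors).items())}
print(distribution)  # {-40: 0.25, 0: 0.25, 50: 0.5}
```

A distribution like this is the raw material for a model that corrects later estimates.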

36

Relative-Error Model

• Use the errors to build a model
• Use the model to adjust initial estimation

d: threshold; L: query string length; IE: Initial estimate

[Figure: a decision tree that branches on d (d = 1, d = 2, d = 3, ...), on L (1<=L<=5 vs L>=6), and on IE (0<=IE<=40 vs IE>=41); the leaves hold relative errors such as -15%, -20%, +17%, -8%, 1%, +12%, -23%, and +25%.]

37

Outline

• Motivation: selectivity estimation of fuzzy predicates
• Our approach: SEPIA
• Construction and maintenance of SEPIA
• Experiments
• Other works



38

Data

• Citeseer:
  – 71K author names
  – Length: [2,20], avg = 12
• Movie records from UCI KDD repository:
  – 11K movie titles
  – Length: [3,80], avg = 35
• Introduced duplicates:
  – 10% of records
  – # of duplicates: [1,20], uniform
• Final results:
  – Citeseer: 142K author names
  – UCI KDD: 23K movie titles

39

Setting

• Test bed
  – PC: 2.4 GHz P4, 1.2GB RAM, Windows XP
  – Visual C++ compiler
• Query workload:
  – Strings from the data
  – Strings not in the data
  – Results similar
• Quality measurements
  – Relative error: (f_est – f_real) / f_real
  – Absolute relative error: |f_est – f_real| / f_real

40

Clustering Algorithms

[Bar chart comparing k-Medoids and Lexicographic clustering on Clustering Time (sec), Estimation Time (ms), and Average Absolute Relative Error (%); bar labels: 217, 45, 18, 120, 47, 37]

K-Medoids is better

41

Quartile distribution of relative errors

[Chart: Percentage in Workload (0 to 1) against Relative Error (%), with ticks at -100, -75, -50, -25, 0, 25, 50, 75, 100, and Infinity]

Data set 1. CLOSE_RAND; 1000 clusters

42

Number of Clusters

43

Effectiveness of Applying Relative-Error Model

[Bar chart: Average Absolute Relative Error (%) for Data set 1 and Data set 2, Without Error Correction and With Error Correction; bar labels: 18, 25, 10, 12]

44

Dynamic Maintenance

45

Other work 1: Relaxing SQL queries with Selections/Joins

SELECT *
FROM Jobs J, Candidates C
WHERE J.Salary <= 95
  AND J.Zipcode = C.Zipcode
  AND C.WorkYear >= 5

Jobs

JID | Company | Zipcode | Salary
r1 | Broadcom | 92047 | 80
r2 | Intel | 93652 | 95
r3 | Microsoft | 82632 | 120
r4 | IBM | 90391 | 130
... | … | … | …

Candidates

CID | Zipcode | ExpSalary | WorkYear
s1 | 93652 | 120 | 3
s2 | 92612 | 130 | 6
s3 | 82632 | 100 | 5
s4 | 90391 | 150 | 1
... | … | … | …

46

Query Relaxation: Skyline!

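[Figure: a lattice over the subsets of {R, J, S}: {}, R, J, S, RJ, RS, SJ, RSJ. A plot with axes J.Salary and C.WorkYear marks the original conditions J.Salary <= 95 and C.WorkYear >= 5.]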
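One way to read the skyline idea on the Jobs/Candidates example is sketched below. It rests on assumptions not stated on the slide: the join on Zipcode is kept exact, only the two selection conditions are relaxed, and a pair's relaxation is measured as how far Salary exceeds 95 and how far WorkYear falls short of 5. The function names are hypothetical.

```python
jobs = [("r1", "Broadcom", "92047", 80), ("r2", "Intel", "93652", 95),
        ("r3", "Microsoft", "82632", 120), ("r4", "IBM", "90391", 130)]
candidates = [("s1", "93652", 120, 3), ("s2", "92612", 130, 6),
              ("s3", "82632", 100, 5), ("s4", "90391", 150, 1)]


def relaxations(max_salary=95, min_years=5):
    """For each joining pair, how much each selection condition must be relaxed."""
    out = []
    for jid, _, jzip, salary in jobs:
        for cid, czip, _, years in candidates:
            if jzip == czip:  # assumption: the join condition is not relaxed
                out.append(((jid, cid),
                            (max(0, salary - max_salary), max(0, min_years - years))))
    return out


def dominates(a, b):
    """a dominates b if it needs no more relaxation anywhere and differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def skyline(points):
    """Pairs whose relaxation vector is not dominated by any other pair's."""
    return [(key, v) for key, v in points
            if not any(dominates(w, v) for _, w in points)]


print(skyline(relaxations()))
# [(('r2', 's1'), (0, 2)), (('r3', 's3'), (25, 0))]
```

The pair (r4, s4) needs relaxation (35, 4) and is dominated by (r2, s1), so it is left out of the skyline.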

47

Other work 2: Fuzzy predicates on attributes of mixed types

SELECT *
FROM Movies
WHERE star SIMILARTO ’…’
AND |year – 1977| <= 3;

48

Mixed-Typed Predicates

• String attributes: edit distance
• Numeric attributes: absolute numeric difference

SELECT *
FROM Movies
WHERE star SIMILARTO ’…’
AND |year – 1977| <= 3;

49

MAT-tree: Intuition

• Indexing on two attributes is more effective than two separate indexing structures
• Numeric attribute: B-tree
• String attribute: tree-based index structure?

50

MAT-tree: Overview

• Tree-based indexing structure:
  – Each node has an MBR for both the numeric attribute and the string attribute
• Compressing strings as a “compressed trie” that fits into a limited space
• An edit distance between a string and a compressed trie can be computed
• Experiments show that MAT-tree is very efficient

[Figure: an example MAT-tree. Leaf nodes hold (string, year) records: Spielberg 1946, Hanks 1956, Gibson 1956, Hanks 1957, Crowe 1964, Robert 1968, DiCaprio 1974, Roberrts 1977. Each node entry carries an MBR with a numeric range plus an entry marked *: <1946,1956>, <1956,1957>, <1964,1968>, and <1974,1977> above the leaves; <1946,1957> and <1964,1977> at the root.]

51

Conclusion

• It’s important to support answering approximate queries efficiently
• Our results so far:
  – SEPIA: provides accurate selectivity estimation for fuzzy string predicates
  – Relaxing SQL queries with selections and joins
  – MAT-tree: indexing structure supporting fuzzy queries with mixed-type predicates
