Answering Approximate Queries Efficiently

Chen Li, Department of Computer Science. Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica.


Post on 21-Jan-2016

Page 1: Answering Approximate Queries Efficiently

Chen Li
Department of Computer Science

Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica

Answering Approximate Queries Efficiently

Page 2: Answering Approximate Queries Efficiently

30,000-Foot View of Info Systems

Query -> Data Repository (RDBMS, Search Engines, etc.) -> Answers matching conditions

Page 3: Answering Approximate Queries Efficiently

Example: a movie database

Star           | Title                                        | Year | Genre
Keanu Reeves   | The Matrix                                   | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator                               | 1984 | Sci-Fi
Samuel Jackson | Goodfellas                                   | 1990 | Drama
…              | …                                            | …    | …

Tom: Find movies starring Samuel Jackson

Page 4: Answering Approximate Queries Efficiently

How about our governor: Schwarrzenger?

Star           | Title                                        | Year | Genre
Keanu Reeves   | The Matrix                                   | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator                               | 1984 | Sci-Fi
Samuel Jackson | Goodfellas                                   | 1990 | Drama
…              | …                                            | …    | …

The user doesn’t know the exact spelling!

Page 5: Answering Approximate Queries Efficiently

Relaxing Conditions

Star           | Title                                        | Year | Genre
Keanu Reeves   | The Matrix                                   | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator                               | 1984 | Sci-Fi
Samuel Jackson | Goodfellas                                   | 1990 | Drama
…              | …                                            | …    | …

Find movies with a star “similar to” Schwarrzenger.

Page 6: Answering Approximate Queries Efficiently

In general: Gap between Queries and Facts

• Errors in the query
  – The user doesn’t remember a string exactly
  – The user unintentionally types a wrong string
• Errors in the database:
  – Data often is not clean by itself
  – Especially true in data integration and cleansing

Relation R (Star): Keanu Reeves; Samuel Jackson; Schwarzenegger; Samuel Jackson
Relation S (Star): Keanu Reeves; Samuel L. Jackson; Schwarzenegger; Samuel L. Jackson

Page 7: Answering Approximate Queries Efficiently


“Did you mean…?” features in Search Engines

Page 8: Answering Approximate Queries Efficiently

What if we don’t want the user to change the query?
Answering Queries Approximately

Query -> Data Repository (RDBMS, Search Engines, etc.) -> Answers matching conditions approximately

Page 9: Answering Approximate Queries Efficiently

Technical Challenges

• How to relax conditions?
  – Name: “Schwarzenegger” vs “Schwarrzenger”
  – Salary: “in [50K,60K]” vs “in [49K,63K]”
• How to answer queries efficiently?
  – Index structures
  – Selectivity estimation

See our three recent VLDB papers

Page 10: Answering Approximate Queries Efficiently

Rest of the talk

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
• Construction and maintenance of SEPIA
• Experiments
• Other works

Page 11: Answering Approximate Queries Efficiently

Queries with Fuzzy String Predicates

• Stars: name similar to “Schwarrzenger”
• Employees: SSN similar to “430-87-7294”
• Customers: telephone number similar to “412-0964”

• Similar to:
  – a domain-specific function
  – returns a similarity value between two strings
• Examples:
  – Edit distance: ed(Schwarrzenger, Schwarzenegger) = 3 (delete one “r”, insert “e” and “g”)
  – Cosine similarity
  – Jaccard coefficient distance
  – Soundex
  – …

Page 12: Answering Approximate Queries Efficiently

Example Similarity Function: Edit Distance

• A widely used metric to define string similarity
• ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
• Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2
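The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not the code used in the talk; the function name `edit_distance` is chosen here for the example.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to change s1 into s2 (Levenshtein distance)."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The slide's example needs one substitution (m to n) and one deletion (the trailing s), which is what the call prints.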

Page 13: Answering Approximate Queries Efficiently

Selectivity of Fuzzy Predicates

star SIMILARTO ’Schwarrzenger’
• Selectivity: # of records satisfying the predicate

Star           | Title                                        | Year | Genre
Keanu Reeves   | The Matrix                                   | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator                               | 1984 | Sci-Fi
Samuel Jackson | Goodfellas                                   | 1990 | Drama
…              | …                                            | …    | …

Page 14: Answering Approximate Queries Efficiently


Selectivity Estimation: Problem Formulation

A bag of strings

Input: fuzzy string predicate P(q, δ)

star SIMILARTO ’Schwarrzenger’

Output: # of strings s that satisfy dist(s,q) <= δ
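For reference, the exact answer to this problem can be computed by scanning the whole bag, which is the cost an estimator is meant to avoid. The sketch below assumes dist is edit distance; the names `exact_selectivity` and the query string "Samuel Jakson" are made up for the illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via the usual row-by-row recurrence."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_selectivity(bag, q, delta):
    """Count the strings s in the bag with dist(s, q) <= delta.
    Duplicates count separately because the input is a bag."""
    return sum(1 for s in bag if edit_distance(s, q) <= delta)


stars = ["Keanu Reeves", "Samuel Jackson", "Schwarzenegger", "Samuel Jackson"]
count = exact_selectivity(stars, "Samuel Jakson", 1)  # hypothetical query
print(count)  # 2
```

Every predicate costs one distance computation per string here, so this only serves as ground truth when measuring an estimator's error.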

Page 15: Answering Approximate Queries Efficiently


Why Selectivity Estimation?

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND year BETWEEN [1980,1989];

Movies

Star           | Title                                        | Year | Genre
Keanu Reeves   | The Matrix                                   | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator                               | 1984 | Sci-Fi
Samuel Jackson | Goodfellas                                   | 1990 | Drama
…              | …                                            | …    | …

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND year BETWEEN [1970,1971];

The optimizer needs to know the selectivity of a predicate to decide a good plan.

Page 16: Answering Approximate Queries Efficiently

Using traditional histograms?

• No “nice” order for strings
• Lexicographical order?
  – Similar strings could be far from each other: Kammy/Cammy
  – Adjacent strings have different selectivities: Cathy/Catherine

Page 17: Answering Approximate Queries Efficiently

Outline

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Overview
  – Proximity between strings
  – Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works

Page 18: Answering Approximate Queries Efficiently

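Our approach: SEPIA

Selectivity Estimation of Approximate Predicates

Intuition

[Figure: a cluster with pivot p and string s, and a query string q; v1 labels the q-p edge and v2 the p-s edge; a histogram with axis label ed(p,s) shows probabilities 10%, 28%, 44%, 100% at 1, 2, 3, 4]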

Page 19: Answering Approximate Queries Efficiently

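Proximity between Strings

[Figure: a cluster with pivot lucia containing luciano and lucas, each at edit distance 2 from the pivot; the query string lukas is at edit distance 3 from the pivot]

Edit Distance? Not discriminative enough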

Page 20: Answering Approximate Queries Efficiently

Edit Vector from s1 to s2

• A vector <I, D, S>
  – I: # of insertions
  – D: # of deletions
  – S: # of substitutions
  in a sequence of edit operations whose length equals the edit distance
• Properties
  – Easily computable
  – Not symmetric
  – Not unique, but tends to be (for ed <= 3, 91% unique)

Example: lucia to luciano is <2,0,0>; lucia to lucas is <1,1,0>
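One way to compute an edit vector is to fill the edit-distance table and walk back along one optimal path, counting each kind of operation. The sketch below is an illustration under an assumption the slides do not state: when several optimal sequences exist, it prefers insertion, then deletion, then substitution. The name `edit_vector` is chosen here.

```python
def edit_vector(s1: str, s2: str) -> tuple:
    """Return (I, D, S): insertions, deletions, and substitutions in one
    minimum-length edit sequence that changes s1 into s2."""
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # match / substitute
    ins = dele = sub = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
            i, j = i - 1, j - 1            # matching characters cost nothing
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins, j = ins + 1, j - 1        # tie-break: insertion first
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele, i = dele + 1, i - 1      # then deletion
        else:
            sub, i, j = sub + 1, i - 1, j - 1
    return (ins, dele, sub)


print(edit_vector("lucia", "luciano"))  # (2, 0, 0)
print(edit_vector("lucia", "lucas"))    # (1, 1, 0)
```

With a different tie-breaking rule, lucia to lucas could also come out as two substitutions, which is the non-uniqueness the slide mentions; the components always sum to the edit distance.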

Page 21: Answering Approximate Queries Efficiently

Why Edit Vector?

More discriminative

[Figure: cluster with pivot lucia; edit vectors from the pivot: luciano <2,0,0>, lucas <1,1,0>; the query string lukas has edit vector <1,1,1>]

Page 22: Answering Approximate Queries Efficiently

SEPIA histograms: Overview

Frequency tables, one per cluster (Cluster 1 with pivot p1, Cluster 2 with pivot p2, ..., Cluster k with pivot pk). Three example tables:

Edit Vector | # of Strings
<0,0,0>     | 3
<0,1,0>     | 40
...         | ...

Edit Vector | # of Strings
<0,0,0>     | 4
<0,0,1>     | 12
<0,1,0>     | 7
...         | ...

Edit Vector | # of Strings
<0,0,0>     | 2
<1,0,2>     | 84
...         | ...

Global PPD Table

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,0,1>   | <1,1,0>   | 1             | 30             | 9
<1,0,1>   | <1,1,0>   | 2             | 60             | 18
<1,0,1>   | <1,1,0>   | 3             | 100            | 30
<1,1,0>   | <1,1,1>   | 2             | 32             | 8
<1,1,0>   | <1,1,1>   | 3             | 76             | 19
<1,1,0>   | <1,1,1>   | 4             | 88             | 22
<1,1,0>   | <1,1,1>   | 5             | 100            | 25
…         | …         | …             | …              | …

Page 23: Answering Approximate Queries Efficiently

Frequency table for each cluster

Cluster i, Pivot pi

Edit Vector | # of Strings
<0,0,0>     | 4
<0,0,1>     | 12
<0,1,0>     | 7
...         | ...

7 strings with an edit vector <0,1,0> from pi

Page 24: Answering Approximate Queries Efficiently

Global PPD Table

Proximity Pair Distribution table

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,0,1>   | <1,1,0>   | 1             | 30             | 9
<1,0,1>   | <1,1,0>   | 2             | 60             | 18
<1,0,1>   | <1,1,0>   | 3             | 100            | 30
<1,1,0>   | <1,1,1>   | 2             | 32             | 8
<1,1,0>   | <1,1,1>   | 3             | 76             | 19
<1,1,0>   | <1,1,1>   | 4             | 88             | 22
<1,1,0>   | <1,1,1>   | 5             | 100            | 25
…         | …         | …             | …              | …

[Figure: a cluster with pivot p and string s, and a query string q; v1 = <1,0,1> and v2 = <1,1,0>; a histogram with axis label ed(p,s) shows probabilities 30%, 60%, 100% at 1, 2, 3]

Page 25: Answering Approximate Queries Efficiently

SEPIA histograms: summary

Frequency tables, one per cluster (Cluster 1 with pivot p1, Cluster 2 with pivot p2, ..., Cluster k with pivot pk). Three example tables:

Edit Vector | # of Strings
<0,0,0>     | 4
<0,0,1>     | 12
<0,1,0>     | 7
...         | ...

Edit Vector | # of Strings
<0,0,0>     | 3
<0,1,0>     | 40
...         | ...

Edit Vector | # of Strings
<0,0,0>     | 2
<1,0,2>     | 84
...         | ...

Global PPD Table

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,0,1>   | <1,1,0>   | 1             | 30             | 9
<1,0,1>   | <1,1,0>   | 2             | 60             | 18
<1,0,1>   | <1,1,0>   | 3             | 100            | 30
<1,1,0>   | <1,1,1>   | 2             | 32             | 8
<1,1,0>   | <1,1,1>   | 3             | 76             | 19
<1,1,0>   | <1,1,1>   | 4             | 88             | 22
<1,1,0>   | <1,1,1>   | 5             | 100            | 25
…         | …         | …             | …              | …

Page 26: Answering Approximate Queries Efficiently

Selectivity Estimation: ed(lukas, 2)

Cluster i: pivot lucia, query string lukas, v1 = <1,1,1>

Frequency Table i

Edit Vector | # of Strings
<0,1,0>     | 40
...         | ...

Global PPD Table

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,1,1>   | <0,1,0>   | 2             | 76             | 19
...       | ...       | ...           | ...            | ...

Expected Contribution: 76% * 40

• Do it for all v2 vectors in each cluster, for all clusters
• Take the sum of these contributions

Page 27: Answering Approximate Queries Efficiently

Selectivity Estimation for ed(q,d)

• For each cluster Ci
  • For each v2 in frequency table of Ci
    • Use (v1,v2,d) to lookup PPD
• Take the sum of these f * N
• Pruning possible (triangle inequality)

[Figure: cluster i with its pivot and the query q, with v1 between q and the pivot; Frequency Table i maps each edit vector v2 to its # of strings N; the Global PPD Table maps (v1, v2, edit distance d) to a percentage f; Expected Contribution: f * N]
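The loop on this slide can be sketched as below. The data layout is an assumption made for the illustration: each frequency table is a dict from edit vector v2 to count N, the PPD is a dict keyed by (v1, v2, d) holding the fraction f, and v1 for each cluster is passed in precomputed. A missing PPD entry is treated as contributing 0, and the triangle-inequality pruning is left out.

```python
def estimate_selectivity(v1_by_cluster, freq_tables, ppd, d):
    """Sum f * N over every edit vector v2 of every cluster.

    v1_by_cluster: cluster id -> edit vector between the query and the pivot
    freq_tables:   cluster id -> {v2: number of strings N}
    ppd:           {(v1, v2, d): fraction f of pairs within distance d}
    """
    total = 0.0
    for cluster_id, table in freq_tables.items():
        v1 = v1_by_cluster[cluster_id]
        for v2, n in table.items():
            f = ppd.get((v1, v2, d), 0.0)  # assumption: a miss contributes 0
            total += f * n
    return total


# Numbers from the ed(lukas, 2) example: v1 = <1,1,1>, v2 = <0,1,0>,
# 40 strings, and a PPD entry of 76% at distance 2.
freq_tables = {"i": {(0, 1, 0): 40}}
v1_by_cluster = {"i": (1, 1, 1)}
ppd = {((1, 1, 1), (0, 1, 0), 2): 0.76}
estimate = estimate_selectivity(v1_by_cluster, freq_tables, ppd, 2)
print(round(estimate, 1))  # 30.4
```

With more clusters and more rows per table, the same call adds up one f * N term per (cluster, v2) pair.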

Page 28: Answering Approximate Queries Efficiently

Outline

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Overview
  – Proximity between strings
  – Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works

Page 29: Answering Approximate Queries Efficiently

Clustering Strings

Two example algorithms
• Lexicographic order based
• K-Medoids
  – Choose initial pivots
  – Assign each string to its closest pivot
  – Swap a pivot with another string
  – Reassign the strings
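The four K-Medoids steps can be sketched as a randomized swap loop over edit distance. This is a simplified illustration, not the implementation evaluated in the talk: the parameters `iterations` and `seed`, the rule of keeping a swap only when the total distance drops, and the deduplication of the input are all assumptions.

```python
import random


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via the usual row-by-row recurrence."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def k_medoids(strings, k, iterations=20, seed=0):
    """Return {pivot: [strings assigned to it]} for k pivots."""
    strings = sorted(set(strings))   # assumption: cluster distinct strings
    rng = random.Random(seed)
    pivots = rng.sample(strings, k)  # step 1: choose initial pivots

    def assign(current_pivots):
        """Steps 2 and 4: give every string to its closest pivot."""
        clusters = {p: [] for p in current_pivots}
        cost = 0
        for s in strings:
            best = min(current_pivots, key=lambda p: edit_distance(p, s))
            clusters[best].append(s)
            cost += edit_distance(best, s)
        return clusters, cost

    clusters, cost = assign(pivots)
    for _ in range(iterations):
        # Step 3: swap one pivot with another string.
        candidate = rng.choice(strings)
        if candidate in pivots:
            continue
        i = rng.randrange(k)
        trial = pivots[:i] + [candidate] + pivots[i + 1:]
        trial_clusters, trial_cost = assign(trial)
        if trial_cost < cost:        # keep the swap only if it helps
            pivots, clusters, cost = trial, trial_clusters, trial_cost
    return clusters


names = ["lucia", "lucas", "luciano", "lukas", "tommy", "tomy", "tom",
         "jordan", "jordon"]
clusters = k_medoids(names, 3)
for pivot, members in clusters.items():
    print(pivot, members)
```

Each reassignment costs one distance computation per (string, pivot) pair, which is one reason clustering time grows with the number of clusters.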

Page 30: Answering Approximate Queries Efficiently

Number of Clusters

It affects:
• Cluster quality
  – Similarity of strings within each cluster
• Costs:
  – Space
  – Estimation time

Page 31: Answering Approximate Queries Efficiently

Constructing Frequency Tables

• For each cluster, group strings based on their edit vector from the pivot
• Count the frequency for each group

[Figure: Cluster i with pivot pi; two strings that share the edit vector [0,1,0] from pi fall into the same group]
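The two bullets amount to a group-and-count over edit vectors, sketched below with `collections.Counter`. The helper names, the tie-breaking inside `edit_vector` (insertion, then deletion, then substitution), and the sample strings are assumptions for the illustration. `insert_string` shows how the same table could absorb a newly inserted string.

```python
from collections import Counter


def edit_vector(s1: str, s2: str) -> tuple:
    """(I, D, S) counts along one minimum-length edit sequence s1 -> s2."""
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    ins = dele = sub = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins, j = ins + 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele, i = dele + 1, i - 1
        else:
            sub, i, j = sub + 1, i - 1, j - 1
    return (ins, dele, sub)


def build_frequency_table(pivot, cluster):
    """Group the cluster's strings by edit vector from the pivot and count."""
    return Counter(edit_vector(pivot, s) for s in cluster)


def insert_string(table, pivot, new_string):
    """Maintenance on insertion: bump the count of the new string's vector."""
    table[edit_vector(pivot, new_string)] += 1


# Hypothetical cluster around the pivot "lucia".
table = build_frequency_table("lucia", ["lucia", "luci", "luca", "lucib"])
insert_string(table, "lucia", "ucia")
print(dict(table))
```

Only the counts are stored, so the table stays small even when the cluster holds many strings.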

Page 32: Answering Approximate Queries Efficiently

Constructing PPD Table

• Get enough samples of string triplets (q,p,s)
• Propose a few heuristics
  – ALL_RAND
  – CLOSE_RAND
  – CLOSE_LEX
  – CLOSE_UNIQUE

[Figure: a collection of q strings and a set of clusters; for a query string q, pivot p, and string s, v1 labels the q-p edge and v2 the p-s edge; a histogram with axis label ed(p,s) shows probabilities 10%, 28%, 44%, 100% at 1, 2, 3, 4]
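Given sampled triplets, a PPD-style table can be tallied as sketched below. Several details are assumptions not fixed by the slides: v1 is taken as the edit vector from q to p, v2 as the edit vector from p to s, percentages are cumulative over the distance between q and s, and `max_d` caps the distances that get a row. How the triplets are sampled (ALL_RAND, CLOSE_RAND, and so on) is outside this sketch.

```python
from collections import defaultdict


def edit_vector(s1: str, s2: str) -> tuple:
    """(I, D, S) counts along one minimum-length edit sequence s1 -> s2."""
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    ins = dele = sub = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins, j = ins + 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele, i = dele + 1, i - 1
        else:
            sub, i, j = sub + 1, i - 1, j - 1
    return (ins, dele, sub)


def build_ppd(triplets, max_d):
    """Return {(v1, v2, d): fraction of sampled (q, p, s) triplets with that
    (v1, v2) pair whose distance between q and s is at most d}."""
    counts = defaultdict(lambda: defaultdict(int))
    for q, p, s in triplets:
        v1 = edit_vector(q, p)           # assumed direction: query to pivot
        v2 = edit_vector(p, s)           # pivot to string, as in the tables
        d_qs = sum(edit_vector(q, s))    # components sum to the edit distance
        counts[(v1, v2)][d_qs] += 1
    ppd = {}
    for key, by_distance in counts.items():
        total = sum(by_distance.values())
        running = 0
        for d in range(max_d + 1):
            running += by_distance.get(d, 0)
            ppd[key + (d,)] = running / total  # cumulative fraction
    return ppd


# Two hypothetical triplets that share v1 = v2 = <0,0,1>.
ppd = build_ppd([("abc", "abd", "abc"), ("abc", "abd", "abe")], max_d=2)
print(ppd[((0, 0, 1), (0, 0, 1), 0)], ppd[((0, 0, 1), (0, 0, 1), 1)])  # 0.5 1.0
```

Keeping the raw counts alongside the fractions, as the Count column on the slides suggests, would make later insertions easy to fold in.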

Page 33: Answering Approximate Queries Efficiently

Dynamic Maintenance: Frequency Table

Take insertion as an example

Cluster i, Pivot pi: the new string has edit vector [0,1,0] from pi

Edit Vector | # of Strings
<0,0,0>     | 4
<0,0,1>     | 12
<0,1,0>     | 7 -> 8
...         | ...

Page 34: Answering Approximate Queries Efficiently

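Dynamic Maintenance: PPD

[Figure: a new string s with ed(p,s)=2 is added to one of the clusters in the construction of PPD; q comes from the collection of q strings in the construction of PPD; v1 labels the q-p edge and v2 the p-s edge; the count of the matching row is incremented (+1) and the percentages are adjusted]

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
v1        | v2        | 0             | 32             | 8
v1        | v2        | 1             | 76             | 19
v1        | v2        | 2             | 88             | 22
v1        | v2        | 3             | 100            | 25
…         | …         | …             | …              | …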

Page 35: Answering Approximate Queries Efficiently

Improving Estimation Accuracy

• Reasons for estimation errors
  – Missed hits in PPD.
  – Inaccurate percentage entries in PPD.
• Improvement: use sample fuzzy predicates to analyze their estimation errors

Predicates   | Real | Estimate | Relative Error
P1(tommy,2)  | 500  | 750      | +50%
P2(james,3)  | 400  | 600      | +50%
P3(jordan,2) | 600  | 600      | 0%
P4(david,2)  | 500  | 300      | -40%

[Figure: histogram of Probability over Relative Error with bars at -40%, 0%, and +50%; two bars are labeled 25% and one is labeled 50%]
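The error analysis can be reproduced from the sample predicates. The (real, estimate) pairing below is an assumption: the slide's columns are scrambled in this transcript, and this is the pairing under which the four relative errors come out as -40%, 0%, and +50% (twice).

```python
from collections import Counter


def relative_error(estimate: float, real: float) -> float:
    """Signed relative error: (f_est - f_real) / f_real."""
    return (estimate - real) / real


# (real, estimate) for the four sample predicates; pairing is an assumption.
samples = [(500, 750), (400, 600), (600, 600), (500, 300)]

errors = [round(100 * relative_error(est, real)) for real, est in samples]
distribution = {err: count / len(errors)
                for err, count in Counter(errors).items()}

print(errors)        # [50, 50, 0, -40]
print(distribution)  # {50: 0.5, 0: 0.25, -40: 0.25}
```

The distribution is the empirical probability of each relative error over the sample workload, which is the raw material for an error-correction model.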

Page 36: Answering Approximate Queries Efficiently

Relative-Error Model

• Use the errors to build a model
• Use the model to adjust initial estimation

d: threshold; L: query string length; IE: Initial estimate

[Figure: decision tree that branches on d (d = 1, d = 2, d = 3, ...), then on L (1<=L<=5 or L>=6), then on IE (0<=IE<=40 or IE>=41); the leaves hold relative errors, including -15%, -20%, +17%, -8%, 1%, +12%, -23%, +25%]

Page 37: Answering Approximate Queries Efficiently

Outline

• Motivation: selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Overview
  – Proximity between strings
  – Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works

Page 38: Answering Approximate Queries Efficiently

Data

• Citeseer:
  – 71K author names
  – Length: [2,20], avg = 12
• Movie records from UCI KDD repository:
  – 11K movie titles
  – Length: [3,80], avg = 35
• Introduced duplicates:
  – 10% of records
  – # of duplicates: [1,20], uniform
• Final results:
  – Citeseer: 142K author names
  – UCI KDD: 23K movie titles

Page 39: Answering Approximate Queries Efficiently

Setting

• Test bed
  – PC: 2.4G P4, 1.2GB RAM, Windows XP
  – Visual C++ compiler
• Query workload:
  – Strings from the data
  – Strings not in the data
  – Results similar
• Quality measurements
  – Relative error: (f_est - f_real) / f_real
  – Absolute relative error: |f_est - f_real| / f_real

Page 40: Answering Approximate Queries Efficiently

Clustering Algorithms

[Bar chart comparing k-Medoids and Lexicographic clustering on Clustering Time (sec), Estimation Time (ms), and Average Absolute Relative Error (%); bar labels: 217, 45, 18, 120, 47, 37]

K-Medoids is better

Page 41: Answering Approximate Queries Efficiently

Quartile distribution of relative errors

[Figure: x-axis Relative Error (%), with ticks -100, -75, -50, -25, 0, 25, 50, 75, 100, Infinity; y-axis Percentage in Workload, from 0 to 1]

Data set 1. CLOSE_RAND; 1000 clusters

Page 42: Answering Approximate Queries Efficiently


Number of Clusters

Page 43: Answering Approximate Queries Efficiently

Effectiveness of Applying Relative-Error Model

[Bar chart: Average Absolute Relative Error (%) for Data set 1 and Data set 2, Without Error Correction vs With Error Correction; bar labels: 18, 25, 10, 12]

Page 44: Answering Approximate Queries Efficiently


Dynamic Maintenance

Page 45: Answering Approximate Queries Efficiently

Other work 1: Relaxing SQL queries with Selections/Joins

SELECT *
FROM Jobs J, Candidates C
WHERE J.Salary <= 95
AND J.Zipcode = C.Zipcode
AND C.WorkYear >= 5

Jobs

JID | Company   | Zipcode | Salary
r1  | Broadcom  | 92047   | 80
r2  | Intel     | 93652   | 95
r3  | Microsoft | 82632   | 120
r4  | IBM       | 90391   | 130
... | …         | …       | …

Candidates

CID | Zipcode | ExpSalary | WorkYear
s1  | 93652   | 120       | 3
s2  | 92612   | 130       | 6
s3  | 82632   | 100       | 5
s4  | 90391   | 150       | 1
... | …       | …         | …

Page 46: Answering Approximate Queries Efficiently

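Query Relaxation: Skyline!

[Figure: lattice of condition subsets over R, J, S: {}; R, J, S; RJ, RS, SJ; RSJ. A plot with axes J.Salary and C.WorkYear marks the original conditions J.Salary <= 95 and C.WorkYear >= 5]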

Page 47: Answering Approximate Queries Efficiently


Other work 2: Fuzzy predicates on attributes of mixed types

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year - 1977| <= 3;

Movies

Star           | Title                                        | Year | Genre
Keanu Reeves   | The Matrix                                   | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator                               | 1984 | Sci-Fi
Samuel Jackson | Goodfellas                                   | 1990 | Drama
…              | …                                            | …    | …

Page 48: Answering Approximate Queries Efficiently

Mixed-Typed Predicates

• String attributes: edit distance
• Numeric attributes: absolute numeric difference

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year - 1977| <= 3;

Page 49: Answering Approximate Queries Efficiently

MAT-tree: Intuition

• Indexing on two attributes is more effective than two separate indexing structures
• Numeric attribute: B-tree
• String attribute: tree-based index structure?

Page 50: Answering Approximate Queries Efficiently

MAT-tree: Overview

• Tree-based indexing structure:
  – Each node has MBR for both numeric attribute and string attribute
• Compressing strings as a “compressed trie” that fits into a limited space
• An edit distance between a string and compressed trie can be computed
• Experiments show that MAT-tree is very efficient

[Figure: example MAT-tree. Leaf nodes hold (string, year) entries: Spielberg 1946, Hanks 1956, Gibson 1956, Hanks 1957, Crowe 1964, Robert 1968, DiCaprio 1974, Roberrts 1977. Year MBRs above the leaves: <1946,1956>, <1956,1957>, <1964,1968>, <1974,1977>; year MBRs under the root: <1946,1957>, <1964,1977>. Each MBR is drawn with a * entry next to its year range]

Page 51: Answering Approximate Queries Efficiently

Conclusion

• It’s important to support answering approximate queries efficiently
• Our results so far:
  – SEPIA: provides accurate selectivity estimation for fuzzy string predicates
  – Relaxing SQL queries with selections and joins
  – MAT-tree: indexing structure supporting fuzzy queries with mixed-types predicates