Fast Indexes and Algorithms For Set Similarity Selection Queries M. Hadjieleftheriou A. Chandel N. Koudas D. Srivastava


Post on 18-Jan-2018



TRANSCRIPT

Page 1

Fast Indexes and Algorithms for Set Similarity Selection Queries

M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava

Page 2

Strings as sets

s1 = “Main St. Maine”:
• Words: ‘Main’ ‘St.’ ‘Maine’
• 3-grams: ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ …

s2 = “Main St. Main”:
• Words: ‘Main’ ‘St.’ ‘Main’

How similar are s1 and s2?
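As a rough illustration of the two decompositions above, the following Python sketch splits a string into word tokens and into overlapping 3-grams. The function names `words` and `qgrams` and the default q = 3 are assumptions made for this example. The slide does not say whether grams are padded or case-folded, so neither is done here.

```python
def words(s):
    """Split a string into whitespace-separated word tokens."""
    return s.split()


def qgrams(s, q=3):
    """Return all overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
s2 = "Main St. Main"

print(words(s1))       # word tokens of s1
print(qgrams(s1)[:7])  # first seven 3-grams of s1
print(words(s2))       # 'Main' occurs twice in s2
```

Both functions return lists rather than sets, so repeated tokens such as the second ‘Main’ in s2 are preserved for any later term-frequency computation.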

Page 3

TF/IDF weighted similarity

Inverse Document Frequency (idf):
• ‘Main’ is common
• ‘Maine’ is not
• $\mathrm{idf}(t) = \log_2\left[1 + N / \mathrm{df}(t)\right]$

Term Frequency (tf):
• ‘Main’ appears twice in s2

Similarity:
• Inner Product
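A minimal sketch of these three ingredients in Python follows. It assumes that N is the number of strings in the collection and df(t) is the number of strings containing token t, and that the weight vectors are scaled to unit length before taking the inner product. The slide states only "Inner Product", so the normalization is an assumption, and the three-string collection and all function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(collection):
    """idf(t) = log2(1 + N / df(t)) for every token in the collection."""
    n = len(collection)
    df = Counter(t for tokens in collection for t in set(tokens))
    return {t: math.log2(1 + n / df[t]) for t in df}


def tfidf_vector(tokens, idf):
    """Unit-length vector of tf(t) * idf(t) weights (normalization assumed)."""
    tf = Counter(tokens)
    vec = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()}


def tfidf_similarity(a, b, idf):
    """Inner product of the two normalized tf/idf vectors."""
    va, vb = tfidf_vector(a, idf), tfidf_vector(b, idf)
    return sum(w * vb[t] for t, w in va.items() if t in vb)


# Hypothetical toy collection of tokenized strings.
collection = [
    ["Main", "St.", "Maine"],
    ["Main", "St.", "Main"],
    ["Main", "Ave."],
]
idf = idf_table(collection)
print(idf["Main"], idf["Maine"])  # the common token gets the smaller idf
print(tfidf_similarity(collection[0], collection[1], idf))
```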

Page 4

Is TF important?

Information retrieval:
• Given a query string retrieve relevant documents

Relational databases:
• Given a query string retrieve relevant strings

In practice TF is small in many applications

Page 5

IDF similarity

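Query $q = \{t_1, \ldots, t_n\}$
Set $s = \{r_1, \ldots, r_m\}$
Length $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$

$I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 \,/\, \left(\mathrm{len}(s)\,\mathrm{len}(q)\right)$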

IDF is as good as TF/IDF in practice!
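The IDF measure can be sketched directly from its definition. The Python example below assumes the idf formula from the earlier slide, with N the number of sets and df(t) the number of sets containing t. The four-set collection and the function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(sets):
    """idf(t) = log2(1 + N / df(t)) over a dict of id -> token set."""
    n = len(sets)
    df = Counter(t for toks in sets.values() for t in set(toks))
    return {t: math.log2(1 + n / df[t]) for t in df}


def length(s, idf):
    """len(s): square root of the sum of squared idf weights of s."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(s)))


def idf_similarity(q, s, idf):
    """I(q, s): squared idf of shared tokens over the product of lengths."""
    q, s = set(q), set(s)
    shared = sum(idf[t] ** 2 for t in q & s)
    return shared / (length(q, idf) * length(s, idf))


# Hypothetical toy collection; duplicates vanish because these are sets.
sets = {
    1: {"main", "st.", "maine"},
    2: {"main", "st."},
    3: {"park", "ave"},
    4: {"main", "ave"},
}
idf = idf_table(sets)
print(idf_similarity(sets[1], sets[2], idf))
print(idf_similarity(sets[1], sets[3], idf))  # no shared tokens
```

Because term frequency is dropped, the second ‘Main’ of “Main St. Main” has no effect on the score.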

Page 6

How can I build an index?

Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$
Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$
So:
• Decompose strings into tokens
• Compute the idf of each token
• Create one inverted list per token

Sort lists by string id: Do a merge join
Sort lists by w: Run TA/NRA
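The index construction and the sort-by-id option can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' implementation: the function names are hypothetical, the index lives in memory, and query tokens that never occur in the collection are simply ignored.

```python
import heapq
import itertools
import math
from collections import Counter, defaultdict


def build_index(sets):
    """Return (idf, lists); lists[t] holds (sid, w(t, s)) sorted by sid."""
    sets = {sid: set(toks) for sid, toks in sets.items()}
    n = len(sets)
    df = Counter(t for toks in sets.values() for t in toks)
    idf = {t: math.log2(1 + n / df[t]) for t in df}
    lists = defaultdict(list)
    for sid in sorted(sets):  # appending in id order keeps each list sorted
        ln = math.sqrt(sum(idf[t] ** 2 for t in sets[sid]))
        for t in sets[sid]:
            lists[t].append((sid, idf[t] / ln))
    return idf, dict(lists)


def select_by_id(query, idf, lists, tau):
    """Merge the id-sorted query lists and keep sets with I(q, s) >= tau."""
    q = {t for t in query if t in idf}  # assumption: unknown tokens dropped
    len_q = math.sqrt(sum(idf[t] ** 2 for t in q))
    # Each stream yields (sid, w(t, s) * w(t, q)) in increasing sid order.
    streams = [[(sid, w * idf[t] / len_q) for sid, w in lists[t]] for t in q]
    result = {}
    merged = heapq.merge(*streams)
    for sid, group in itertools.groupby(merged, key=lambda e: e[0]):
        score = sum(c for _, c in group)
        if score >= tau:
            result[sid] = score
    return result


sets = {
    1: ["main", "st.", "maine"],
    2: ["main", "st.", "main"],
    3: ["park", "ave"],
    4: ["main", "ave"],
}
idf, lists = build_index(sets)
print(select_by_id(["main", "st.", "maine"], idf, lists, tau=0.6))
```

The merge join reads every entry of every query list, which is the cost that the weight-sorted alternatives try to avoid.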

Page 7

Example: Sort by id

Page 8

Example: Sort by w

NRA:
• Round robin list accesses
• Main memory hash table
• Computes lower and upper bounds per entry
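The three NRA ingredients listed above can be sketched in Python as a threshold selection. This is a simplified illustration and not the authors' code: the function name `nra_select`, the input layout (one list of `(w, sid)` pairs per query token, sorted by decreasing w) and the stopping rule are assumptions made for the example.

```python
from collections import defaultdict


def nra_select(lists, wq, tau):
    """lists[t]: [(w(t, s), sid)] sorted by decreasing w; wq[t] = w(t, q).

    Returns (ids with score >= tau, number of list entries read).
    """
    tokens = list(lists)
    pos = {t: 0 for t in tokens}
    seen = defaultdict(dict)  # hash table: sid -> {token: contribution}
    reads = 0
    while True:
        progressed = False
        for t in tokens:  # one round-robin pass over all lists
            if pos[t] < len(lists[t]):
                w, sid = lists[t][pos[t]]
                pos[t] += 1
                reads += 1
                seen[sid][t] = w * wq[t]
                progressed = True
        # Best contribution any still-unread entry of each list can make.
        frontier = {
            t: lists[t][pos[t]][0] * wq[t] if pos[t] < len(lists[t]) else 0.0
            for t in tokens
        }
        undecided = False
        for parts in seen.values():
            lower = sum(parts.values())
            upper = lower + sum(frontier[t] for t in tokens if t not in parts)
            if lower < tau <= upper:
                undecided = True
        # Stop when no unseen set can qualify and every candidate is decided.
        if not progressed or (sum(frontier.values()) < tau and not undecided):
            break
    answer = {sid for sid, parts in seen.items() if sum(parts.values()) >= tau}
    return answer, reads


# Hypothetical weights; within each list w decreases as len(s) grows.
lists = {
    "a": [(0.9, 1), (0.45, 2), (0.1, 3)],
    "b": [(0.6, 1), (0.3, 2)],
}
wq = {"a": 0.6, "b": 0.4}
print(nra_select(lists, wq, tau=0.7))
print(nra_select(lists, wq, tau=0.3))
```

A candidate is accepted once its lower bound reaches the threshold and rejected once its upper bound falls below it, so the sketch can stop before the lists are exhausted.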

Page 9

Semantic properties of IDF

Order Preservation:
• For all $t_1 \neq t_2$: if $w(t_1, s) < w(t_1, r)$, then $w(t_2, s) < w(t_2, r)$

Length Boundedness:
• Query q, set s, threshold $\tau$
  – $I(q, s) \geq \tau \Rightarrow \tau\,\mathrm{len}(q) \leq \mathrm{len}(s) \leq \mathrm{len}(q) / \tau$
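One way to see why the length bounds hold is the short derivation below, a sketch that writes $\tau$ for the similarity threshold. The shared tokens form a subset of both q and s, so their squared idf sum is at most the smaller of the two squared lengths.

```latex
\begin{align*}
I(q, s)
  &= \frac{\sum_{t \in q \cap s} \mathrm{idf}(t)^2}{\mathrm{len}(s)\,\mathrm{len}(q)}
   \le \frac{\min\{\mathrm{len}(s)^2,\ \mathrm{len}(q)^2\}}{\mathrm{len}(s)\,\mathrm{len}(q)}
   = \min\left\{\frac{\mathrm{len}(s)}{\mathrm{len}(q)},\ \frac{\mathrm{len}(q)}{\mathrm{len}(s)}\right\} \\
I(q, s) \ge \tau
  &\implies \frac{\mathrm{len}(s)}{\mathrm{len}(q)} \ge \tau
   \ \text{ and } \ \frac{\mathrm{len}(q)}{\mathrm{len}(s)} \ge \tau
   \implies \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \frac{\mathrm{len}(q)}{\tau}
\end{align*}
```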

Page 10

Improved NRA

Order Preservation determines if a given set appears in a list or not:
• List $t_i$: encounter s1, then s2
• List $t_k$: encounter s2 first, so s1 cannot appear in $t_k$

Length Boundedness restricts the search to a small portion of the lists
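Since idf(t) is constant within one inverted list, sorting a list by decreasing w is the same as sorting it by increasing len(s). Under that observation, the restricted portion of a list can be located with two binary searches. The Python sketch below is an illustration only: the name `length_window`, the `(len_s, sid)` entry layout and the sample values are assumptions, and `tau` stands for the similarity threshold.

```python
import bisect


def length_window(entries, len_q, tau):
    """Return the entries whose length lies in [tau * len_q, len_q / tau].

    entries: list of (len_s, sid) pairs sorted by increasing len_s.
    """
    lengths = [ln for ln, _ in entries]
    lo = bisect.bisect_left(lengths, tau * len_q)    # first len >= tau * len_q
    hi = bisect.bisect_right(lengths, len_q / tau)   # past last len <= len_q / tau
    return entries[lo:hi]


# Hypothetical list for one token.
entries = [(1.0, 7), (2.0, 3), (2.5, 9), (4.0, 1), (6.0, 4)]
print(length_window(entries, len_q=3.0, tau=0.75))
```

Entries outside the window cannot reach the threshold, so they never need to be read.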

Page 11

Something surprising

Lemma: NRA reads arbitrarily more elements than iNRA

Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property

Page 12

Any other strategies?

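NRA style is breadth-first
Try depth-first:
• Sort query lists in decreasing idf order
  – Let $q = \{t_1, \ldots, t_n\}$ and $\mathrm{idf}(t_1) > \mathrm{idf}(t_2) > \ldots > \mathrm{idf}(t_n)$
• Let $\lambda_i$ be the maximum length a set s in $t_i$ can have s.t. $I(q, s) \geq \tau$, assuming that s exists in all $t_k$ with $k > i$
  – $\lambda_i = \sum_{i \leq k \leq n} \mathrm{idf}(t_k)^2 \,/\, \left(\tau\,\mathrm{len}(q)\right)$
• $\lambda_i$ is a natural cutoff point
• $\lambda_1 > \lambda_2 > \ldots > \lambda_n$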
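The per-list cutoffs can be computed from the query alone. The Python sketch below assumes the cutoff for list i is the suffix sum of squared idf values divided by the threshold times len(q), which follows from requiring I(q, s) to reach the threshold for a set that appears only in lists i through n. The function name `cutoffs`, the symbol `tau` for the threshold and the sample idf values are assumptions made for the example.

```python
import math


def cutoffs(query_idfs, tau):
    """Return the cutoff length for each query list, in decreasing idf order."""
    idfs = sorted(query_idfs, reverse=True)
    len_q = math.sqrt(sum(x * x for x in idfs))
    suffix = sum(x * x for x in idfs)  # squared idf mass of lists i..n
    out = []
    for x in idfs:
        out.append(suffix / (tau * len_q))
        suffix -= x * x
    return out


lams = cutoffs([1.0, 3.0, 2.0], tau=0.5)
print(lams)
```

The first cutoff equals len(q) divided by the threshold, which is the general upper length bound, and every later cutoff is strictly smaller because positive squared idf mass is removed at each step.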

Page 13

Shortest-First

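Sort $q = \{t_1, \ldots, t_n\}$ in decreasing idf order
Let C be the candidate set
For $1 \leq i \leq n$:
• Skip to first entry with $\mathrm{len}(s) \geq \tau\,\mathrm{len}(q)$
• Compute the cutoff length $\lambda_i$
• Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$
• Repeat
  – s = pop next element from $t_i$
  – Maintain lower/upper bounds of entries in C
• Until $\mathrm{len}(s) > \max(\max \mathrm{len}\ C, \lambda_i)$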
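The loop above can be sketched in Python as follows. This is a simplified, in-memory illustration under several assumptions and not the authors' implementation: `tau` is the threshold, `lam` is the per-list cutoff length, query tokens missing from the collection are ignored, candidates are pruned only after each list has been processed, and a small `eps` guards floating-point comparisons. The collection is hypothetical.

```python
import bisect
import math
from collections import Counter, defaultdict


def shortest_first(sets, query, tau):
    """Return ids of sets s with I(query, s) >= tau, for 0 < tau <= 1."""
    sets = {sid: set(toks) for sid, toks in sets.items()}
    n = len(sets)
    df = Counter(t for toks in sets.values() for t in toks)
    idf = {t: math.log2(1 + n / df[t]) for t in df}
    len_s = {sid: math.sqrt(sum(idf[t] ** 2 for t in toks))
             for sid, toks in sets.items()}

    # One list per token, sorted by increasing len(s) (decreasing w).
    lists = defaultdict(list)
    for sid, toks in sets.items():
        for t in toks:
            lists[t].append((len_s[sid], sid))
    for entries in lists.values():
        entries.sort()

    q = sorted({t for t in query if t in idf}, key=lambda t: -idf[t])
    if not q:
        return set()
    len_q = math.sqrt(sum(idf[t] ** 2 for t in q))

    eps = 1e-12
    cand = {}        # candidate set C: sid -> score accumulated so far
    max_len_c = 0.0  # longest candidate currently in C
    for i, t in enumerate(q):
        remaining = sum(idf[u] ** 2 for u in q[i:])
        lam = min(remaining / (tau * len_q), len_q / tau)
        entries = lists[t]
        # Skip to the first entry with len(s) >= tau * len(q).
        j = bisect.bisect_left(entries, (tau * len_q - eps,))
        while j < len(entries):
            ls, sid = entries[j]
            if ls > max(max_len_c, lam) + eps:
                break  # no candidate and no new set can lie further down
            contrib = idf[t] ** 2 / (ls * len_q)
            if sid in cand:
                cand[sid] += contrib
            elif ls <= lam:  # short enough to still reach tau
                cand[sid] = contrib
                max_len_c = max(max_len_c, ls)
            j += 1
        # Drop candidates whose upper bound over the remaining lists < tau.
        rest = sum(idf[u] ** 2 for u in q[i + 1:])
        cand = {sid: sc for sid, sc in cand.items()
                if sc + rest / (len_s[sid] * len_q) >= tau - eps}
        max_len_c = max((len_s[sid] for sid in cand), default=0.0)
    return {sid for sid, sc in cand.items() if sc >= tau - eps}


sets = {
    1: ["main", "st.", "maine"],
    2: ["main", "st.", "main"],
    3: ["park", "ave"],
    4: ["main", "ave"],
}
print(shortest_first(sets, ["main", "st.", "maine"], tau=0.6))
print(shortest_first(sets, ["main", "st.", "maine"], tau=0.8))
```

Processing one list to its stopping depth before moving to the next keeps the bookkeeping to a single dictionary, in contrast with the per-round bound maintenance of the round-robin variants.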

Page 14

Comparison with NRA

Lemma: Let $q = \{t_1, \ldots, t_n\}$ and let d be the maximum depth to which SF descends over all lists. In the worst case iNRA will read (d – 1)(n – 1) more elements than SF

But surprisingly

Page 15

A hybrid strategy

Run iNRA normally
Use the cutoff $\lambda_i$ and max len C to stop reading from a particular list
• This guarantees that iNRA stops with or before SF

Drawback of NRA variants:
• Very high bookkeeping cost compared to SF

Page 16

Experiments

DBLP, IMDB and YellowPages datasets
Actors, movies, authors, businesses, etc.
Vary threshold, query size, query strings and mistakes
Test wall-clock time, pruning power
Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL based

Page 17

Wall-clock time vs. Threshold

Page 18

Wall-clock time vs. Query size

[Plot series: TA, NRA, Sort-by-id, iTA, SF]

Page 19

Space

Page 20

Conclusion

Proposed a simplified TF/IDF measure
Identified strong monotonicity properties
Used the properties to design efficient algorithms
SF works best overall in practice

• Achieves sub-second answers in most practical cases

Page 21

Q&A

Page 22

Pruning power vs. Threshold

Page 23

Pruning power vs. Query size

[Plot series: NRA, TA, iTA]