Fast Indexes and Algorithms for Set Similarity Selection Queries
M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava
Strings as sets
s1 = “Main St. Maine”:
• ‘Main’ ‘St.’ ‘Maine’
• ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ …
s2 = “Main St. Main”:
• ‘Main’ ‘St.’ ‘Main’
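As an aside, here is a minimal sketch of the two decompositions shown above, word tokens and 3-grams (the choice of q = 3 is an assumption; any tokenizer works):

```python
# Minimal tokenizers for the examples above; q = 3 is an arbitrary choice.
def word_tokens(s):
    return s.split()

def qgrams(s, q=3):
    # Sliding window of length q over the string.
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(word_tokens("Main St. Maine"))  # ['Main', 'St.', 'Maine']
print(qgrams("Main St. Maine"))       # ['Mai', 'ain', 'in ', 'n S', ...]
```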
How similar are s1 and s2?
TF/IDF weighted similarity
Inverse Document Frequency (idf):
• ‘Main’ is common, ‘Maine’ is not
• idf(t) = log2[1 + N / df(t)]
Term Frequency (tf):
• ‘Main’ appears twice in s2
Similarity:
• Inner product
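A hedged sketch of this measure; the exact normalization in the paper may differ, and `df` (token document frequencies) and `N` (collection size) are assumed inputs:

```python
import math
from collections import Counter

# idf(t) = log2(1 + N / df(t)); df maps token -> document frequency.
def idf(t, df, N):
    return math.log2(1 + N / df[t])

# Unnormalized inner product of tf*idf weights over the common tokens.
def tfidf_similarity(q_tokens, s_tokens, df, N):
    tf_q, tf_s = Counter(q_tokens), Counter(s_tokens)
    return sum(tf_q[t] * tf_s[t] * idf(t, df, N) ** 2
               for t in tf_q.keys() & tf_s.keys())
```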
Is TF important?
Information retrieval:
• Given a query string, retrieve relevant documents
Relational databases:
• Given a query string, retrieve relevant strings
In practice TF is small in many applications
IDF similarity
Query q = {t1, …, tn}; set s = {r1, …, rm}
Length: len(s) = (Σ_{t ∈ s} idf(t)²)^{1/2}
I(q, s) = Σ_{t ∈ q ∩ s} idf(t)² / (len(s) · len(q))
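As a sketch, the IDF measure above in code, assuming a precomputed `idf` map (an illustrative input, not part of the paper's interface):

```python
import math

# len(s) = (sum over t in s of idf(t)^2)^(1/2); strings are treated as sets.
def length(tokens, idf):
    return math.sqrt(sum(idf[t] ** 2 for t in set(tokens)))

# I(q, s) = sum over t in q ∩ s of idf(t)^2 / (len(s) * len(q))
def idf_similarity(q, s, idf):
    common = set(q) & set(s)
    return sum(idf[t] ** 2 for t in common) / (length(s, idf) * length(q, idf))
```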
IDF is as good as TF/IDF in practice!
How can I build an index?
Let w(t, s) = idf(t) / len(s)
Then I(q, s) = Σ_{t ∈ q ∩ s} w(t, s) · w(t, q)
So:
• Decompose strings into tokens
• Compute the idf of each token
• Create one inverted list per token
Sort lists by string id: do a merge join
Sort lists by w: run TA/NRA
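A minimal sketch of this construction, assuming `strings` maps string id to token list and `idf` maps token to weight (both hypothetical inputs):

```python
import math
from collections import defaultdict

# One inverted list per token, holding (string id, w(t, s)) postings
# with w(t, s) = idf(t) / len(s) as defined above.
def build_index(strings, idf):
    index = defaultdict(list)
    for sid, tokens in strings.items():
        len_s = math.sqrt(sum(idf[t] ** 2 for t in set(tokens)))
        for t in set(tokens):
            index[t].append((sid, idf[t] / len_s))
    for postings in index.values():
        postings.sort()                        # by id, for the merge join
        # postings.sort(key=lambda p: -p[1])   # by w, for TA/NRA instead
    return index
```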
Example: Sort by id
Example: Sort by w
NRA:
• Round-robin list accesses
• Main-memory hash table
• Computes lower and upper bounds per entry
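Below is a much-simplified NRA sketch for threshold selection, with cruder bookkeeping than the real algorithm; the list and weight layouts follow the definitions above:

```python
# `lists[i]` holds (sid, w(t_i, s)) postings sorted by decreasing w;
# `w_q[i]` is w(t_i, q); `tau` is the similarity threshold.
def nra(lists, w_q, tau):
    lb = {}                                   # sid -> score seen so far
    seen_in = {}                              # sid -> lists it was seen in
    pos = [0] * len(lists)
    while True:
        active = [i for i, l in enumerate(lists) if pos[i] < len(l)]
        if not active:
            break
        for i in active:                      # round-robin sorted accesses
            sid, w = lists[i][pos[i]]
            lb[sid] = lb.get(sid, 0.0) + w * w_q[i]
            seen_in.setdefault(sid, set()).add(i)
            pos[i] += 1
        # The next unread w in each list bounds its unseen contributions.
        fr = [l[p][1] if p < len(l) else 0.0 for p, l in zip(pos, lists)]
        def ub(sid):                          # upper bound for a candidate
            return lb[sid] + sum(fr[i] * w_q[i]
                                 for i in range(len(lists))
                                 if i not in seen_in[sid])
        unseen = sum(f * wq for f, wq in zip(fr, w_q))
        # Stop once neither unseen sets nor undecided candidates can qualify.
        if unseen < tau and all(lb[s] >= tau or ub(s) < tau for s in lb):
            break
    return [s for s, v in lb.items() if v >= tau]
```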
Semantic properties of IDF
Order Preservation:
• For all t1, t2: if w(t1, s) < w(t1, r), then w(t2, s) < w(t2, r)
Length Boundedness:
• Query q, set s, threshold τ
– I(q, s) ≥ τ ⇒ τ · len(q) < len(s) < len(q) / τ
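A tiny helper applying this property as a filter (non-strict bounds here, so no qualifying set is ever pruned):

```python
# If I(q, s) >= tau, then len(s) must fall inside this window.
def within_length_bounds(len_s, len_q, tau):
    return tau * len_q <= len_s <= len_q / tau
```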
Improved NRA
Order Preservation tells us whether a given set can appear in a list at all:
• In ti we encounter s1, then s2
• In tk we encounter s2 first, so s1 cannot appear in tk
Length Boundedness restricts the search to a small portion of the lists.
Something surprising
Lemma: NRA reads arbitrarily more elements than iNRA
Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property
Any other strategies?
NRA-style processing is breadth-first. Try depth-first instead:
• Sort the query lists in decreasing idf order
– Let q = {t1, …, tn} with idf(t1) > idf(t2) > … > idf(tn)
• Let λi be the maximum length a set s in ti can have s.t. I(q, s) ≥ τ, assuming that s exists in all tk, i ≤ k ≤ n
– λi = Σ_{i ≤ k ≤ n} idf(tk)² / (τ · len(q))
• λi is a natural cutoff point
• λ1 > λ2 > … > λn
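A sketch of the cutoff computation, assuming `idfs` already holds idf(t1) > … > idf(tn) for the query tokens; the name `cutoffs` is illustrative:

```python
# lambda_i = sum_{i <= k <= n} idf(t_k)^2 / (tau * len(q))
def cutoffs(idfs, len_q, tau):
    lam, suffix = [], 0.0
    for v in reversed(idfs):          # suffix sums of idf(t_k)^2
        suffix += v * v
        lam.append(suffix / (tau * len_q))
    return lam[::-1]                  # lambda_1 > lambda_2 > ... > lambda_n
```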
Shortest-First
Sort q = {t1, …, tn} in decreasing idf order
Let C be the candidate set
For 1 ≤ i ≤ n:
• Skip to the first entry with len(s) ≥ τ · len(q)
• Compute λi
• Let λi = min(λi, len(q) / τ)
• Repeat
– s = pop the next element from ti
– Maintain lower/upper bounds for the entries in C
• Until len(s) > max(maxlen(C), λi)
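A loose sketch of the SF loop under those definitions; candidate scoring and the lower/upper bound maintenance are elided, and `lam` comes from the cutoff sketch above:

```python
import bisect

# `lists[i]` holds (len(s), sid) postings sorted by increasing length,
# with lists ordered by decreasing idf of their tokens.
def shortest_first(lists, lam, len_q, tau):
    C = {}                                       # sid -> lists it occurs in
    maxlen_C = 0.0                               # longest candidate so far
    for i, postings in enumerate(lists):
        cutoff = min(lam[i], len_q / tau)
        # Skip straight to the first entry with len(s) >= tau * len(q).
        j = bisect.bisect_left(postings, (tau * len_q,))
        while j < len(postings):
            len_s, sid = postings[j]
            if len_s > max(maxlen_C, cutoff):
                break                            # nothing later can qualify
            if sid in C:
                C[sid].append(i)                 # update existing candidate
            elif len_s <= cutoff:
                C[sid] = [i]                     # new candidate inside cutoff
                maxlen_C = max(maxlen_C, len_s)
            j += 1
    return C
```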
Comparison with NRA
Lemma: Let q = {t1, …, tn} and let d be the maximum depth SF descends to over all lists. In the worst case, iNRA reads (d − 1)(n − 1) more elements than SF.
But surprisingly
A hybrid strategy
Run iNRA normally
Use λi and maxlen(C) to stop reading from a particular list
• This guarantees that iNRA stops with or before SF
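The extra test the hybrid adds inside iNRA's round-robin might look like this (a trivial sketch; the names are illustrative):

```python
# Once the next entry's length in list i exceeds max(maxlen(C), lambda_i),
# list i can be closed for good, so the hybrid stops with or before SF.
def should_close_list(next_len_s, maxlen_C, lam_i):
    return next_len_s > max(maxlen_C, lam_i)
```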
Drawback of NRA variants:
• Very high bookkeeping cost compared to SF
Experiments
DBLP, IMDB and YellowPages datasets
Actors, movies, authors, businesses, etc.
Vary threshold, query size, query strings and mistakes
Test wall-clock time, pruning power
Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, improved SQL-based
Wall-clock time vs. Threshold
Wall-clock time vs. Query size
[Plots: wall-clock time for TA, NRA, Sort-by-id, iTA and SF]
Space
Conclusion
Proposed a simplified TF/IDF measure
Identified strong monotonicity properties
Used the properties to design efficient algorithms
SF works best overall in practice
• Achieves sub-second answers in most practical cases
Q&A
Pruning power vs. Threshold
Pruning power vs. Query size
[Plots: pruning power for NRA, TA and iTA]