Top-k Set Similarity Joins

Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang

University of New South Wales and NICTA

Motivation

Data Cleaning

University                      City       State       Postal Code
University of New South Wales   Sydney     NSW         2052
University of Sydney            Sydney     NSW         2006
University of Melbourne         Melbourne  Victoria    3010
University of Queensland        Brisbane   Queensland  4072
University of New South Vales   Sydney     NSW         2052

More Applications

Obama Has Busy Final Day Before Taking Office as Bush Says Farewells

New York Times, Jan 19th, 2009

iht.com, Jan 20, 2009

(Traditional) Set Similarity Join

Each record is tokenized into a set.

Given a collection of records, the set similarity join problem is to find all pairs of records, <x,y>, such that sim(x,y) ≥ t.

Common similarity functions:

jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$

cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}} \ge t$

dice: $D(x, y) = \frac{2\,|x \cap y|}{|x| + |y|} \ge t$

Example: x = {A,B,C,D,E}, y = {B,C,D,E,F}

  jaccard: 4/6 = 0.67
  cosine: 4/5 = 0.8
  dice: 8/10 = 0.8

What if t is unknown beforehand?
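The three similarity functions can be checked on the example records with a few lines of Python. This is a minimal sketch that assumes records are already tokenized into Python sets; the function names are illustrative only.

```python
from math import sqrt

def jaccard(x, y):
    # |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def cosine(x, y):
    # |x ∩ y| / sqrt(|x| * |y|)
    return len(x & y) / sqrt(len(x) * len(y))

def dice(x, y):
    # 2 |x ∩ y| / (|x| + |y|)
    return 2 * len(x & y) / (len(x) + len(y))

x = set("ABCDE")
y = set("BCDEF")
print(round(jaccard(x, y), 2), round(cosine(x, y), 2), round(dice(x, y), 2))
```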

What if t is unknown beforehand?

Example – using jaccard similarity function

  w = {A, B, C, D, E}
  x = {A, B, C, E, F}
  y = {B, C, D, E, F}
  z = {B, C, F, G, H}

If t = 0.7: no results

If t = 0.4: <w,x>, <w,y>, <x,y>, <x,z>, <y,z> (too many results and long running time)

Return the top-k results ranked by their similarity values

  if k = 1: <w,x>
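The effect of the threshold choice in this example can be reproduced by brute force. The sketch below compares every pair; the helper names `threshold_join` and `topk_pairs` are hypothetical, and ties in the top-k ranking are broken by enumeration order, which is an assumption of this sketch.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}

def threshold_join(records, t):
    """All pairs with jaccard >= t, found by comparing every pair."""
    return [(a, b) for a, b in combinations(sorted(records), 2)
            if jaccard(records[a], records[b]) >= t]

def topk_pairs(records, k):
    """The k most similar pairs; the stable sort keeps enumeration order on ties."""
    scored = [(a, b, jaccard(records[a], records[b]))
              for a, b in combinations(sorted(records), 2)]
    scored.sort(key=lambda p: -p[2])
    return scored[:k]

print(threshold_join(records, 0.7))  # no pair reaches 0.7
print(threshold_join(records, 0.4))  # five pairs qualify
print(topk_pairs(records, 1))
```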

Top-k Set Similarity Join

Return top-k pairs of records, ranked by similarity scores.

Advantages over traditional similarity join

  without specifying a threshold
  output results progressively
  benefit interactive applications
  produces most meaningful results under limited resources or time constraints
  can be stopped at any time, but still guarantee sim(output results) ≥ sim(unseen pairs)

Straightforward Solution

Start from a certain t, repeat the following steps:

  answer traditional sim-join with t as threshold
  if # of results ≥ k, stop and output k results with highest sim
  else, decrease t

Example (jaccard, k = 2)

  w = {A, B, C, E}
  x = {A, B, C, E, F}
  y = {B, C, D, E, F}
  z = {B, C, F, G, H}

  t = 0.9: no result
  t = 0.8: <w,x>
  t = 0.7: <w,x> (results don’t change!)
  t = 0.6: <w,x>, <x,y>

Which thresholds shall we enumerate? 0.8, 0.6
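The loop above can be sketched as follows. The threshold join here is a brute-force stand-in for a real sim-join algorithm, and `straightforward_topk` is a hypothetical name; the point is only to show that each threshold triggers a full re-run.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

def threshold_join(records, t):
    """Brute-force stand-in for a traditional sim-join with threshold t."""
    out = []
    for a, b in combinations(sorted(records), 2):
        sim = jaccard(records[a], records[b])
        if sim >= t:
            out.append((a, b, sim))
    return out

def straightforward_topk(records, k, thresholds):
    """Re-run the join for each t (given in decreasing order) until k results exist."""
    for t in thresholds:
        results = threshold_join(records, t)
        if len(results) >= k:
            results.sort(key=lambda p: -p[2])
            return t, results[:k]
    return None, []

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
t, top2 = straightforward_topk(records, 2, [0.9, 0.8, 0.7, 0.6])
print(t, [(a, b) for a, b, _ in top2])
```

Four joins are run here although only the thresholds 0.8 and 0.6 change the result set.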

Naïve and Index-Based Algorithms

Naïve Algorithm: Compare every pair of objects -> O(n^2) time complexity

Index-based Algorithm [Sarawagi et al. SIGMOD04]:

Record Set -> Index Construction -> Candidate Generation -> Verification -> Result Pairs

inverted lists:

token   record_id
A       w x y
B       x z …
C       y z …

candidate pairs: <w,x>, <w,y>, <x,y>, <x,z>, …
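The index-based pipeline (index construction, candidate generation, verification) can be sketched in a single pass over the records. This is an illustrative simplification under the assumption that records are token sets, not the algorithm of the cited paper; `index_join` is a hypothetical name.

```python
from collections import defaultdict

def jaccard(x, y):
    return len(x & y) / len(x | y)

def index_join(records, t):
    """Threshold join that only verifies pairs sharing at least one token."""
    index = defaultdict(list)  # token -> ids of records indexed so far
    results = []
    for rid, tokens in records.items():
        # Candidate generation: probe the inverted list of every token.
        candidates = set()
        for token in tokens:
            candidates.update(index[token])
        # Verification: compute the exact similarity of each candidate pair.
        for cid in sorted(candidates):
            sim = jaccard(records[cid], tokens)
            if sim >= t:
                results.append((cid, rid, sim))
        # Index construction: make this record visible to later probes.
        for token in tokens:
            index[token].append(rid)
    return results

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
pairs = index_join(records, 0.6)
print([(a, b) for a, b, _ in pairs])
```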

Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]

Sort the tokens by a global ordering

  increasing order of document frequency

Only need to index the first few tokens (prefix) for each record

Example: jaccard t = 0.8 ⇒ |x ∩ y| ≥ 4 if |x| = |y| = 5

  x = [A B] E F G   (sorted; prefix = A B)
  y = [C D] E F G   (sorted; prefix = C D)

  upper bound O(x,y) = 3 < 4!

Must share at least one token in prefix to be a candidate pair

For jaccard, prefix length = |x| * (1 – t) + 1

  each t is associated with a prefix length
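A small sketch of the prefix filter on the example above. It assumes the integer form of the prefix length, |x| - ceil(t * |x|) + 1, and breaks document-frequency ties alphabetically; both are assumptions of this sketch, and the function names are hypothetical.

```python
from collections import Counter
from math import ceil

def sort_by_frequency(records):
    """Order each record's tokens by increasing document frequency (ties: by token)."""
    df = Counter(token for tokens in records.values() for token in tokens)
    return {rid: sorted(tokens, key=lambda tok: (df[tok], tok))
            for rid, tokens in records.items()}

def prefix_length(size, t):
    # Integer form of |x| * (1 - t) + 1; the epsilon guards against float noise.
    return size - ceil(t * size - 1e-9) + 1

def is_candidate(x_sorted, y_sorted, t):
    """A pair survives the filter only if the two prefixes share a token."""
    px = set(x_sorted[:prefix_length(len(x_sorted), t)])
    py = set(y_sorted[:prefix_length(len(y_sorted), t)])
    return bool(px & py)

records = {"x": set("ABEFG"), "y": set("CDEFG")}
ordered = sort_by_frequency(records)
print(ordered["x"], ordered["y"])
print(prefix_length(5, 0.8), is_candidate(ordered["x"], ordered["y"], 0.8))
```

With t = 0.8 the prefixes are A B and C D, so the pair is pruned without computing its similarity.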

Necessary Thresholds

Each prefix is associated with a threshold, i.e., the maximum possible similarity a record can achieve with other records.

What thresholds shall we enumerate? All the thresholds with which prefixes are associated!

Necessary thresholds

  If we change between different thresholds, there exists a database instance where the results will change
  extend prefix by one token, and consider the new t

  x = A    B    C
  t = 1.0  0.8  0.6
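For jaccard, the threshold attached to each prefix length can be computed directly. The formula below, (|x| - p + 1) / |x| for a prefix of length p, is inferred from the values shown on the slides (1.0, 0.8, 0.6 for a record of size 5) and should be read as an assumption of this sketch: if the first p - 1 tokens find no match, at most |x| - p + 1 tokens can overlap, and the union has at least |x| tokens.

```python
def prefix_thresholds(size):
    """Jaccard upper bound for each 1-based prefix length p of a record."""
    return [(size - p + 1) / size for p in range(1, size + 1)]

print(prefix_thresholds(5))       # thresholds for a record with 5 tokens
print(prefix_thresholds(4)[:2])   # first two thresholds for a record with 4 tokens
```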

Event-driven Model

Problem: repeated invocation of sim-join algorithm

  t is decreasing
  run sim-join algorithm in an incremental way

Prefix Event <x, A, t>

  initialize prefix length for each record as 1: <x, A, 1.0>
  for each prefix event:
    probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into temp results
    insert x into A’s inverted list
    extend prefix by one token
  maintain prefix events with a max-heap on t
  stop when t ≤ k-th temp result’s similarity

Thresholds of successive prefix events:

  x: 1.0 0.75
  y: 1.0 0.8 0.6
  z: 1.0 0.9 0.8 0.7
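The event-driven loop can be sketched with Python's `heapq`. This is a simplified illustration, not the authors' implementation: it assumes jaccard similarity, non-empty records whose tokens are already sorted by the global ordering, k ≥ 1, and a plain set that remembers every verified pair. The name `topk_join` and the tuple layouts are choices of this sketch.

```python
import heapq
from collections import defaultdict

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def topk_join(records, k):
    """records: id -> non-empty token list sorted by the global ordering; k >= 1."""
    # Max-heap of prefix events, stored as (-t, record id, 0-based position).
    events = [(-1.0, rid, 0) for rid in records]
    heapq.heapify(events)
    index = defaultdict(list)  # token -> ids inserted so far
    verified = set()           # pairs already verified (simple, memory-hungry)
    temp = []                  # min-heap holding the k best (sim, pair) so far
    while events:
        neg_t, rid, pos = heapq.heappop(events)
        # Stop once no unseen pair can beat the k-th temporary result.
        if len(temp) == k and -neg_t <= temp[0][0]:
            break
        tokens = records[rid]
        token = tokens[pos]
        # Probe the inverted list, verify new candidates, update temp results.
        for other in index[token]:
            pair = tuple(sorted((rid, other)))
            if pair in verified:
                continue
            verified.add(pair)
            heapq.heappush(temp, (jaccard(tokens, records[other]), pair))
            if len(temp) > k:
                heapq.heappop(temp)
        index[token].append(rid)
        # Extend the prefix by one token; the new event carries a lower t.
        if pos + 1 < len(tokens):
            next_t = (len(tokens) - (pos + 1)) / len(tokens)
            heapq.heappush(events, (-next_t, rid, pos + 1))
    return sorted(temp, key=lambda r: (-r[0], r[1]))

records = {
    "w": list("ABCE"),
    "x": list("ABCEF"),
    "y": list("BCDEF"),
    "z": list("BCFGH"),
}
result = topk_join(records, 2)
print(result)
```

On these four records with k = 2 the loop stops at the event with t = 0.6, because the second-best temporary result already has similarity 0.67.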

topk-join - Example

jaccard, k=2

  w = A B C E
  x = A B C E F
  y = B C D E F
  z = B C F G H

inverted list

token   record_id
A       w x
B       y z x w
C       y z

prefix event

  <x, B, 0.8>
  <y, C, 0.8>
  <z, C, 0.8>
  <w, B, 0.75>

temporary result

  (w,x) = 0.8
  (y,z) = 0.43
  (x,y) = 0.67

verified twice!

t = 0.6 ≤ 2nd temp result’s sim

Optimizations - Verification

In the above example, (w,x) and (y,z) have been verified twice

How to avoid repeated verification?

  memorize all verified pairs with a hash table: too much memory consumption
  check if this pair will be identified again when it is verified for the first time
  keep only those that will be identified again before the algorithm stops
  guarantee no pair will be verified twice

  x = A B D E F
  y = A C D E F
  t = 1.0 0.8 0.6

  if k-th temp result’s sim = 0.7: won’t be identified again!
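One way to decide whether a just-verified pair needs to be remembered is to look for the next token the two records share and compare its event threshold with the current k-th similarity. The sketch below is a reading of the slide's example, not the authors' exact test: it assumes jaccard thresholds of the form (|x| - pos) / |x| for a 0-based position, and that a pair resurfaces when the lower-threshold event of the shared token is processed.

```python
def threshold(size, pos):
    """Jaccard upper bound of the prefix event at 0-based position pos."""
    return (size - pos) / size

def may_be_identified_again(x, y, pos_x, pos_y, kth_sim):
    """x, y: token lists in global order; pos_x, pos_y: positions of the shared
    token at which the pair has just been identified."""
    later_in_y = {tok: j for j, tok in enumerate(y) if j > pos_y}
    for i in range(pos_x + 1, len(x)):
        j = later_in_y.get(x[i])
        if j is not None:
            # The pair resurfaces when the later (lower-t) of the two events runs,
            # which only happens if that t is still above the k-th similarity.
            return min(threshold(len(x), i), threshold(len(y), j)) > kth_sim
    return False  # no further shared token

x = list("ABDEF")
y = list("ACDEF")
print(may_be_identified_again(x, y, 0, 0, 0.7))
print(may_be_identified_again(x, y, 0, 0, 0.5))
```

Only the first later shared token needs checking, since thresholds decrease along the record. Because the k-th similarity can only grow, a `False` answer is final, while `True` means the pair should be kept in the hash table for now.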

Optimizations - Indexing

How to reduce inverted list size to save memory?

  x = A C D E F
  y = B C D E F

  identified by <x, C, 0.8> or <y, C, 0.8>, yet the maximum similarity they can achieve is 4/6 = 0.67

t is decreasing

  calculate the upper bound of similarity for future probings into inverted lists
  don’t insert into inverted list if this upper bound ≤ k-th temp result’s similarity
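For jaccard, one such upper bound can be derived from the event threshold alone. If a record is inserted at an event with threshold t, at most t * |x| of its tokens can still match, and every future probing record has an event threshold of at most t; maximizing jaccard under these two constraints gives t / (2 - t). This derivation is an assumption of this sketch and may differ from the bound used in the paper, but it reproduces the 4/6 of the example.

```python
def future_probe_upper_bound(t):
    """Upper bound on jaccard for pairs found by later probes of an entry
    inserted at event threshold t (assumed derivation: t / (2 - t))."""
    return t / (2 - t)

def should_index(t, kth_sim):
    """Skip the insertion when no future probe can beat the k-th temp result."""
    return future_probe_upper_bound(t) > kth_sim

print(round(future_probe_upper_bound(0.8), 2))
print(should_index(0.8, 0.7), should_index(0.8, 0.6))
```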

Experiment Settings

Algorithms

  topk-join
  pptopk: modified ppjoin [Xiao, et al. WWW08], a prefix-filter based approach, with t = 0.95, 0.90, 0.85...

Measure

  compare topk-join and pptopk (candidate size, running time)
  output results progressively

Dataset

dataset                          # of records   avg. record size
DBLP (author, title)             855k           14.0
TREC (author, title, abstract)   348k           130.1
TREC-3GRAM                       348k           868.5
UNIREF-3GRAM (protein seq.)      500k           372.9

Experiment Results

Thank you!

Questions?

Related Work

Index-based approaches

  S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
  C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.

Prefix-based approaches

  S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
  R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
  C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.

PartEnum

  A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
