
High-dimensional Similarity Join

Presented by

Yang Xia

Wongsodihardjo, Hariyanto

Wang Hao


Agenda

• Introduction
• Motivation
• R*-tree based join
• ε-kdb tree join
• Epsilon grid order join
• Summary


Introduction

Extracting knowledge from large multi-dimensional databases.

Many data mining algorithms need to process all pairs of points whose distance does not exceed a user-given parameter ε.

The operation of generating all such pairs is in essence a similarity join.

Data mining algorithms can be directly performed on top of a similarity join.
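To make the definition concrete, here is a minimal sketch of a similarity self-join as a brute-force loop over all pairs. The function name `similarity_self_join`, the sample points, and the choice of Euclidean distance are assumptions for illustration only; the algorithms in this talk exist precisely to avoid this quadratic scan.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return all index pairs (i, j), i < j, whose Euclidean distance is <= eps."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    ]

# Hypothetical 2-D sample data: two tight groups far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (0.9, 0.9), (0.95, 0.9)]
pairs = similarity_self_join(points, eps=0.15)
```

A clustering algorithm could then consume `pairs` directly instead of issuing one range query per point.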


Motivation

Conventional join algorithms such as nested-loop join, sort-merge join, and hash-based join cannot be directly applied to high-dimensional similarity joins.

Instead, make use of an index built on the high-dimensional data.


Efficient Processing of Spatial Joins Using R-trees

by T. Brinkhoff, H.-P. Kriegel, and B. Seeger

SIGMOD 1993

Presented by Hariyanto Wongsodihardjo

6 September 2001


Efficient Processing of Spatial Joins Using R-trees

Presenting a study of spatial join processing using R-trees, particularly R*-trees, which are among the most efficient members of the R-tree family.

Presenting several techniques for improving spatial join execution time with respect to CPU and I/O time.


R-tree Basic Algorithms

Let S be the query rectangle of a window query. The query starts at the root and computes all entries whose rectangles intersect S.

For these entries, the corresponding child nodes are read into main memory and the query is performed as in the root node.

The efficiency of queries depends on how well the R-tree assigns rectangles to nodes.
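The window query described above can be sketched as a short recursion. This is an illustrative in-memory model, not the paper's implementation: the `Node` class, the `(xmin, ymin, xmax, ymax)` rectangle layout, and the sample tree are all assumptions, and reading a page from disk is replaced by following a Python reference.

```python
from dataclasses import dataclass, field

def intersects(a, b):
    """Closed rectangles (xmin, ymin, xmax, ymax) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    is_leaf: bool
    # Leaf entries are (rect, object_id); inner entries are (rect, child Node).
    entries: list = field(default_factory=list)

def window_query(node, s, out=None):
    """Collect the ids of all leaf entries whose rectangle intersects s."""
    out = [] if out is None else out
    for rect, ref in node.entries:
        if intersects(rect, s):
            if node.is_leaf:
                out.append(ref)
            else:
                window_query(ref, s, out)  # descend only into qualifying children
    return out

# Hypothetical two-level tree.
leaf_a = Node(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
leaf_b = Node(True, [((5, 5, 6, 6), "c")])
root = Node(False, [((0, 0, 3, 3), leaf_a), ((5, 5, 6, 6), leaf_b)])
hits = window_query(root, (0.5, 0.5, 2.5, 2.5))
```

Here the second subtree is never visited because its bounding rectangle misses the window, which is the pruning the slide refers to.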


A First Approach of a Spatial Join for R-trees
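The slide's figure is not preserved in this transcript. As a rough sketch of what a first approach can look like, the code below traverses both trees synchronously and descends only into pairs of entries whose rectangles intersect. It assumes both trees have the same height; the `Node` class, the function name `spatial_join`, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field

def intersects(a, b):
    """Closed rectangles (xmin, ymin, xmax, ymax) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    is_leaf: bool
    # Leaf entries are (rect, object_id); inner entries are (rect, child Node).
    entries: list = field(default_factory=list)

def spatial_join(r, s, out=None):
    """Synchronized depth-first traversal of two trees of equal height."""
    out = [] if out is None else out
    for rect_r, ref_r in r.entries:
        for rect_s, ref_s in s.entries:
            if intersects(rect_r, rect_s):
                if r.is_leaf and s.is_leaf:
                    out.append((ref_r, ref_s))
                else:
                    spatial_join(ref_r, ref_s, out)
    return out

# Hypothetical trees R and S, each with one inner node and one leaf.
r_root = Node(False, [((0, 0, 2, 2),
                       Node(True, [((0, 0, 1, 1), "r1"), ((1.5, 1.5, 2, 2), "r2")]))])
s_root = Node(False, [((0.5, 0.5, 3, 3),
                       Node(True, [((0.5, 0.5, 1.2, 1.2), "s1"), ((2.5, 2.5, 3, 3), "s2")]))])
result = spatial_join(r_root, s_root)
```

Every pair of entries in two qualifying nodes is tested here, which is the CPU cost the tuning techniques on the following slides aim to reduce.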


CPU-Time and I/O-Time Tuning

CPU-Time Tuning
– Restricting the search space
– Spatial sorting and plane sweep

I/O-Time Tuning
– Local plane-sweep order with pinning
– Local z-order


Restricting the search space
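The figures for this step are missing from the transcript. One way to read the idea is that only entries overlapping the intersection of the two nodes' bounding rectangles can contribute to the join, so all other entries can be dropped before the pairwise test. The sketch below follows that reading; the function names and sample rectangles are assumptions.

```python
def intersection(a, b):
    """Overlap rectangle of a and b as (xmin, ymin, xmax, ymax), or None if disjoint."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    return (xmin, ymin, xmax, ymax) if xmin <= xmax and ymin <= ymax else None

def restrict(entries, window):
    """Keep only the entries whose rectangle overlaps the window."""
    return [(rect, ref) for rect, ref in entries if intersection(rect, window) is not None]

def join_restricted(r_mbr, r_entries, s_mbr, s_entries):
    """Join the entries of two nodes after restricting both to the common window."""
    window = intersection(r_mbr, s_mbr)
    if window is None:
        return []
    r_cand = restrict(r_entries, window)
    s_cand = restrict(s_entries, window)
    return [(a, b) for ra, a in r_cand for rb, b in s_cand
            if intersection(ra, rb) is not None]

# Hypothetical nodes: only r2 and s1 reach into the common window (3, 3, 4, 4).
r_entries = [((0, 0, 1, 1), "r1"), ((3, 3, 4, 4), "r2")]
s_entries = [((3.5, 3.5, 4.5, 4.5), "s1"), ((5, 5, 6, 6), "s2")]
pairs = join_restricted((0, 0, 4, 4), r_entries, (3, 3, 6, 6), s_entries)
```

The restriction costs one linear scan per node but can shrink the quadratic pairwise test considerably when the two nodes overlap only slightly.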


Spatial sorting and plane sweep
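The figures for this step are also missing. As a hedged illustration of the general technique, the sketch below sorts both rectangle sequences by their lower x-coordinate and sweeps a line from left to right, so each rectangle is compared only against rectangles of the other sequence whose x-range it can still reach. The function name and the sample rectangles are assumptions.

```python
def plane_sweep_pairs(rs, ss):
    """rs, ss: lists of (rect, id), rect = (xmin, ymin, xmax, ymax). Returns (r_id, s_id) pairs."""
    rs = sorted(rs, key=lambda e: e[0][0])
    ss = sorted(ss, key=lambda e: e[0][0])
    out = []
    i = j = 0
    while i < len(rs) and j < len(ss):
        if rs[i][0][0] <= ss[j][0][0]:
            # Sweep line stops at rs[i]; scan S rectangles that start before it ends.
            rect, rid = rs[i]
            k = j
            while k < len(ss) and ss[k][0][0] <= rect[2]:
                other, sid = ss[k]
                if rect[1] <= other[3] and other[1] <= rect[3]:  # y-ranges overlap
                    out.append((rid, sid))
                k += 1
            i += 1
        else:
            # Symmetric case: sweep line stops at ss[j].
            rect, sid = ss[j]
            k = i
            while k < len(rs) and rs[k][0][0] <= rect[2]:
                other, rid = rs[k]
                if rect[1] <= other[3] and other[1] <= rect[3]:
                    out.append((rid, sid))
                k += 1
            j += 1
    return out

# Hypothetical entries of two nodes.
rs = [((0, 0, 2, 2), "r1"), ((3, 0, 5, 1.5), "r2")]
ss = [((1, 1, 4, 3), "s1"), ((6, 0, 7, 1), "s2")]
pairs = plane_sweep_pairs(rs, ss)
```

No auxiliary sweep-line structure is kept here; the sorted order alone bounds how far each inner scan runs.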


Local plane-sweep order


Local plane-sweep order with pinning (SJ4)

The sequence for local plane-sweep order on example 2 is II, I, IV, III and the read schedule is <r1, s2, s1, r2, s2, r4, r3>.

The pinning algorithm is based on the degree of the rectangles of both entries. The degree of a rectangle E is the number of intersections between E and the rectangles that belong to entries of the other tree not yet processed. Thus for example 2 the read schedule is <r1, s2, r4, r3, s1, r2>.

The page whose rectangle has the maximal degree is pinned and the join is performed for the pinned page.


Local z-order (SJ5)


Compute the intersections between each rectangle of R and all rectangles of S.

Sort the resulting rectangles by the spatial location of their centers, using z-ordering.

Then pin pages as before. The sequence for Figure 7 is I, II, III, V, IV and the read schedule is <s1, r2, r1, s2, r4, r3, s3>.
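Sorting rectangles by the z-order of their centers can be sketched with bit interleaving. The grid resolution (`bits`), the assumption that coordinates lie in [0, extent], and the function names are illustrative choices, not details taken from the slides.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of two non-negative integer grid coordinates (x in even positions)."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)
        z |= ((y >> b) & 1) << (2 * b + 1)
    return z

def z_order(rects, extent=1.0, bits=16):
    """Sort rectangles (xmin, ymin, xmax, ymax) by the z-value of their centers."""
    scale = (1 << bits) - 1

    def key(rect):
        cx = (rect[0] + rect[2]) / 2
        cy = (rect[1] + rect[3]) / 2
        return z_value(int(cx / extent * scale), int(cy / extent * scale), bits)

    return sorted(rects, key=key)

# Hypothetical rectangles, one per quadrant of the unit square.
lower_left, lower_right = (0, 0, 0.2, 0.2), (0.8, 0, 1, 0.2)
upper_left, upper_right = (0, 0.8, 0.2, 1), (0.8, 0.8, 1, 1)
ordered = z_order([upper_right, lower_left, upper_left, lower_right])
```

Rectangles that are close in space tend to be close in this order, which is why consecutive pairs often reuse pages that are already in the buffer.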


I/O Performance Comparison


Conclusion

• The R*-tree join algorithm is straightforward.
• It improves CPU time by applying spatial sorting and restricting the search space.
• It improves I/O time by applying local plane-sweep order with pinning or local z-order.


High-dimensional similarity joins (ε-kdb tree)

Presented By

Yang Xia

Reference: K. Shim, R. Srikant, and R. Agrawal, High-dimensional similarity joins, Proc. 13th IEEE Internat. Conf. on Data Engineering, 1997, pp. 301-311.


Introduction

The ε-kdb tree is a main-memory data structure optimized for performing similarity joins. It uses the similarity distance limit ε as a parameter in building the tree.

Problem Definition
– Self-join
– Non-self-join
– Distance metric


Problems with Current Indices

• Number of neighboring leaf nodes
• Storage utilization
• Traversal cost
• Build time
• Skewed data


ε-kdb tree Definition

The coordinates of the points in each dimension lie between 0 and +1.

Start with a single leaf node. Whenever the number of points in a leaf node exceeds a threshold, the leaf node is split.

If the leaf node was at level i, the i-th dimension is used for splitting. The node is split into ⌊1/ε⌋ parts.
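The construction rule above can be sketched as an insertion routine. This is a simplified illustration under stated assumptions: the number of parts is taken to be ⌊1/ε⌋ with the last stripe absorbing any remainder, the class and function names are hypothetical, and children are created lazily in a dictionary rather than allocated up front.

```python
import math

class EpsKdbNode:
    def __init__(self, level):
        self.level = level    # depth of the node; also the dimension it splits on
        self.points = []      # points stored while the node is a leaf
        self.children = None  # dict: stripe index -> child node, once the node is split

def stripe(value, eps):
    """Index of the eps-wide stripe holding a coordinate in [0, 1]."""
    parts = math.floor(1 / eps)
    return min(int(value / eps), parts - 1)  # last stripe may be slightly wider than eps

def insert(node, point, eps, threshold):
    # Walk down through internal nodes, choosing the stripe in each node's dimension.
    while node.children is not None:
        idx = stripe(point[node.level], eps)
        node = node.children.setdefault(idx, EpsKdbNode(node.level + 1))
    node.points.append(point)
    # Split an overfull leaf, unless every dimension has already been used.
    if len(node.points) > threshold and node.level < len(point):
        pts, node.points, node.children = node.points, [], {}
        for p in pts:
            insert(node, p, eps, threshold)

# Hypothetical 2-D data, eps = 0.25, at most 2 points per leaf.
root = EpsKdbNode(level=0)
for p in [(0.1, 0.1), (0.15, 0.6), (0.2, 0.9), (0.8, 0.5)]:
    insert(root, p, eps=0.25, threshold=2)
```

Because every stripe is at least ε wide, a point can only have join partners in its own stripe or the two adjacent ones, which is what keeps the number of neighboring leaves small.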


Example of ε-kdb tree


Similarity Join using the ε-kdb tree


Memory Management

Main memory can hold all points within a 2ε distance on the first dimension.


Memory Management

Main memory cannot hold all points within a 2ε distance on the first dimension.


Design Rationale

Biased Splitting: The dimension used in the previous split is selected again for splitting as long as the length of that dimension in the bounding rectangle of each resulting leaf node is at least ε.

ε-Sized Splitting: When we split a node, we split it into ε-sized chunks.


Design Rationale

• Number of neighboring leaf nodes
• Space requirements
• Traversal cost
• Build time
• Skewed data


An example


Experiments

Synthetic Data Parameters


Experiments(1)


Experiments(2)


Experiments(3)


Conclusions

The ε-kdb tree reduces the number of neighboring leaf nodes that are considered for the join test.

The ε-kdb tree reduces the traversal cost of finding appropriate branches in the internal nodes.

The storage cost for internal nodes is independent of the number of dimensions.


Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data

Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel

SIGMOD 2001

Presented By Wang Hao

6 September 2001


Motivation

Indexing Based Join
– R-tree family, MuX (Multipage Index) tree, etc.
– Optimization conflict between CPU and IO [BK01]:
  Optimize CPU: fine-grained partitioning with page capacities of a few points.
  Optimize IO: large block sizes require fewer IO operations.

Join without Index
– Seeded tree, spatial hash join, ε-kdb tree, etc.
– Not scalable to large data sets.
  ε-kdb tree: cache size can be from 36% to 60% of database size.


Design Objectives

• Join without index.
• Optimize both CPU and IO.
• Scalable to large data sets of size well beyond 1 GB.


Basic Ideas

Define a sort order of the data: epsilon grid order.
– Lay an equidistant grid with cell length ε over the data space and compare the grid cells lexicographically.

Use external sort to sort the data.

Schedule the IO carefully during the join phase.


Epsilon Grid Order

• For two vectors p, q: p <ego q is true iff there exists a dimension d_i such that ⌊p_i/ε⌋ < ⌊q_i/ε⌋ and ⌊p_j/ε⌋ = ⌊q_j/ε⌋ for all j < i.

• Epsilon grid order is a strict order:

• irreflexive, asymmetric, and transitive.
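As a small illustration, the order can be expressed as a lexicographic comparison of grid-cell indices. The sketch assumes the cell index of a coordinate is its value divided by ε, rounded down; the function names and sample points are hypothetical, and an in-memory sort stands in for the external sort used on large data.

```python
import math

def cell(p, eps):
    """Grid cell of a point: one integer index per dimension."""
    return tuple(math.floor(x / eps) for x in p)

def ego_less(p, q, eps):
    """True iff p precedes q in epsilon grid order."""
    return cell(p, eps) < cell(q, eps)  # Python compares tuples lexicographically

def ego_sort(points, eps):
    """Sort points by epsilon grid order."""
    return sorted(points, key=lambda p: cell(p, eps))

# Hypothetical 2-D points with eps = 0.5.
p, q, r = (0.1, 0.9), (0.4, 0.2), (0.6, 0.0)
ordered = ego_sort([r, p, q], eps=0.5)
```

Two points that fall into the same cell are not ordered relative to each other, so `ego_less` returns False in both directions for them.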


Epsilon Grid Order (Cont.)

• A point q with q <ego p − [ε, …, ε]ᵀ cannot be a join mate of p, or of any point p' which is not p' <ego p.

• A point q with p + [ε, …, ε]ᵀ <ego q cannot be a join mate of p, or of any point p' which is not p <ego p'.


I/O Scheduling Using the Grid Order

• Unbuffered IO operations.
• Example: IO units in a 2-D data space.


I/O Scheduling (Cont.)

Illustration: Pairs of IO units that must be considered for join.

In the picture, each entry in the matrix stands for one pair of IO Units.

• IO thrashing effects


Scheduling Mode


Scheduling Algorithm


Joining Two IO Units

• Active dimensions
• Minlen: minimum length of sequences for the join.
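A simplified sketch of how two sorted sequences might be joined recursively is given below. It is an assumption-laden reading, not the paper's procedure: sequences are halved until both are no longer than `minlen`, and a pair of sequences is discarded when, in a dimension where the cell ranges of both are still known, the cells lie more than one grid step apart. The names `may_join` and `join_sequence_sketch` are hypothetical.

```python
import math

def cell(p, eps):
    """Grid cell of a point: one integer index per dimension."""
    return tuple(math.floor(x / eps) for x in p)

def may_join(seq_a, seq_b, eps):
    """Conservative test on two sequences sorted by grid cell; False only if no pair can match."""
    lo_a, hi_a = cell(seq_a[0], eps), cell(seq_a[-1], eps)
    lo_b, hi_b = cell(seq_b[0], eps), cell(seq_b[-1], eps)
    for d in range(len(lo_a)):
        # Cells more than one grid step apart are separated by more than eps.
        if lo_a[d] > hi_b[d] + 1 or lo_b[d] > hi_a[d] + 1:
            return False
        # Once first and last cell differ in d (the active dimension),
        # later dimensions are unconstrained, so stop testing.
        if lo_a[d] != hi_a[d] or lo_b[d] != hi_b[d]:
            return True
    return True

def halves(seq, minlen):
    """Split a sequence in two unless it is already short enough."""
    if len(seq) <= minlen:
        return [seq]
    mid = len(seq) // 2
    return [seq[:mid], seq[mid:]]

def join_sequence_sketch(seq_a, seq_b, eps, minlen, out):
    """Append all pairs (p, q), p in seq_a, q in seq_b, with distance <= eps. Requires minlen >= 1."""
    if not seq_a or not seq_b or not may_join(seq_a, seq_b, eps):
        return
    if len(seq_a) <= minlen and len(seq_b) <= minlen:
        out.extend((p, q) for p in seq_a for q in seq_b if math.dist(p, q) <= eps)
        return
    for a in halves(seq_a, minlen):
        for b in halves(seq_b, minlen):
            join_sequence_sketch(a, b, eps, minlen, out)

# Hypothetical data sets, sorted by grid cell before joining.
eps = 0.2
r = sorted([(0.05, 0.1), (0.3, 0.35), (0.9, 0.9)], key=lambda p: cell(p, eps))
s = sorted([(0.1, 0.2), (0.35, 0.3), (0.5, 0.95)], key=lambda p: cell(p, eps))
result = []
join_sequence_sketch(r, s, eps, minlen=1, out=result)
```

A larger `minlen` means fewer recursive calls but more distance computations per base case, which is the CPU trade-off the parameter controls.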


Optimization Potentials

• Use larger sequences to optimize IO.
• Optimize minlen for minimal CPU processing time.
• Compared with the ε-kdb tree and the MuX tree, no directory is constructed.
• The only space overhead is the recursion stack: O(log n).
• Other possible optimizations
– Modification of the sort order.
– Optimization of the recursion in join_sequence.


Experiments

• Settings:

• Buffer memory: 10% of database size.

• Use Euclidean distance.

• Distance parameter ε: determined using the algorithm in [SEKX98] so that the values are suitable for clustering.

• Compare with Nested-loop join, Z-ordering R-tree based join, and MuX tree based join.


Experiments on Uniformly Distributed 8-D Data.


Experiments on Real 16-D Data from CAD Database.


Conclusions and Future work

• Defines a strict order: epsilon grid order.
• A sophisticated scheduling algorithm.
• Several optimization techniques.
• Experiments show it outperforms competing algorithms for data sets of size up to 1.2 GB.
• Future work
– Parallel version of the join algorithm.
– Extend the cost model to the query optimizer.


Overall Summary

We have covered three join algorithms: R*-tree based join, ε-kdb tree join, and epsilon grid order join.

Specific algorithms have been proposed to perform the similarity join in each of the following cases:

– Both data sets have an index.
– Only one data set has an index.
– Neither data set has an index.

High-D similarity joins can be applied in data mining algorithms such as clustering.


Resource Links

Readings on High-dimensional Similarity Join
– http://www.comp.nus.edu.sg/~wanghao/cs6203/join.htm