1 Fast Computation of Sparse Datacubes Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan Ken :: Yiu Man Lung

Post on 13-Jan-2016

Page 1: Fast Computation of Sparse Datacubes

Vicky :: Cao Hui Ping
Sherman :: Chow Sze Ming
CTH :: Chong Tsz Ho
Ronald :: Woo Lok Yan
Ken :: Yiu Man Lung

Page 2: Content

Introduction
Existing Methods
Proposed Method:
  Partitioned-Cube
  Memory-Cube
Experiment
Conclusion

Page 3: Introduction

Datacube queries compute aggregates over database relations at a variety of granularities.

Cube by: Product, Country, Date

Aggregation function: Sum(Sales)

[Figure: a three-dimensional datacube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC) and Country (U.S.A, Canada, Mexico), plus a "sum" entry on each axis. The highlighted cell is labeled "Total annual sales of TV in U.S.A."]
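To make "a variety of granularities" concrete, here is a minimal sketch of a naive datacube: one aggregation pass per subset of the CUBE BY attributes, so 2^k group-bys for k attributes. The function name `naive_cube`, the "ALL" marker and the sample rows are illustrative assumptions, not taken from the slides, and this is the brute-force baseline rather than the method presented later.

```python
from collections import defaultdict
from itertools import combinations

def naive_cube(rows, dims, measure):
    """Compute SUM(measure) for every subset of dims (2**k group-bys)."""
    cube = {}
    for size in range(len(dims), -1, -1):
        for group_by in combinations(dims, size):
            totals = defaultdict(int)
            for row in rows:
                # Attributes outside the group-by are rolled up to "ALL".
                key = tuple(row[d] if d in group_by else "ALL" for d in dims)
                totals[key] += row[measure]
            cube.update(totals)
    return cube

# Hypothetical sales rows; the figures are illustrative only.
sales = [
    {"Product": "TV", "Country": "U.S.A", "Date": "1Qtr", "Sales": 10},
    {"Product": "TV", "Country": "U.S.A", "Date": "2Qtr", "Sales": 20},
    {"Product": "TV", "Country": "Canada", "Date": "1Qtr", "Sales": 5},
    {"Product": "PC", "Country": "U.S.A", "Date": "1Qtr", "Sales": 7},
]
cube = naive_cube(sales, ["Product", "Country", "Date"], "Sales")

# Total sales of TV in U.S.A over all dates.
tv_usa = cube[("TV", "U.S.A", "ALL")]
```

The cell `("TV", "U.S.A", "ALL")` plays the role of the highlighted "total annual sales of TV in U.S.A." cell in the figure.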

Page 4: Sparseness

A relation is sparse when its cardinality is a small fraction of the size of the cross product of the attribute domains.

The focus is on sparse relations, for which efficient datacube computation is important.
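The sparseness criterion can be expressed as a simple ratio. The sketch below assumes a hypothetical helper `density` and made-up sizes; the slides give no concrete numbers.

```python
from math import prod

def density(num_tuples, domain_sizes):
    """Fraction of the cross product of the attribute domains that is populated."""
    return num_tuples / prod(domain_sizes)

# Hypothetical figures: 1,000,000 tuples over domains of size 1000, 500 and 365.
d = density(1_000_000, [1000, 500, 365])
print(f"{d:.4%}")
```

A relation with a ratio this far below 1 is sparse in the sense used here: most combinations of attribute values never occur.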

Page 5: Problem

CUBE BY attributes with large domains

Large number of CUBE BY attributes

Existing methods are not efficient

We need something new: Partitioned-Cube

Page 6: Existing Methods

PIPESORT

Optimizes the overall cost by evaluating each path

Performs poorly when the relation is sparse

Lower bound on the number of sorts is $\binom{k}{\lceil k/2 \rceil}$, where k is the number of CUBE BY attributes

Large I/O cost for huge cuboids
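To see how quickly the lower bound on the number of sorts grows, the binomial coefficient C(k, ⌈k/2⌉) can be tabulated for a few values of k. The helper name `min_sorts` is an assumption made for this sketch.

```python
from math import comb

def min_sorts(k):
    """Lower bound on the number of sorts: C(k, ceil(k/2))."""
    return comb(k, (k + 1) // 2)

for k in (2, 4, 8, 10):
    print(k, min_sorts(k))
```

For k = 4 this gives 6, the same count that appears later for the Memory-Cube paths.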

Page 7: OVERLAP

Minimizes disk accesses by overlapping the computation of cuboids

But the I/O cost is at least quadratic in k, even given memory-sized partitions

Classifies the cuboids into the "Partition" and "SortRun" states

I/O depends on the partition size and the number of sorted runs

Page 8: Array-Based Algorithms

Partition the data and store the fragments in memory. Data compression may be applied

Allow direct access to the memory cells

For sparse data, the array fragments may not fit into memory. A more costly data structure would then be required

Page 9: Partitioned-Cube

Partitions large relations into fragments that fit into memory

Follows the recursive structure of datacubes

A sub-datacube is obtained by fixing each possible value of a CUBE BY attribute

Page 10: Partitioned-Cube (cont.)

Algorithm Partitioned-Cube(R, {B1, …, Bm}, A, G)

  R: a set of tuples
  {B1, …, Bm}: CUBE BY attributes
  A: attribute to be aggregated
  G: aggregate function
  F: finest-granularity datacube tuples
  D: remaining tuples

  Step 1: if (R fits in memory) then return Memory-Cube(R, {B1, …, Bm}, A, G)
  Step 2: scan R, partition on Bj in {B1, …, Bm} into R1, …, Rn
  Step 3: for (i = 1 to n)
            (Fi, Di) = Partitioned-Cube(Ri, {B1, …, Bm}, A, G)
  Step 4: let F = union of Fi's
  Step 5: let (F', D') = Partitioned-Cube(F, {B1, …, Bj-1, Bj+1, …, Bm}, A, G)
  Step 6: let D = union of F', D' and Di's
  Step 7: return (F, D)

Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6
HK       2000  6
HK       2001  8
HK       2001  7
HK       2002  7
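The seven steps can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: `memory_cube` is a naive stand-in for Memory-Cube, "fits in memory" is modeled as a row-count threshold `memory_limit`, the partitioning attribute is simply the first one with more than one distinct value, and the `fixed`/`free` split is my reading of "fixing each possible value of a CUBE BY attribute" (an attribute already partitioned on is never rolled up inside a fragment). All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"

def aggregate(rows, keep):
    """SUM the measure per group; key positions outside `keep` become ALL."""
    totals = defaultdict(int)
    for key, value in rows:
        rolled = tuple(v if i in keep else ALL for i, v in enumerate(key))
        totals[rolled] += value
    return list(totals.items())

def memory_cube(rows, fixed, free):
    """Naive in-memory stand-in for Memory-Cube (assumption, not the paper's routine)."""
    finest = aggregate(rows, set(fixed) | set(free))
    rest = []
    for size in range(len(free) - 1, -1, -1):
        for subset in combinations(free, size):
            rest.extend(aggregate(finest, set(fixed) | set(subset)))
    return finest, rest

def partitioned_cube(rows, fixed, free, memory_limit):
    """Return (F, D): finest-granularity tuples and all remaining tuples."""
    # Step 1: small inputs (or inputs that cannot be split) are cubed in memory.
    splittable = [b for b in free if len({key[b] for key, _ in rows}) > 1]
    if len(rows) <= memory_limit or not splittable:
        return memory_cube(rows, fixed, free)
    # Step 2: scan R and partition on the chosen attribute Bj.
    bj = splittable[0]
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[key[bj]].append((key, value))
    remaining = [b for b in free if b != bj]
    # Steps 3 and 4: cube each fragment with Bj fixed; F is the union of the Fi's.
    F, D_parts = [], []
    for part in partitions.values():
        f_i, d_i = partitioned_cube(part, fixed + [bj], remaining, memory_limit)
        F.extend(f_i)
        D_parts.extend(d_i)
    # Step 5: recurse on F with Bj dropped, which rolls Bj up to ALL.
    f_prime, d_prime = partitioned_cube(F, fixed, remaining, memory_limit)
    # Steps 6 and 7: D is the union of F', D' and the Di's.
    return F, f_prime + d_prime + D_parts

# The Country/Year/Sales relation from the slide; key positions: 0 = Country, 1 = Year.
R = [
    (("US", 2000), 10), (("US", 2001), 5), (("US", 2000), 8), (("US", 2002), 6),
    (("HK", 2000), 6), (("HK", 2001), 8), (("HK", 2001), 7), (("HK", 2002), 7),
]
F, D = partitioned_cube(R, [], [0, 1], memory_limit=4)
```

With `memory_limit=4` the eight-row relation is split on Country into a US and an HK fragment, mirroring the walkthrough on the following pages.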

Page 11: Partitioned-Cube (cont.)

STEP 1: Partition the large relation into fragments that fit into memory

R
Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6
HK       2000  6
HK       2001  8
HK       2001  7
HK       2002  7

R1
Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6

R2
Country  Year  Sales
HK       2000  6
HK       2001  8
HK       2001  7
HK       2002  7

Page 12: Partitioned-Cube (cont.)

STEP 2: Compute the tuples in the corresponding sub-datacube

R1
Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6

F1
Country  Year  Sales
US       2000  18
US       2001  5
US       2002  6

D1
Country  Year  Sales
US       ALL   29

Page 13: Partitioned-Cube (cont.)

STEP 3: In the same way, compute F2 and D2

R2
Country  Year  Sales
HK       2000  6
HK       2001  8
HK       2001  7
HK       2002  7

F2
Country  Year  Sales
HK       2000  6
HK       2001  15
HK       2002  7

D2
Country  Year  Sales
HK       ALL   28

Page 14: Partitioned-Cube (cont.)

Step 4: F = F1 ∪ F2

Step 5: call the function recursively on F to get F' and D'

F
Country  Year  Sales
US       2000  18
US       2001  5
US       2002  6
HK       2000  6
HK       2001  15
HK       2002  7

F'
Country  Year  Sales
ALL      2000  24
ALL      2001  20
ALL      2002  13

D'
Country  Year  Sales
ALL      ALL   57
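As a check on these tables, each F' row should equal the sum of the same-year rows of F1 and F2 from the two preceding pages, and the grand total in D' should agree with the per-country totals in D1 and D2:

```latex
\begin{align*}
\text{Sales}(\text{ALL}, 2000) &= 18 + 6 = 24 \\
\text{Sales}(\text{ALL}, 2001) &= 5 + 15 = 20 \\
\text{Sales}(\text{ALL}, 2002) &= 6 + 7 = 13 \\
\text{Sales}(\text{ALL}, \text{ALL}) &= 24 + 20 + 13 = 29 + 28 = 57
\end{align*}
```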

Page 15: Partitioned-Cube (cont.)

Step 6: $D = F' \cup D' \cup \bigcup_{i=1}^{n} D_i$

Step 7: return F, D

F
Country  Year  Sales
US       2000  18
US       2001  5
US       2002  6
HK       2000  6
HK       2001  15
HK       2002  7

D (the union of F', D', D1 and D2)

F'
Country  Year  Sales
ALL      2000  24
ALL      2001  20
ALL      2002  13

D'
Country  Year  Sales
ALL      ALL   57

D1
Country  Year  Sales
US       ALL   29

D2
Country  Year  Sales
HK       ALL   28

Page 16: Partitioned-Cube (cont.)

Recursively execute STEP 2 if there are more than 2 attributes

R1
Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6

F1
Country  Year  Sales
US       2000  18
US       2001  5
US       2002  6

D1
Country  Year  Sales
US       ALL   29

Page 17: Memory-Cube

Performs complex operations over each fragment independently

Minimizes the total number of paths in the search lattice

Shares the sort work

Computes the tuples in the corresponding sub-datacube

Computes the datacube tuples with the value ALL for the attributes

Page 18: Memory-Cube

Minimize the total number of paths in the search lattice

G(1) = D → ∅

G(2) = CD → C → ∅
       D

G(3) = BCD → BC → B → ∅
       BD → D
       CD → C

G(4) = ABCD → ABC → AB → A → ∅
       ABD → AD → D
       ACD → AC → C
       BCD → BC → B
       BD
       CD

Number of paths: $\binom{k}{\lceil k/2 \rceil}$; for k = 4 this is $\binom{4}{2} = 6$
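The listings G(1) to G(4) suggest a recursive construction: to go from G(j) to G(j+1), prefix every node of every path with the new attribute, append the last node of the original path to the prefixed copy, and keep what is left of the original path if anything remains. The sketch below is reconstructed from the listings on this page only and has not been checked against the paper's own pseudocode; the function name `paths` and the use of the empty string for the empty group-by are assumptions.

```python
from math import comb

def paths(attrs):
    """Build the paths G(k) over `attrs`, adding attributes from right to left."""
    g = [[attrs[-1], ""]]  # G(1): last attribute, then the empty group-by
    for x in reversed(attrs[:-1]):
        # Copy of G(j) with the new attribute x prefixed to every node.
        left = [[x + node for node in path] for path in g]
        # Each prefixed path absorbs the last node of its original path.
        merged = [l + [r[-1]] for l, r in zip(left, g)]
        # Original paths lose their last node; empty paths disappear.
        right = [r[:-1] for r in g if len(r) > 1]
        g = merged + right
    return g

g4 = paths("ABCD")
for path in g4:
    print(" -> ".join(node or "∅" for node in path))
```

For "ABCD" this reproduces the six paths of G(4), and every one of the 16 group-bys appears on exactly one path.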

Page 19: Memory-Cube

Share Sort Work

Reordering the sorting sequence can improve performance

The result of sorting on a shorter sort order can be reused when sorting on a longer one

E.g. S6 = CD, S3 = CAD. After sorting for S6, the entire relation does not have to be re-sorted for S3; only each block of tuples that shares a C value needs to be sorted independently in AD order.
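The S6/S3 example can be sketched directly: sort once on CD, then obtain the CAD order by sorting each block of equal C values on (A, D) rather than re-sorting the whole relation. The helper `resort_within_blocks` and the sample tuples are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical tuples (A, B, C, D); the values are illustrative only.
rows = [
    (2, "x", 1, 9), (1, "y", 2, 3), (1, "z", 1, 4),
    (2, "w", 2, 1), (1, "v", 1, 2), (2, "u", 1, 1),
]

def resort_within_blocks(sorted_rows, block_key, inner_key):
    """Sort each run of equal block_key values on inner_key; runs stay in place."""
    out = []
    for _, block in groupby(sorted_rows, key=block_key):
        out.extend(sorted(block, key=inner_key))
    return out

# S6 = CD: one full sort of the relation.
by_cd = sorted(rows, key=itemgetter(2, 3))
# S3 = CAD: reuse the C grouping, sort only inside each C block on (A, D).
by_cad = resort_within_blocks(by_cd, itemgetter(2), itemgetter(0, 3))
```

Each block sort touches only the tuples sharing one C value, so the work is several small sorts instead of one sort over the whole relation.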

Page 20: Memory-Cube

Sort the in-memory relation on the attributes of the path

Like PIPESORT, make a single scan through the data

Aggregate all small fragments on the path

Output the datacube result by combining these small fragments
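The single-scan idea can be sketched as follows: once the relation is sorted on a path's attribute order, one pass can maintain a running aggregate for every prefix of that order and emit a group whenever its prefix value changes. This is a simplified illustration under my own assumptions (the name `scan_path`, SUM as the aggregate, rows laid out as key fields followed by the measure), not the paper's exact procedure.

```python
def scan_path(sorted_rows, depth):
    """One pass over rows sorted on their first `depth` fields.

    Returns, for each prefix length n in 0..depth, the list of
    (prefix, SUM(measure)) groups in sorted order.
    """
    results = {n: [] for n in range(depth + 1)}
    current = [None] * (depth + 1)  # open group key per prefix length
    sums = [0] * (depth + 1)        # running SUM per prefix length
    for *key, value in sorted_rows:
        for n in range(depth + 1):
            prefix = tuple(key[:n])
            if current[n] is not None and current[n] != prefix:
                results[n].append((current[n], sums[n]))  # group closed: emit it
                sums[n] = 0
            current[n] = prefix
            sums[n] += value
    for n in range(depth + 1):
        if current[n] is not None:
            results[n].append((current[n], sums[n]))  # flush the last open groups
    return results

# The Country/Year/Sales relation from the earlier pages, sorted on (Country, Year).
relation = sorted([
    ("US", 2000, 10), ("US", 2001, 5), ("US", 2000, 8), ("US", 2002, 6),
    ("HK", 2000, 6), ("HK", 2001, 8), ("HK", 2001, 7), ("HK", 2002, 7),
])
cuboids = scan_path(relation, depth=2)
```

Here `cuboids[2]` is the (Country, Year) cuboid, `cuboids[1]` the per-country totals and `cuboids[0]` the grand total, all produced by the same scan of the path (Country, Year) → Country → ∅.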

Page 21: Solution Analysis

I/O cost is linear in k

CPU cost (in-memory sorts) is exponential in k

CPU cost should be dominated by the I/O time

Page 22: Experiment

CPU time vs. number of tuples

Exponential in the number of CUBE BY attributes

Page 23: Experiment

CPU, I/O and CPU usage % vs. number of CUBE BY attributes

CPU usage % drops for a large number of CUBE BY attributes

Page 24: Experiment

Share sorting work

CPU time is dominated by I/O time

Page 25: Conclusion

Partitioned-Cube computes datacubes quickly over large sparse relations

Minimizes the number of sort orders

Shows the advantages of sharing sort orders in datacube computation

First solution with linear I/O cost

Page 26: Reference

Kenneth A. Ross, Divesh Srivastava: Fast Computation of Sparse Datacubes. VLDB 1997: 116-125

Page 27: Q & A Section