
1

Fast Computation of Sparse Datacubes

Vicky :: Cao Hui Ping

Sherman :: Chow Sze Ming

CTH :: Chong Tsz Ho

Ronald :: Woo Lok Yan

Ken :: Yiu Man Lung

2

Content

Introduction

Existing Methods

Proposed Method: Partitioned-Cube, Memory-Cube

Experiment

Conclusion

3

Introduction

Datacube queries compute aggregates over database relations at a variety of granularities.

Cube by: Product, Country, Date

Aggregation function: Sum(Sales)

[Figure: a three-dimensional datacube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A, Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]
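To make the CUBE BY semantics concrete, here is a minimal sketch that computes every group-by naively, one grouping at a time. It is not the method proposed in these slides. The helper name `naive_cube` and the `ALL` marker are illustrative assumptions, and the rows are the Country/Year/Sales example relation used later in the deck.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # marker for an attribute that has been aggregated away


def naive_cube(rows, num_dims):
    """Compute SUM for every subset of the first num_dims columns.

    rows: tuples whose last field is the measure.
    Returns {group_key: sum}; aggregated-away attributes appear as ALL.
    """
    result = defaultdict(int)
    for size in range(num_dims + 1):
        for kept in combinations(range(num_dims), size):
            for row in rows:
                key = tuple(row[i] if i in kept else ALL for i in range(num_dims))
                result[key] += row[-1]
    return dict(result)


rows = [
    ("US", 2000, 10), ("US", 2001, 5), ("US", 2000, 8), ("US", 2002, 6),
    ("HK", 2000, 6), ("HK", 2001, 8), ("HK", 2001, 7), ("HK", 2002, 7),
]
cube = naive_cube(rows, 2)
```

With k CUBE BY attributes this makes 2^k full passes over the relation, which is the cost the methods in the following slides try to avoid.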

4

Sparseness

A relation is sparse when its cardinality is a small fraction of the size of the cross product of the attribute domains.

Sparse relations are of particular interest, so computing datacubes over them efficiently is important.

5

Problem

CUBE BY attributes with large domains

Large number of CUBE BY attributes

Existing methods are not efficient

We need something new: Partitioned-Cube

6

Existing Methods

PIPESORT

Optimizes the overall cost by evaluating each path

Poor performance when the relation is sparse

The lower bound on the number of sorts is $\binom{k}{\lceil k/2 \rceil}$, where k is the number of CUBE BY attributes

Large I/O cost for huge cuboids
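The sort lower bound grows quickly with k. The short sketch below tabulates it for small k; the function name `min_sorts` is an illustrative assumption, and the ceiling on k/2 only matters for odd k.

```python
from math import comb


def min_sorts(k):
    """Lower bound on the number of sorts for k CUBE BY attributes: C(k, ceil(k/2))."""
    return comb(k, (k + 1) // 2)


for k in range(1, 9):
    print(k, min_sorts(k))
```

For k = 4 this is the value 6 = 4C2 that appears on the Memory-Cube lattice slide.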

7

OVERLAP

Minimizes disk accesses by overlapping the computation of cuboids

But the I/O cost is at least quadratic in k, even given memory-sized partitions

Classifies the cuboids into "Partition" and "SortRun" states

I/O depends on the partition size and the number of sorted runs

8

Array-Based Algorithms

Partition the data and store the fragments in memory; data compression may be applied

Allow direct access to the memory cells

For sparse data, the array fragments may not fit into memory, in which case a more costly data structure is required

9

Partitioned-Cube

Partition the large relation into fragments that fit into memory

It follows the recursive structure of datacubes

A sub-datacube is obtained by fixing each possible value of a CUBE BY attribute

10

Partitioned-Cube (cont.)

Algorithm Partitioned-Cube(R, {B1, …, Bm}, A, G)
  R: a set of tuples
  {B1, …, Bm}: CUBE BY attributes
  A: attribute to be aggregated
  G: aggregate function
  F: finest-granularity datacube tuples
  D: remaining datacube tuples

Step 1: if (R fits in memory) then return Memory-Cube(R, {B1, …, Bm}, A, G)
Step 2: scan R and partition it on some Bj in {B1, …, Bm} into fragments R1, …, Rn
Step 3: for (i = 1 to n)
          (Fi, Di) = Partitioned-Cube(Ri, {B1, …, Bm}, A, G)
Step 4: let F = union of the Fi's
Step 5: let (F', D') = Partitioned-Cube(F, {B1, …, Bm} without Bj, A, G)
Step 6: let D = union of F', D' and the Di's
Step 7: return (F, D)

Example relation R:

Country  Year  Sales
US       2000  10
US       2001   5
US       2000   8
US       2002   6
HK       2000   6
HK       2001   8
HK       2001   7
HK       2002   7

11

Partitioned-Cube (cont.)

STEP 1: Partition the large relation into fragments that fit into memory

R:

Country  Year  Sales
US       2000  10
US       2001   5
US       2000   8
US       2002   6
HK       2000   6
HK       2001   8
HK       2001   7
HK       2002   7

R1:

Country  Year  Sales
US       2000  10
US       2001   5
US       2000   8
US       2002   6

R2:

Country  Year  Sales
HK       2000   6
HK       2001   8
HK       2001   7
HK       2002   7

12

Partitioned-Cube (cont.)

STEP 2: Compute the tuples in the corresponding sub-datacube

R1:

Country  Year  Sales
US       2000  10
US       2001   5
US       2000   8
US       2002   6

F1:

Country  Year  Sales
US       2000  18
US       2001   5
US       2002   6

D1:

Country  Year  Sales
US       ALL   29

13

Partitioned-Cube (cont.)

STEP 3: Compute F2 and D2 in the same way

R2:

Country  Year  Sales
HK       2000   6
HK       2001   8
HK       2001   7
HK       2002   7

F2:

Country  Year  Sales
HK       2000   6
HK       2001  15
HK       2002   7

D2:

Country  Year  Sales
HK       ALL   28

14

Partitioned-Cube (cont.)

Step 4: $F = F_1 \cup F_2$

Step 5: call this function recursively on F to get F' and D'

F:

Country  Year  Sales
US       2000  18
US       2001   5
US       2002   6
HK       2000   6
HK       2001  15
HK       2002   7

F':

Country  Year  Sales
ALL      2000  24
ALL      2001  20
ALL      2002  13

D':

Country  Year  Sales
ALL      ALL   57

15

Partitioned-Cube (cont.)

Step 6: $D = (F' \cup D') \cup \bigcup_{i=1}^{n} D_i$

Step 7: return F, D

F:

Country  Year  Sales
US       2000  18
US       2001   5
US       2002   6
HK       2000   6
HK       2001  15
HK       2002   7

D is the union of the following tables.

F':

Country  Year  Sales
ALL      2000  24
ALL      2001  20
ALL      2002  13

D':

Country  Year  Sales
ALL      ALL   57

D1:

Country  Year  Sales
US       ALL   29

D2:

Country  Year  Sales
HK       ALL   28
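The seven steps can be sketched in a few lines of Python and checked against the worked example. This is a minimal sketch under stated assumptions, not the authors' implementation. "Fits in memory" is modelled as a row-count limit, the aggregate is fixed to SUM, `memory_cube` is a naive stand-in for Memory-Cube, and each fragment keeps Bj as a fixed grouping attribute, which is how the example reads. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"


def aggregate(rows):
    """Group rows by all key columns and SUM the last field."""
    out = defaultdict(int)
    for *key, value in rows:
        out[tuple(key)] += value
    return dict(out)


def mask(row, cols):
    """Replace the key columns listed in cols with ALL."""
    return tuple(ALL if i in cols else v for i, v in enumerate(row[:-1])) + (row[-1],)


def memory_cube(rows, free):
    """Naive stand-in for Memory-Cube: F groups by every column as given,
    D holds every grouping with a non-empty subset of `free` set to ALL."""
    F, D = aggregate(rows), {}
    for size in range(1, len(free) + 1):
        for cols in combinations(free, size):
            D.update(aggregate(mask(r, cols) for r in rows))
    return F, D


def partitioned_cube(rows, free, memory_limit):
    """Return (F, D) as dicts mapping group keys to SUMs.

    rows: tuples (B1, ..., Bm, measure); free: column indices to cube over.
    """
    rows = list(rows)
    # Step 1: small enough (or nothing left to partition on): cube in memory.
    if len(rows) <= memory_limit or not free:
        return memory_cube(rows, free)
    # Step 2: partition on one attribute Bj (here simply the first free column).
    j, rest = free[0], free[1:]
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[j]].append(row)
    # Step 3: cube each fragment with Bj held fixed. Step 4: F = union of Fi.
    F, D = {}, {}
    for fragment in fragments.values():
        Fi, Di = partitioned_cube(fragment, rest, memory_limit)
        F.update(Fi)
        D.update(Di)
    # Step 5: project Bj out of F (set it to ALL) and recurse on the rest.
    projected = [mask(key + (value,), (j,)) for key, value in F.items()]
    F2, D2 = partitioned_cube(projected, rest, memory_limit)
    # Step 6: D = F' ∪ D' ∪ all Di. Step 7: return (F, D).
    D.update(F2)
    D.update(D2)
    return F, D


R = [
    ("US", 2000, 10), ("US", 2001, 5), ("US", 2000, 8), ("US", 2002, 6),
    ("HK", 2000, 6), ("HK", 2001, 8), ("HK", 2001, 7), ("HK", 2002, 7),
]
F, D = partitioned_cube(R, [0, 1], memory_limit=4)
```

With a limit of four rows, R is split into the US and HK fragments exactly as in the slides. The recursive call in Step 5 reads only F, not R, which is why each level of recursion touches less data than the one before it.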

16

Partitioned-Cube (cont.)

Execute STEP 2 recursively if there are more than 2 attributes

R1:

Country  Year  Sales
US       2000  10
US       2001   5
US       2000   8
US       2002   6

F1:

Country  Year  Sales
US       2000  18
US       2001   5
US       2002   6

D1:

Country  Year  Sales
US       ALL   29

17

Memory-Cube

Performs the complex operations over each fragment independently

Minimizes the total number of paths through the search lattice

Shares the sort work

Computes the tuples in the corresponding sub-datacube

Computes the datacube tuples with the value ALL for the attributes

18

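Memory-Cube

Minimize the total number of paths through the search lattice

G(1) = D → ∅

G(2) = CD → C → ∅
       D

G(3) = BCD → BC → B → ∅
       BD → D
       CD → C

G(4) = ABCD → ABC → AB → A → ∅
       ABD → AD → D
       ACD → AC → C
       BCD → BC → B
       BD
       CD

Number of paths in G(k): $\binom{k}{\lceil k/2 \rceil}$; for k = 4, 6 = 4C2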
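The paths listed for G(1) to G(4) follow a regular pattern: G(k+1) takes a copy of G(k) with the new attribute prefixed to every node, and each prefixed path takes over the last node of its original. The sketch below is inferred from the listed paths and is not quoted from the paper; the function name `paths` is an assumption, and the empty string stands for ∅.

```python
from math import comb


def paths(attrs):
    """Return the paths of G(k) for the attribute string attrs.

    Each path is a list of group-by strings; "" is the empty group-by.
    """
    if len(attrs) == 1:
        return [[attrs[0], ""]]
    first, rest = attrs[0], attrs[1:]
    extended, leftovers = [], []
    for path in paths(rest):
        # Left copy: prefix every node with the new attribute, then take
        # over the last node of the original path.
        extended.append([first + node for node in path] + [path[-1]])
        # Right copy: whatever remains of the original path, if anything.
        if len(path) > 1:
            leftovers.append(path[:-1])
    return extended + leftovers


for path in paths("ABCD"):
    print(" -> ".join(node or "∅" for node in path))
print(len(paths("ABCD")), comb(4, 2))
```

Every group-by appears on exactly one path, so one sort per path is enough to produce the whole datacube for a fragment.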

19

Memory-Cube

Share Sort Work

Reordering the sorting sequence can improve performance

The result of sorting on a shorter sort order can be reused for a longer one

E.g. S6 = CD, S3 = CAD. After sorting by S6, the entire relation does not have to be re-sorted for S3; only each block of tuples that shares a C value needs to be sorted independently in AD order.
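The S6 = CD, S3 = CAD case can be sketched as follows. The rows are hypothetical (A, C, D) tuples and `resort_within_blocks` is an illustrative helper name. The point is only that each block sharing a C value is sorted on its own, so the sorts work on short runs rather than on the whole relation.

```python
from itertools import groupby
from operator import itemgetter


def resort_within_blocks(sorted_rows, prefix_key, suffix_key):
    """Given rows already sorted by prefix_key, sort each block of equal
    prefix_key by suffix_key instead of re-sorting the whole relation."""
    out = []
    for _, block in groupby(sorted_rows, key=prefix_key):
        out.extend(sorted(block, key=suffix_key))
    return out


# Hypothetical rows with columns (A, C, D).
rows = [(2, 1, 1), (1, 2, 3), (1, 1, 2), (2, 2, 1), (1, 1, 1)]

by_cd = sorted(rows, key=itemgetter(1, 2))                  # S6 = CD
by_cad = resort_within_blocks(by_cd,
                              prefix_key=itemgetter(1),     # blocks share C
                              suffix_key=itemgetter(0, 2))  # sort by AD
```

The result is the same order a full sort on (C, A, D) would give.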

20

Memory-Cube

Sort the in-memory relation according to the attribute order of each path

Like PIPESORT, make a single scan through the sorted data

Aggregate all the small fragments along the path

Output the datacube result by combining these small fragments
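The single scan can be sketched for one path, here Country, Year → Country → ∅ on the example relation. This is a minimal sketch assuming SUM as the aggregate and rows sorted on the path's attribute order; `scan_path` is a hypothetical name. It keeps one running total per prefix length and emits a result whenever that prefix changes.

```python
def scan_path(sorted_rows, depth):
    """One pass over rows sorted on their first `depth` columns.

    Returns (prefix, sum) pairs for every prefix length from depth down to 0.
    """
    out = []
    current = None               # grouping key of the rows seen most recently
    totals = [0] * (depth + 1)   # totals[n]: running SUM for the prefix of length n
    for row in sorted_rows:
        key, value = tuple(row[:depth]), row[-1]
        if current is not None and key != current:
            # Length of the prefix shared by the old and the new key.
            common = 0
            while key[common] == current[common]:
                common += 1
            # Every longer prefix is complete: emit it and reset its total.
            for n in range(depth, common, -1):
                out.append((current[:n], totals[n]))
                totals[n] = 0
        current = key
        for n in range(depth + 1):
            totals[n] += value
    if current is not None:
        for n in range(depth, -1, -1):
            out.append((current[:n], totals[n]))
    return out


R = [
    ("US", 2000, 10), ("US", 2001, 5), ("US", 2000, 8), ("US", 2002, 6),
    ("HK", 2000, 6), ("HK", 2001, 8), ("HK", 2001, 7), ("HK", 2002, 7),
]
result = dict(scan_path(sorted(R, key=lambda r: r[:2]), depth=2))
```

Here the empty prefix `()` plays the role of ALL for every attribute, and `("US",)` corresponds to the tuple US, ALL in the slides.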

21

Solution Analysis

I/O cost is linear in k

CPU cost (in-memory sorts) is exponential in k

The CPU cost is expected to be dominated by the I/O time

22

Experiment

CPU time vs. number of tuples

Exponential in the number of CUBE BY attributes

23

Experiment

CPU, I/O and CPU usage % vs. number of CUBE BY attributes

CPU usage % drops for a large number of CUBE BY attributes

24

Experiment

Sharing the sort work

CPU time is dominated by I/O time

25

Conclusion

Partitioned-Cube computes datacubes over large sparse relations quickly

It minimizes the number of sort orders

It shows the advantages of sharing sort orders in datacube computation

It is the first solution with linear I/O cost

26

Reference

Kenneth A. Ross, Divesh Srivastava: Fast Computation of Sparse Datacubes. VLDB 1997: 116-125

27

Q & A Section
