Implementing Data Cubes Efficiently

1

Vicky :: Cao Hui Ping
Sherman :: Chow Sze Ming
CTH :: Chong Tsz Ho
Ronald :: Woo Lok Yan
Ken :: Yiu Man Lung


2

Content

Background
Introduction of Datacube
Problem defined
Lattice model
Greedy algorithm
    How to do?
    How good?
    How bad?
Evaluations
Conclusion

3

Background

DSS (Decision Support System)
    Gain competitiveness for business

Data warehouse
    Maintain historical information
    Use “Data cube” to summarize results
    Identify trends
    Performance issue (time and space)
    Need to reuse results (materialization of views)

4

Introduction of datacube

Datacube
    Dimensionality (number of GROUP-BYs)
    Aggregated data: values in each cell
    Dimension of datacube: detail of summary
    Higher dimension: higher detail

Common operations
    Drill down: look in more detail
    Roll up: look in less detail
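To make roll up and drill down concrete, here is a minimal sketch in Python. The fact rows, the `DIMS` tuple and the `group_by` helper are hypothetical names and data made up for illustration; they are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, country, quarter, sales).
ROWS = [
    ("TV", "U.S.A.", "1Qtr", 10),
    ("TV", "U.S.A.", "2Qtr", 20),
    ("TV", "Canada", "1Qtr", 5),
    ("PC", "U.S.A.", "1Qtr", 7),
]
DIMS = ("product", "country", "quarter")

def group_by(rows, dims):
    """Sum sales over the given subset of dimensions (one GROUP-BY)."""
    idx = [DIMS.index(d) for d in dims]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[3]
    return dict(out)

detailed = group_by(ROWS, ("product", "country"))  # drill down: more detail
rolled_up = group_by(ROWS, ("product",))           # roll up: less detail
```

Each call corresponds to one cell grouping of the cube; rolling up simply drops a dimension and re-aggregates.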

5

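What is a data cube?

[Figure: a data cube of sales with three dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A., Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]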

6

Our problem

Physically materialize the whole data cube
    Best query response
    Heavy pre-computing, large storage space
    i.e. time efficient but space inefficient

Materialize nothing
    Worst query response
    Dynamic query evaluation, less storage space
    i.e. space efficient but time inefficient

7

Problem on materialized views

Materialize only part of the data cube
    Balance the storage space and response time
    What is the best subset to materialize?
    Addressed in this paper

Source              Size         Time (sec)   Ratio
From cell itself    1            2.07         N/A
View (s)            10,000       2.38         0.000031
View (p,s)          800,000      20.77        0.000023
View (p,s,c)        6,000,000    226.23       0.000037
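The Ratio column appears to be the extra time per row relative to the 2.07 sec baseline, i.e. (time – 2.07) / size. That reading is inferred from the numbers rather than stated on the slide; the short check below, with hypothetical names, reproduces the column under that assumption.

```python
# (source, size, query time in seconds), copied from the table above.
MEASUREMENTS = [
    ("View (s)", 10_000, 2.38),
    ("View (p,s)", 800_000, 20.77),
    ("View (p,s,c)", 6_000_000, 226.23),
]
BASELINE = 2.07  # time when the answer is read from the cell itself

def ratio(size, seconds, baseline=BASELINE):
    """Extra time per unit of size, relative to the baseline lookup."""
    return (seconds - baseline) / size

ratios = {name: ratio(size, t) for name, size, t in MEASUREMENTS}
```

The three ratios stay within the same order of magnitude, which suggests that query time grows roughly linearly with the size of the view that is scanned.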

8

Data? View?

We use a data cube to model aggregate data.

So what do we use to model views?

Lattice!

9

Example of lattice diagram

8 possible groupings on the dimensions
    p for Part
    s for Supplier
    c for Customer
# of rows of data shown next to the grouping

psc 6M
pc 6M    ps 0.8M    sc 6M
p 0.2M    s 0.01M    c 0.1M
none 1

An example of Regular Lattice
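The 8 groupings are simply all subsets of the 3 dimensions. A minimal sketch of how such a regular lattice can be enumerated follows; the function names are hypothetical and chosen for illustration only.

```python
from itertools import combinations

DIMENSIONS = ("p", "s", "c")  # part, supplier, customer

def all_groupings(dims):
    """Every subset of the dimensions, from the full grouping down to 'none'."""
    return [
        frozenset(combo)
        for size in range(len(dims), -1, -1)
        for combo in combinations(dims, size)
    ]

def direct_children(view):
    """Views obtained by dropping exactly one dimension from `view`."""
    return [view - {d} for d in sorted(view)]

groupings = all_groupings(DIMENSIONS)
```

With n dimensions this yields 2^n views, which is why materializing every view quickly becomes expensive.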

10

≼ operator

Suppose c ≼ d
    The view d can be used to derive the view c
    c is a descendant of d (d is an ancestor of c) in the lattice diagram

Impose a partial order on the views

Usage on dimensions
    (part) ≼ (part, customer)
    (part) ⋠ (customer)

Usage within attribute value
    (year) ≼ (quarter) ≼ (month) ≼ (day)
    (year) ≼ (quarter) ≼ (week) ≼ (day)

[Figure: hierarchy over day, week, month, quarter and year]

An example of Irregular Lattice
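For views that differ only in their grouping dimensions, the ≼ test reduces to a subset check. The sketch below covers only that case (it does not model attribute hierarchies such as day and week), and `derivable_from` and `comparable` are hypothetical helper names.

```python
def derivable_from(c, d):
    """True when c ≼ d: view c groups by a subset of the dimensions of d."""
    return set(c) <= set(d)

def comparable(c, d):
    """In a partial order, some pairs of views are related in neither direction."""
    return derivable_from(c, d) or derivable_from(d, c)

part = {"part"}
customer = {"customer"}
part_customer = {"part", "customer"}
```

Here `derivable_from(part, part_customer)` holds, while `part` and `customer` are not comparable, matching the two examples on the slide.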

11

Regular lattices with equal domain size

Grouping attributes: A1, A2, …, An (domain size: r)
Attribute for aggregation: B
Efficient algorithm

m: # of rows in the top view
k = ⌈log_r m⌉

Strategy k, j, and n Space Time

Space-optimal M m2n

Time-optimal k>j (2rr/(r+1))n (2rr/(r+1))n

k<j and k ≤ n/2 m m2n

k<j and k > n/2 m nCj rj

12

The problem

The previous technique cannot be applied to irregular lattices
    Irregular lattices are common in data warehouses

The optimization of views for an irregular lattice is an NP-complete problem (inefficient!)
    Use a greedy algorithm, i.e. use heuristics to obtain an approximate solution

13

Greedy algorithm

Being as greedy as possible in each step!!

Simple example: use the smallest number of coins to pay 50 cents

Suppose we have many coins of 20 cents, 10 cents and 5 cents.

14

How to be greedy?

Common sense approach:
    Select the largest coin: 20 cents
    Select the largest coin again: 20 cents
    Remaining amount = 50 – 20 – 20 = 10 cents
    We cannot select the largest coin again.
    We choose the 2nd largest coin, 10 cents, instead.

Only 3 coins are needed! Optimal solution!
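The coin steps above can be sketched in a few lines of Python. The function name `greedy_change` is hypothetical and used only for this illustration.

```python
def greedy_change(amount, coins):
    """Pay `amount` by repeatedly taking the largest coin that still fits."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)
            amount -= coin
    if amount != 0:
        raise ValueError("amount cannot be paid exactly with these coins")
    return picked

coins_used = greedy_change(50, [20, 10, 5])  # [20, 20, 10]
```

Greedy is not optimal for every coin system: with hypothetical coins of 25, 10 and 1, `greedy_change(30, [25, 10, 1])` uses 6 coins although three 10s would do.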

15

Definition of “benefit of view”

C(v) denotes the cost of view v
B(v,S) denotes the benefit of a view v relative to a set of views S

For each w ≼ v:
    Let u be the view of least cost in S such that w ≼ u
    B_w = max{ C(u) – C(v), 0 }

B(v,S) = ∑_{w ≼ v} B_w
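As a minimal sketch, the definition can be written directly in Python. The costs are those of the eight-view example used later in these slides; the edges in `CHILDREN` are an assumption, inferred from how many views each candidate is said to help in that example. All names are hypothetical.

```python
# Assumed example lattice: cost of each view and the views directly below it.
# The edges are inferred from the benefit counts in the worked example.
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
CHILDREN = {
    "a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
    "d": ["g"], "e": ["g", "h"], "f": ["h"], "g": [], "h": [],
}

def descendants(v):
    """All views w with w ≼ v, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def benefit(v, selected):
    """B(v, S): total saving over every w ≼ v if v is added to S."""
    total = 0
    for w in descendants(v):
        # u: the cheapest already-selected view that w can be derived from.
        cheapest = min(COST[u] for u in selected if w in descendants(u))
        total += max(cheapest - COST[v], 0)
    return total

first_round = {v: benefit(v, {"a"}) for v in "bcdef"}
```

`selected` must contain the top view, so that every w has at least one candidate u.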

16

Greedy algorithm

In each step
    Select the view with the most benefit
    Add it to the result

Algorithm
    S = {top view};
    for i = 1 to k {
        select view v not in S such that B(v,S) is maximized;
        S = S union {v};
    }
    return S;
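The pseudocode above can be sketched as runnable Python. The lattice is the eight-view example from these slides, encoded as a hypothetical `DESC` table listing, for each view, the views derivable from it; that table is an assumption inferred from the worked example, not something the slides spell out.

```python
# Assumed example lattice (hypothetical encoding): view costs, and for each
# view the set of views derivable from it, itself included.
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
DESC = {
    "a": set("abcdefgh"), "b": set("bdegh"), "c": set("cefgh"),
    "d": set("dg"), "e": set("egh"), "f": set("fh"), "g": {"g"}, "h": {"h"},
}

def benefit(v, selected):
    """B(v, S): sum over w ≼ v of max(C(u) - C(v), 0), u cheapest in S above w."""
    total = 0
    for w in DESC[v]:
        cheapest = min(COST[u] for u in selected if w in DESC[u])
        total += max(cheapest - COST[v], 0)
    return total

def greedy_select(top, k):
    """Start from the top view and add k more views, best benefit first."""
    selected = [top]
    for _ in range(k):
        candidates = [v for v in COST if v not in selected]
        selected.append(max(candidates, key=lambda v: benefit(v, selected)))
    return selected

chosen = greedy_select("a", 2)  # ["a", "b", "f"]
```

Each round recomputes every benefit against the current selection, which is why a view that looked weak in the first round can win a later one.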

17

Selecting the first view

After selecting coins, let us get back to our problem: selecting views.

We must materialize the top view, i.e. the view grouping by all attributes
    Cannot be constructed from other views
    Avoid going to the raw data

18

Selecting k more views

Space is limited! Suppose we can only select k more views.

For each view which is not yet selected, calculate the benefit of materializing it.

Pick the one with maximum benefit!!!

Let’s set k = 2 for the example.

19

Example

[Figure: example lattice of views, with the cost of each view]

a 100
b 50    c 75
d 20    e 30    f 40
g 1    h 10

E.g. the cost of constructing view b from view a is 100.

If we choose to materialize b, the new cost of constructing view b is 50.

20

First round

[Figure: the example lattice with view costs]

Notice that not only b, but also d, e, g and h can be calculated from b

So the total benefit is (100 – 50) x 5 = 250

21

Continue…

Similarly, the benefit of materializing c is (100 – 75) x 5 = 125.

[Figure: the example lattice with view costs]

View    Benefit
b       250
c       125

22

Not yet finished…

For e, benefit = (100 – 30) x 3 = 210.

[Figure: the example lattice with view costs]

View    Benefit
b       250
c       125
e       210

23

Let’s choose b!

[Figure: the example lattice with view costs]

For d and f, benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120 respectively.

View    Benefit
b       250
c       125
d       160
e       210
f       120

24

Next round?

It seems we should choose e, as it has the second largest benefit.

Let’s see what will happen in the second round.

View    Benefit
b       250
c       125
d       160
e       210
f       120

25

Second round!

[Figure: the example lattice with view costs, b already materialized]

Now, only c and f benefit if we materialize c (since e, g and h can be calculated more efficiently from b).

Benefit = (100 – 75) x 2 = 50

View    Benefit
c       50

26

How about choosing f?

[Figure: the example lattice with view costs, b already materialized]

If we choose f, we find that h can be calculated more cheaply from f than from b.

Benefit = (100 – 40) + (50 – 40) = 70

View    Benefit
c       50
f       70

27

Easy to work out others

Benefit of d = (50 – 20) x 2 = 60
Benefit of e = (50 – 30) x 3 = 60
Benefit of g = 50 – 1 = 49
Benefit of h = 50 – 10 = 40

[Figure: the example lattice with view costs]

28

Observation

In the first round, the benefit of choosing f (only 120) is far from the best choice (250).

But in the second round, choosing f gives the maximum benefit!

1st round
View    Benefit
b       250
c       125
d       160
e       210
f       120

2nd round
View    Benefit
c       50
d       60
e       60
f       70
g       49
h       40

29

Simple? Optimal?

Trade off again! This simple algorithm is not optimal in all cases!

Consider the following case…

30

Bad example

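[Figure: lattice with top view a (cost 200) and, below it, views b (cost 100), c (cost 99) and d (cost 100); under these are groups of 20 nodes, each group with a total cost of 1000]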

31

Bad example

[Figure: the same lattice, with c selected]

Choose c
Benefit = (200 – 99) x (1 + 20 + 20) = 4141 = maximum

32

Bad example

[Figure: the same lattice, with c selected]

Now choose either b or d (same benefit)

33

Bad example

[Figure: the same lattice, highlighting the views that remain expensive]

How about these? Very expensive!!!

34

Optimal solution should be…

[Figure: the same lattice, with the optimal selection highlighted]

Only c is a little bit expensive.
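The arithmetic behind this bad example can be checked with a short sketch. It rests on assumptions the surviving slide text does not spell out: that there are four groups of 20 views (below b only, below both b and c, below both c and d, below d only), that only b, c and d are considered as candidates, and that the better selection is b together with d. All names are hypothetical.

```python
TOP_COST = 200
COST = {"b": 100, "c": 99, "d": 100}
GROUP = 20  # views per group; assumed groups: b only, b and c, c and d, d only

# Each of b, c, d can answer itself plus two groups of 20 views.
covers = 1 + 2 * GROUP

# First-round benefits: c narrowly beats b and d, so greedy picks c.
first_round = {v: (TOP_COST - COST[v]) * covers for v in COST}

# Second round: b (or d) no longer helps the group it shares with c.
greedy_total = first_round["c"] + (TOP_COST - COST["b"]) * (1 + GROUP)

# Alternative selection: b and d, which between them cover all four groups.
optimal_total = first_round["b"] + first_round["d"]

ratio = greedy_total / optimal_total
```

Under these assumptions greedy collects 6241 against 8200, about 76% of the better selection.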

35

Some theoretical result

It can be proved that the greedy algorithm obtains at least (e – 1) / e (about 63%) of the benefit of the optimal selection.
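A quick numeric check of the bound follows. It assumes the k-dependent form of the guarantee, 1 – ((k – 1) / k)^k for k selected views, of which (e – 1) / e is the limit as k grows; the slide states only the limit, and the function name is hypothetical.

```python
import math

def greedy_guarantee(k):
    """Assumed lower bound on greedy benefit / optimal benefit for k views."""
    return 1 - ((k - 1) / k) ** k

limit = (math.e - 1) / math.e  # about 0.632

bounds = {k: greedy_guarantee(k) for k in (1, 2, 3, 10, 100)}
```

For k = 2 the assumed bound is 0.75, and it decreases towards the limit of about 0.632 as k increases.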

36

Extensions (1)

Problem
    The views in a lattice are unlikely to have the same probability of being requested in a query.

Solution
    We can weight each benefit by its probability.
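One way to read this extension is to multiply each per-view saving B_w by the probability that w is queried. The sketch below takes that reading as an assumption; the lattice encoding, the probabilities and the function name are all hypothetical.

```python
# Assumed example lattice: view costs and, per view, the views derivable from it.
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
DESC = {
    "a": set("abcdefgh"), "b": set("bdegh"), "c": set("cefgh"),
    "d": set("dg"), "e": set("egh"), "f": set("fh"), "g": {"g"}, "h": {"h"},
}

def weighted_benefit(v, selected, prob):
    """Benefit of v where each saving is weighted by the query probability of w."""
    total = 0.0
    for w in DESC[v]:
        cheapest = min(COST[u] for u in selected if w in DESC[u])
        total += prob[w] * max(cheapest - COST[v], 0)
    return total

uniform = {v: 1.0 for v in COST}               # reduces to the plain benefit
skewed = {v: 0.0 for v in COST}                # hypothetical workload:
skewed.update({"f": 0.5, "h": 0.5})            # only f and h are ever queried

b_score = weighted_benefit("b", {"a"}, skewed)
f_score = weighted_benefit("f", {"a"}, skewed)
```

With the skewed workload f scores higher than b in the first round, the opposite of the unweighted ranking.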

37

Extensions (2)

Problem
    Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views.

Solution
    We can consider the “benefit of each view per unit space”.
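A minimal sketch of the space-bounded variant follows. It assumes that the space taken by a view equals its cost, and that the budget covers only the views added after the top view; the lattice encoding and function names are hypothetical.

```python
# Assumed example lattice: view costs and, per view, the views derivable from it.
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
DESC = {
    "a": set("abcdefgh"), "b": set("bdegh"), "c": set("cefgh"),
    "d": set("dg"), "e": set("egh"), "f": set("fh"), "g": {"g"}, "h": {"h"},
}

def benefit(v, selected):
    """B(v, S): sum over w ≼ v of max(C(u) - C(v), 0), u cheapest in S above w."""
    total = 0
    for w in DESC[v]:
        cheapest = min(COST[u] for u in selected if w in DESC[u])
        total += max(cheapest - COST[v], 0)
    return total

def select_within_space(top, budget):
    """Greedily add the view with the best benefit per unit space that still fits."""
    selected, used = [top], 0
    while True:
        fits = [
            v for v in COST
            if v not in selected
            and used + COST[v] <= budget
            and benefit(v, selected) > 0
        ]
        if not fits:
            return selected
        best = max(fits, key=lambda v: benefit(v, selected) / COST[v])
        selected.append(best)
        used += COST[best]

small_budget = select_within_space("a", 11)
```

Ranking by benefit per unit space favours small views: with a budget of 11 this sketch picks g and then h.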

38

Conclusions

Materialization of views is an essential query optimization strategy for decision-support applications.

There is good reason to materialize part of the data cube rather than all of it.

The lattice framework models multidimensional analysis very well.

39

Conclusions (cont.)

Finding the optimal solution for an irregular lattice is NP-hard.

Introduction of the greedy algorithm
    The greedy algorithm works on this lattice and picks almost the right views to materialize.

40

Conclusions (the end)

There exist cases in which the greedy algorithm fails to produce the optimal solution.

But the greedy algorithm has guaranteed performance.

Extensions of the greedy algorithm.

41

Reference

Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. SIGMOD’96:205-216.

42

Thank you~

Q & A Section
