Implementing Data Cubes Efficiently

Vicky :: Cao Hui Ping, Sherman :: Chow Sze Ming, CTH :: Chong Tsz Ho, Ronald :: Woo Lok Yan, Ken :: Yiu Man Lung


Page 1: Implementing Data Cubes  Efficiently

Implementing Data Cubes Efficiently

Vicky :: Cao Hui Ping
Sherman :: Chow Sze Ming
CTH :: Chong Tsz Ho
Ronald :: Woo Lok Yan
Ken :: Yiu Man Lung

Page 2: Implementing Data Cubes  Efficiently

Content

Background
Introduction of Datacube
Problem defined
Lattice model
Greedy algorithm
  How to do?
  How good?
  How bad?
Evaluations
Conclusion

Page 3: Implementing Data Cubes  Efficiently

Background

DSS (Decision Support System)
  Gain competitiveness for business

Data warehouse
  Maintain historical information
  Use “Data cube” to summarize results
  Identify trends
  Performance issue (time and space)
  Need to reuse result (materialization of views)

Page 4: Implementing Data Cubes  Efficiently

Introduction of datacube

Datacube
  Dimensionality (number of GROUP-BYs)
  Aggregated data: values in each cell
  The dimension of the datacube determines the detail of the summary
  Higher dimension gives higher detail

Common operations
  Drill down: look in more detail
  Roll up: look in less detail

Page 5: Implementing Data Cubes  Efficiently

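What is a data cube?

[Figure: a three-dimensional sales cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC) and Country (U.S.A, Canada, Mexico), each axis with an extra "sum" margin. Highlighted cell: total annual sales of TV in U.S.A.]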
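The "sum" margins of the cube can be illustrated with a short sketch. The sales rows below are made up for illustration (the slide gives no numbers), and `data_cube` is a hypothetical helper name. For each row, every subset of the three dimensions is kept and the others are collapsed into "sum", which yields all 2^3 group-bys at once.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("date", "product", "country")

# Hypothetical sales facts: (date, product, country, amount).
SALES = [
    ("1Qtr", "TV", "U.S.A", 10),
    ("2Qtr", "TV", "U.S.A", 20),
    ("1Qtr", "PC", "Canada", 5),
    ("3Qtr", "VCR", "Mexico", 7),
]

def data_cube(rows):
    """Aggregate every row into all 2^n group-by cells, using "sum" for collapsed dimensions."""
    cube = defaultdict(int)
    for *keys, amount in rows:
        for r in range(len(DIMS) + 1):
            for kept in combinations(range(len(DIMS)), r):
                cell = tuple(keys[i] if i in kept else "sum" for i in range(len(DIMS)))
                cube[cell] += amount
    return dict(cube)

cube = data_cube(SALES)
print(cube[("sum", "TV", "U.S.A")])   # total annual sales of TV in U.S.A: 30
print(cube[("sum", "sum", "sum")])    # grand total: 42
```

Rolling up corresponds to replacing a dimension value with "sum"; drilling down goes the other way.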

Page 6: Implementing Data Cubes  Efficiently

Our problem

Physically materialize the whole data cube
  Best query response
  Heavy pre-computing, large storage space
  i.e. time efficient but space inefficient

Materialize nothing
  Worst query response
  Dynamic query evaluation, less storage space
  i.e. space efficient but time inefficient

Page 7: Implementing Data Cubes  Efficiently

Problem on materialized views

Materialize only part of the data cube
  Balance the storage space and the response time
  What is the best subset of views to materialize?
  Addressed in this paper

Source             Size         Time (sec)   Ratio
From cell itself   1            2.07         N/A
View (s)           10,000       2.38         0.000031
View (p,s)         800,000      20.77        0.000023
View (p,s,c)       6,000,000    226.23       0.000037
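The slide does not say how the Ratio column is computed. One reading that reproduces all three printed values is the extra time per row relative to the single-cell baseline, i.e. (Time – 2.07) / Size. The sketch below checks that reading; treating 2.07 seconds as a fixed overhead is an assumption, not something the slide states.

```python
BASELINE_SECONDS = 2.07  # time to answer from the cell itself (assumed fixed overhead)

# (rows in the source view, measured time in seconds), copied from the table
MEASUREMENTS = {
    "s": (10_000, 2.38),
    "ps": (800_000, 20.77),
    "psc": (6_000_000, 226.23),
}

def ratio(size, seconds):
    """Extra seconds per row beyond the baseline (assumed meaning of the Ratio column)."""
    return (seconds - BASELINE_SECONDS) / size

for view, (size, seconds) in MEASUREMENTS.items():
    print(f"View ({view}): {ratio(size, seconds):.6f}")
```

The per-row cost stays roughly constant across views whose sizes differ by a factor of 600, which is what justifies measuring the cost of a query by the number of rows in the view used to answer it.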

Page 8: Implementing Data Cubes  Efficiently

Data? View?

We use the data cube to model aggregate data.

So what do we use to model views?

Lattice!

Page 9: Implementing Data Cubes  Efficiently

Example of lattice diagram

8 possible groupings on the dimensions
  p for Part
  s for Supplier
  c for Customer
# of rows of data shown next to the grouping

psc 6M
pc 6M    ps 0.8M    sc 6M
p 0.2M   s 0.01M    c 0.1M
none 1

An example of Regular Lattice
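The eight groupings are simply all subsets of the three dimensions. A minimal sketch that enumerates them level by level, with the row counts copied from the diagram (`groupings` and `ROWS` are illustrative names, not from the slides):

```python
from itertools import combinations

# Row counts from the lattice diagram (6M = 6,000,000 rows, and so on).
ROWS = {
    "psc": 6_000_000,
    "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
    "p": 200_000, "s": 10_000, "c": 100_000,
    "none": 1,
}

def groupings(dims):
    """Return the lattice levels, from the top view down to the empty grouping."""
    levels = []
    for r in range(len(dims), -1, -1):
        levels.append(["".join(combo) or "none" for combo in combinations(dims, r)])
    return levels

for level in groupings("psc"):
    print("   ".join(f"{view} {ROWS[view]:,}" for view in level))
```

With n dimensions this produces 2^n views, which is why materializing everything quickly becomes expensive.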

Page 10: Implementing Data Cubes  Efficiently

≼ operator

Suppose c ≼ d
  The view d can be used to derive the view c
  d is an ancestor of c in the lattice diagram

Imposes a partial order on the views

Usage on dimensions
  (part) ≼ (part, customer)
  (part) ⋠ (customer)

Usage within attribute values
  (year) ≼ (quarter) ≼ (month) ≼ (day)
  (year) ≼ (quarter) ≼ (week) ≼ (day)

[Figure: lattice over the time attributes day, week, month, quarter and year]

An example of Irregular Lattice
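For groupings on dimensions, the ≼ test reduces to a subset check on the grouping attributes. The sketch below covers only that case (attribute hierarchies such as the time example need an explicit edge list instead); `precedes` and `ancestors` are illustrative names.

```python
from itertools import combinations

DIMS = ("part", "supplier", "customer")
VIEWS = [frozenset(combo) for r in range(len(DIMS) + 1) for combo in combinations(DIMS, r)]

def precedes(c, d):
    """True when c ≼ d, i.e. view c can be derived from view d."""
    return set(c) <= set(d)

def ancestors(view):
    """All views that can be used to derive `view` (including itself)."""
    return [d for d in VIEWS if precedes(view, d)]

print(precedes({"part"}, {"part", "customer"}))  # True
print(precedes({"part"}, {"customer"}))          # False
print(precedes({"customer"}, {"part"}))          # False: the two views are incomparable
```

Because some pairs are incomparable in both directions, ≼ is only a partial order, not a total one.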

Page 11: Implementing Data Cubes  Efficiently

Regular lattices with equal domain size

Grouping attributes: A1, A2, …, An (domain size: r)
Attribute for aggregation: B
Efficient algorithm
  m: # of rows in top view
  k = ⌈log_r m⌉

Strategy        k, j, and n          Space            Time
Space-optimal                        m                m·2^n
Time-optimal    k > j                (2rr/(r+1))n     (2rr/(r+1))n
                k < j and k ≤ n/2    m                m·2^n
                k < j and k > n/2    m                C(n,j)·r^j

Page 12: Implementing Data Cubes  Efficiently

The problem

The previous technique cannot be applied to irregular lattices
Irregular lattices are common in data warehouses
Optimizing the choice of views for an irregular lattice is an NP-complete problem (inefficient!)
Use a greedy algorithm, i.e. use heuristics to obtain an approximate solution

Page 13: Implementing Data Cubes  Efficiently

Greedy algorithm

Being as greedy as possible in each step!!

Simple example: use the smallest number of coins to pay 50 cents

Suppose we have many coins of 20 cents, 10 cents and 5 cents.

Page 14: Implementing Data Cubes  Efficiently

How to be greedy?

Common sense approach:
  Select the largest coin: 20 cents
  Select the largest coin again: 20 cents
  Remaining amount = 50 – 20 – 20 = 10 cents
  We cannot select the largest coin again.
  We choose the 2nd largest coin, 10 cents, instead.

Only 3 coins are needed! Optimal solution!
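The coin steps above can be written as a few lines of code. This is a minimal sketch (`greedy_coins` is an illustrative name); it always takes the largest coin that still fits.

```python
def greedy_coins(amount, denominations):
    """Pay `amount` by repeatedly taking the largest coin that still fits."""
    coins = []
    for coin in sorted(denominations, reverse=True):
        while amount >= coin:
            coins.append(coin)
            amount -= coin
    if amount != 0:
        raise ValueError("amount cannot be paid exactly with these coins")
    return coins

print(greedy_coins(50, [20, 10, 5]))  # [20, 20, 10]
```

For these denominations the greedy choice happens to be optimal. That is not true of greedy algorithms in general, which is the point the later "bad example" makes for view selection.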

Page 15: Implementing Data Cubes  Efficiently

Definition of “benefit of view”

C(v) denotes the cost of view v
B(v,S) denotes the benefit of view v relative to a set of views S

For each w ≼ v:
  Let u be the view of least cost in S such that w ≼ u
  B_w = max{ C(u) – C(v), 0 }

B(v,S) = Σ_{w ≼ v} B_w
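A minimal sketch of this definition, applied to the part/supplier/customer lattice shown earlier. It assumes that the cost C(v) of a view is its row count and that ≼ is the subset test on grouping attributes; `benefit` and `precedes` are illustrative names.

```python
# Cost of each view = its number of rows (assumption: linear cost model).
COST = {
    "psc": 6_000_000,
    "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
    "p": 200_000, "s": 10_000, "c": 100_000,
    "": 1,  # the "none" grouping
}

def precedes(w, v):
    """w ≼ v: view w can be computed from view v."""
    return set(w) <= set(v)

def benefit(v, selected):
    """B(v, S): total cost saved over all views w ≼ v if v is added to S."""
    total = 0
    for w in COST:
        if precedes(w, v):
            cheapest = min(COST[u] for u in selected if precedes(w, u))
            total += max(cheapest - COST[v], 0)
    return total

S = {"psc"}  # only the top view is materialized so far
print(benefit("ps", S))  # 4 views (ps, p, s, none) each save 5,200,000 rows: 20800000
print(benefit("pc", S))  # pc is as large as psc, so nothing is saved: 0
```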

Page 16: Implementing Data Cubes  Efficiently

Greedy algorithm

In each step
  Select the view with the most benefit
  Add it to the result

Algorithm
  S = {top view};
  for i = 1 to k {
    select view v not in S such that B(v,S) is maximized
    S = S union {v}
  }
  return S;
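The pseudocode translates almost line by line into Python. The sketch below runs it on the part/supplier/customer lattice from the earlier diagram, again assuming that cost equals row count and that ≼ is the subset test; the function names are illustrative.

```python
COST = {
    "psc": 6_000_000,
    "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
    "p": 200_000, "s": 10_000, "c": 100_000,
    "": 1,  # the "none" grouping
}

def precedes(w, v):
    """w ≼ v: view w can be computed from view v."""
    return set(w) <= set(v)

def benefit(v, selected):
    """B(v, S) as defined on the previous slide."""
    total = 0
    for w in COST:
        if precedes(w, v):
            cheapest = min(COST[u] for u in selected if precedes(w, u))
            total += max(cheapest - COST[v], 0)
    return total

def greedy(k, top="psc"):
    """Start from the top view, then add the most beneficial view k times."""
    selected = [top]
    for _ in range(k):
        candidates = [v for v in COST if v not in selected]
        selected.append(max(candidates, key=lambda v: benefit(v, selected)))
    return selected

print(greedy(2))  # ['psc', 'ps', 'c']
```

Note that the benefit is recomputed in every round, because a view's benefit shrinks once a cheaper ancestor of its descendants has been selected.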

Page 17: Implementing Data Cubes  Efficiently

Selecting the first view

After selecting coins, let us get back to our problem: selecting views.

We must materialize the top view, i.e. the view grouping by all attributes
  It cannot be constructed from other views
  It avoids going to the raw data

Page 18: Implementing Data Cubes  Efficiently

Selecting k more views

Space is limited! Suppose we can only select k more views.

For each view which is not yet selected, calculate the benefit of materializing it.

Pick the one with maximum benefit!!!

Let’s set k = 2 for the example.

Page 19: Implementing Data Cubes  Efficiently

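Example

[Figure: example lattice of eight views labelled with their costs: a 100 at the top; b 50 and c 75 on the second level; d 20, e 30 and f 40 on the third level; g 1 and h 10 at the bottom]

E.g. the cost of constructing view b given the view a is 100.

If we choose b to materialize, the new cost of constructing view b is 50.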

Page 20: Implementing Data Cubes  Efficiently

First round

[Figure: the example lattice from Page 19]

Notice that not only b, but also d, e, g and h can be calculated from b.

So the total benefit is (100 – 50) x 5 = 250

Page 21: Implementing Data Cubes  Efficiently

Continue…

Similarly, the benefit of materializing c is (100 – 75) x 5 = 125

[Figure: the example lattice from Page 19]

Benefit
b 250
c 125

Page 22: Implementing Data Cubes  Efficiently

Not yet finished…

For e, Benefit = (100 – 30) x 3 = 210

[Figure: the example lattice from Page 19]

Benefit
b 250
c 125
e 210

Page 23: Implementing Data Cubes  Efficiently

Let’s choose b!

[Figure: the example lattice from Page 19]

For d and f, Benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120 respectively.

Benefit
b 250
c 125
d 160
e 210
f 120

Page 24: Implementing Data Cubes  Efficiently

Next round?

It seems we should choose e next, as it has the second largest benefit.

Let’s see what will happen in the second round.

Benefit

b 250

c 125

d 160

e 210

f 120

Page 25: Implementing Data Cubes  Efficiently

Second round!

[Figure: the example lattice from Page 19]

Now, only c and f get benefit if we materialize c (since e, g and h can be calculated more efficiently by using b).

Benefit = (100 – 75) x 2 = 50

Benefit
c 50

Page 26: Implementing Data Cubes  Efficiently

How about choosing f?

[Figure: the example lattice from Page 19]

If we choose f, we find that h can be calculated more efficiently by using f instead of b.

Benefit = (100 – 40) + (50 – 40) = 70

Benefit
c 50
f 70

Page 27: Implementing Data Cubes  Efficiently

Easy to work out others

Benefit of d = (50 – 20) x 2 = 60
Benefit of e = (50 – 30) x 3 = 60
Benefit of g = 50 – 1 = 49
Benefit of h = 50 – 10 = 40

[Figure: the example lattice from Page 19]

Page 28: Implementing Data Cubes  Efficiently

Observation

In the first round, the benefit of choosing f (only 120) is far from the best choice (250).

But in the second round, choosing f gives the maximum benefit!

1st round   Benefit
b           250
c           125
d           160
e           210
f           120

2nd round   Benefit
c           50
d           60
e           60
f           70
g           49
h           40
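The two rounds can be recomputed mechanically as a check on the tables. The costs below come from the example lattice. The edge list is not preserved in the transcript, so it is an assumption, chosen to match the descendant counts used on the slides (5 views under b and c, 3 under e, 2 under d and f).

```python
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# Assumed edges (view -> views directly computable from it).
CHILDREN = {
    "a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
    "d": ["g"], "e": ["g", "h"], "f": ["h"], "g": [], "h": [],
}

def descendants(v):
    """All views w with w ≼ v, including v itself."""
    found = {v}
    for child in CHILDREN[v]:
        found |= descendants(child)
    return found

def benefit(v, selected):
    """B(v, S) from the definition of the benefit of a view."""
    total = 0
    for w in descendants(v):
        cheapest = min(COST[u] for u in selected if w in descendants(u))
        total += max(cheapest - COST[v], 0)
    return total

def round_benefits(selected):
    """Benefit of every view that has not been selected yet."""
    return {v: benefit(v, selected) for v in COST if v not in selected}

print(round_benefits(["a"]))       # first round: b has the largest benefit, 250
print(round_benefits(["a", "b"]))  # second round: f has the largest benefit, 70
```

Under these assumptions the first round also gives g a benefit of 99 and h a benefit of 90, which the slide tables omit, and the second round confirms that e drops from 210 to 60 once b is materialized.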

Page 29: Implementing Data Cubes  Efficiently


Simple? Optimal?

Trade off again! This simple algorithm is not optimal in all cases!

Consider the following case…

Page 30: Implementing Data Cubes  Efficiently

Bad example

[Figure: top view a (cost 200) above views b (100), c (99) and d (100); beneath them, groups labelled "20 nodes, Total 1000"]

Page 31: Implementing Data Cubes  Efficiently

Bad example

[Figure: the bad-example lattice from Page 30]

Choose c
Benefit = (200 – 99) x (1 + 20 + 20) = 4141 = maximum

Page 32: Implementing Data Cubes  Efficiently

Bad example

[Figure: the bad-example lattice from Page 30]

Now choose either one of b and d (same benefit)

Page 33: Implementing Data Cubes  Efficiently

Bad example

[Figure: the bad-example lattice from Page 30]

How about these? Very expensive!!!

Page 34: Implementing Data Cubes  Efficiently

Optimal solution should be…

[Figure: the bad-example lattice from Page 30]

Only c is a little bit expensive.
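The gap between the greedy and the optimal choice can be reproduced with a small brute-force check. The transcript only preserves the costs of a, b, c, d and the labels "20 nodes" and "Total 1000", so the layout below is an assumption: four groups of 20 nodes costing 50 each, with b above groups 0 and 1, c above groups 1 and 2, and d above groups 2 and 3. This layout is consistent with the (1 + 20 + 20) factor on the slide.

```python
from itertools import combinations

GROUP_SIZE = 20
LEAF_COST = 1000 // GROUP_SIZE                     # assumed: 50 per node
COVERS = {"b": (0, 1), "c": (1, 2), "d": (2, 3)}   # assumed layout of the groups

COST = {"a": 200, "b": 100, "c": 99, "d": 100}
DESC = {v: {v} for v in COST}                      # views computable from each view
for group in range(4):
    for i in range(GROUP_SIZE):
        leaf = f"n{group}_{i}"
        COST[leaf] = LEAF_COST
        DESC[leaf] = {leaf}
        for parent, groups in COVERS.items():
            if group in groups:
                DESC[parent].add(leaf)
DESC["a"] = set(COST)                              # the top view derives everything

def total_cost(selected):
    """Cost of answering every view from its cheapest materialized ancestor."""
    return sum(min(COST[u] for u in selected if w in DESC[u]) for w in COST)

def gain(selected):
    """Total benefit of `selected` compared with materializing only a."""
    return total_cost(["a"]) - total_cost(selected)

def greedy(k):
    selected = ["a"]
    for _ in range(k):
        candidates = [v for v in COST if v not in selected]
        selected.append(max(candidates, key=lambda v: gain(selected + [v])))
    return selected

def best_subset(k):
    """Exhaustive search over all choices of k extra views."""
    others = [v for v in COST if v != "a"]
    return max(combinations(others, k), key=lambda extra: gain(["a", *extra]))

print(greedy(2), gain(greedy(2)))                       # ['a', 'c', 'b'] 6241
print(best_subset(2), gain(["a", *best_subset(2)]))     # ('b', 'd') 8200
```

Under these assumptions greedy collects 6241 of the optimal 8200, about 76%: c wins the first round by a margin of only 41, but it overlaps with whichever of b and d is picked next.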

Page 35: Implementing Data Cubes  Efficiently

A theoretical result

It can be proved that the greedy algorithm achieves at least (e – 1) / e (which is about 63%) of the benefit of the optimal selection.
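Written as a formula, with B_greedy and B_opt as shorthand (not from the slides) for the total benefit of the greedy selection and of the best possible selection of the same number of views:

```latex
\frac{B_{\text{greedy}}}{B_{\text{opt}}} \;\ge\; \frac{e-1}{e} \;=\; 1 - \frac{1}{e} \;\approx\; 0.632
```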

Page 36: Implementing Data Cubes  Efficiently

Extensions (1)

Problem
  The views in a lattice are unlikely to have the same probability of being requested in a query.

Solution
  We can weight each benefit by its probability.

Page 37: Implementing Data Cubes  Efficiently

Extensions (2)

Problem
  Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views.

Solution
  We can consider the “benefit of each view per unit space”.
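Both extensions fit into the same greedy loop. The sketch below is one possible reading, not the authors' implementation: each term B_w is weighted by an assumed query probability for w, candidates are ranked by weighted benefit divided by size, and selection stops when nothing with a positive score fits in the remaining space. It reuses the part/supplier/customer lattice, with row counts serving as both cost and size; the budget counts only the views added on top of the top view.

```python
ROWS = {
    "psc": 6_000_000,
    "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
    "p": 200_000, "s": 10_000, "c": 100_000,
    "": 1,  # the "none" grouping
}

def precedes(w, v):
    """w ≼ v: view w can be computed from view v."""
    return set(w) <= set(v)

def weighted_benefit(v, selected, prob):
    """B(v, S) with each term B_w weighted by the probability of querying w."""
    total = 0.0
    for w in ROWS:
        if precedes(w, v):
            cheapest = min(ROWS[u] for u in selected if precedes(w, u))
            total += prob[w] * max(cheapest - ROWS[v], 0)
    return total

def greedy_by_space(budget, prob, top="psc"):
    """Pick views by benefit per unit space until the extra space budget is used up."""
    selected, used = [top], 0
    while True:
        scored = [
            (weighted_benefit(v, selected, prob) / ROWS[v], v)
            for v in ROWS
            if v not in selected and used + ROWS[v] <= budget
        ]
        scored = [(score, v) for score, v in scored if score > 0]
        if not scored:
            return selected
        _, best = max(scored)
        selected.append(best)
        used += ROWS[best]

uniform = {v: 1.0 for v in ROWS}  # hypothetical: every view equally likely to be queried
print(greedy_by_space(1_000_000, uniform))  # ['psc', '', 's', 'c', 'p']
```

Compared with picking a fixed number of views, ranking by benefit per unit space favours small views first; here the large view ps no longer fits once the cheaper views have been taken.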

Page 38: Implementing Data Cubes  Efficiently

Conclusions

Materialization of views is an essential query optimization strategy for decision-support applications.

There is good reason to materialize some part of the data cube but not all of it.

A lattice framework models multidimensional analysis very well.

Page 39: Implementing Data Cubes  Efficiently

Conclusions (cont.)

Finding the optimal solution in the case of an irregular lattice is NP-hard.

Introduction of the greedy algorithm
  The greedy algorithm works on this lattice and picks almost the right views to materialize.

Page 40: Implementing Data Cubes  Efficiently

Conclusions (the end)

There exist cases in which the greedy algorithm fails to produce the optimal solution.

But the greedy algorithm has guaranteed performance.

Extensions of the greedy algorithm.

Page 41: Implementing Data Cubes  Efficiently


Reference

Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. SIGMOD’96:205-216.

Page 42: Implementing Data Cubes  Efficiently


Thank you~

Q & A Section