october 10, 20151 efficient methods for data cube computation and data generalization chapter 4...

27
June 20, 2022 1 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

Upload: reynold-barber

Post on 12-Jan-2016

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 1

Efficient Methods for Data Cube Computation and Data

Generalization

Chapter 4 (4.1)

Page 2: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

Data Generalization

It is a process of abstracting conceptual level knowledge from large set of task-relevant data.

Two types of analysis : Descriptive data mining: Describes data in a concise

manner, highlighting interesting general properties. Supports interest.

Predictive data mining: constructs a model and attempts to predict behavior of new data. ( classification, regression….)

April 21, 2023 2

Page 3: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 3

A Data Cube (MOLAP)

• Fast on-line analytical processing takes minimum time if aggregates for all the cuboids are precomputed.

• Pre-computation of the full cube requires excessive amount of memory and depends on number of dimensions and cardinality of dimensions.

• For many cells in a cuboid the measure value is zero and cells are of little or no interest. Cuboids are often sparse.

Page 4: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

Partial Materialization Precomputation of some of the cuboids in advance

leads to fast response time and avoids redundant computations during on-line analytical processing.

Data cube materialization/ pre-computation No materialization: Don’t precompute any of the

non-base cuboid. Leads to multidimensional aggregation on the fly and is slow.

Full materialization: Precompute all the cubes. Running queries will be very fast. Requires huge memory.

Partial Materialization: Selectively compute a proper subset of the cuboids, which contains only those cells that satisfy some user specified criterion.

April 21, 2023 4

Page 5: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 5

Outline

Types of cells : Base cell, aggregate cell, cell relationship

Types of Cubes : Full cube, Iceberg Cube, Closed Cube, Shell

Cube

Efficient Computation of Data Cubes

Multiway Array Aggregation

BUC

Star Cubing

Page 6: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 6

A Data Cube: sales

10 11 12 3 10 1

11 9 6 9 6 7

12 9 8 5 7 3

I1 I2 I3

New York

Chicago

Toronto

47

48

44

All

46 37 36 22 29 14 184All

Bra

nc

h

Product

Aggregate cell Base cell Apex CuboidProduct cuboid

Branch cuboidBase cuboid

Vancouver

I5 I6I4

5 8 1013 6 3 45

Page 7: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 7

- Types of cells

Types of cells Base cell: a cell which belongs to a base cuboid Aggregate cell: a cell which belongs to a non-base cuboid Each aggregate dimension is indicated by a “*”

Ancestor-descendent relationship between cells: dimensions are (branch, product, year)

1-D cell c1 = (New York, *, *, 2000) is an ancestor of a

2-D cell c2 = (New York, I1, *, 400) and a 3-D cell c3 = (New York, I1, 2013, 111). c3 is a descendent of c1 and c2;

In an n-D data cube an i-D cell a=(a1,a2,…an,measure_a) is an ancestor of a j-D cell b=(b1,b2,…bn,measure_b) if

1) i<j and 2) for 1≤m≤n am=bm whenever am

3) if j=i+1 a is called parent of b or b is a child of a""

Page 8: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 8

- Types of cubes

Full cube: All cells and cuboids are materialized. All possible combination of dimensions and values.

or

Iceberg cube: Partial materialization. Materializing only the cells in a cuboid whose measure value is above the minimum threshold. count(*) >= min support Iceberg Condition

Closed cube: No ancestor cell is created if its measure is equal to that of its descendent cell.

Shell cube: Only cuboids with limited number of dimensions are created.

n2 11

n

iiL

Page 9: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

Two base cells {(a1,a2,….a100):10, (a1,a2,b1,…b100):10}

How many sub-patterns for first base cell

Total number of aggregate cells is Ignore all of the aggregate cells that can be obtained by

replacing some constants by “*” while keeping the same measure value.

Only 3 really offer new information. {(a1,a2,….a100):10, (a1,a2,b1,…b100):10, (a1,a2,*…,*):20}

April 21, 2023 9

62101

Page 10: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

Example

Which are the closed cells?

Similarly we can also get

April 21, 2023 10

can be derived from the closed cell

Page 11: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

11

Iceberg Cube, Closed Cube & Cube Shell

Is iceberg cube good enough? 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}

How many cells will the iceberg cube have if having count(*) >= 10? Hint: A huge but tricky number!

Close cube: Closed cell c: if there exists no cell d, s.t. d is a descendant of

c, and d has the same measure value as c. Closed cube: a cube consisting of only closed cells What is the closed cube of the above base cuboid? Hint: only

3 cells Cube Shell

Precompute only the cuboids involving a small # of dimensions, e.g., 3

More dimension combinations will need to be computed on the fly

For (A1, A2, … A10), how many combinations to compute?

Page 12: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 12

Outline

Types of cells

Types of Cubes

Efficient Computation of Data Cubes

Multiway Array Aggregation

BUC

Star Cubing

Page 13: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 13

- Efficient Computation of Data Cubes

Preliminary cube computation tricks

Computing full/iceberg cubes: 2 methodologies

Top-Down: Multi-Way array aggregation

Bottom-Up: Bottom-up computation: BUC Star-Cubing: Integrates top-down and bottom-up

Page 14: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 14

-- Preliminary Cube Computation Tricks

Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples. (ROLAP)

Aggregates may be computed from previously computed aggregates, rather than from the base fact table

Cache-results: accumulating results of already computed cuboid to reduce disk I/Os. Higher-level aggregates are computed from lower-level aggregates rather than base facts.

Smallest-child: computing a cuboid from the smallest, previously computed cuboid. Cbranch C{ branch, year}, C{branch, item}

Amortize-scans: computing as many as possible cuboids at the same time to amortize disk reads

Share-sorts: sharing sorting costs cross multiple cuboids when sort-based method is used

Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used

Page 15: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 15

-- Multi-Way Array Aggregation …

Used for MOLAP and full cube computation

Array-based “bottom-up” algorithm

Using multi-dimensional chunks

Simultaneous aggregation on multiple dimensions

Intermediate aggregate values are re-used for computing ancestor cuboids

Cannot do Apriori pruning: No iceberg optimization

a ll

A B

A B

A B C

A C B C

C

Page 16: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 16

… -- Multi-way Array Aggregation …

Partition arrays into chunks (a small subcube which fits in memory). Compressed sparse array addressing: (chunk_id, offset) Compute aggregates in “multiway” by visiting cube cells in the

order which minimizes the # of times to visit each cell, and reduces memory access and storage cost.

What is the best traversing order to do multi-way aggregation?

A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

B

4428 56

4024 52

3620

60

Page 17: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 17

… -- Multi-way Array Aggregation …

A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

4428 56

4024 52

3620

60

B

Page 18: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 18

… -- Multi-way Array Aggregation …

A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

4428 56

4024 52

3620

60

B

AB requires longest scan, i.e scanning of 49th chunk

Page 19: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 19

… -- Multi-way Array Aggregation …

Assume the sizes of dimension, A, B, and C are 40, 400, 4000 respectively.

Therefore AB is the smallest and AC is the largest 2-D planes

If chunks are scanned as 1, 2, 3, … then 156,000 memory units are needed (40*400+40*1000+100*1000)

If chunks are scanned as 1, 17, 33, 49, 5, 21,37 …then 1,641,000 memory units are needed (aggregation ordering AB-AC-BC). Chunk memory units needed are (400*4000+40*1000+10*10*100)

Page 20: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 20

… -- Multi-way Array Aggregation …

a ll

A B

A B

A B C

A C B C

C

ABC

All

A B C

BCACAB

Needs 156,000Memory units

Needs 1,641,000 Memory units

Page 21: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 21

… -- Multi-way Array Aggregation

Method: the planes should be sorted and computed according to their size in ascending order

Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane

Limitation of the method: computing well only for a small number of dimensions

If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored

Page 22: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 22

-- Bottom-Up Computation (BUC) …

Bottom-up cube computation

(Note: top-down in our view!)

Divides dimensions into partitions and facilitates iceberg pruning

If a partition does not satisfy min_sup, its descendants can be pruned

If minsup = 1 compute full CUBE!

No simultaneous aggregation

a ll

A B C

A C B C

A B C A B D A C D B C D

A D B D C D

D

A B C D

A B

1 a ll

2 A 1 0 B 1 4 C

7 A C 1 1 B C

4 A B C 6 A B D 8 A C D 1 2 B C D

9 A D 1 3 B D 1 5 C D

1 6 D

5 A B C D

3 A B

Page 23: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

2323

BUC: Partitioning

Usually, entire data set can’t fit in main memory

Sort distinct values partition into blocks that fit

Continue processing Optimizations

Partitioning External Sorting, Hashing, Counting

Sort Ordering dimensions to encourage

pruning Cardinality, Skew, Correlation Higher the cardinality-smaller the

partitions-greater pruning opportunity Collapsing duplicates

Can’t do holistic aggregates anymore!

Ideally the dimension with most discriminative, higher cardinality and having less skew is processed first.

Page 24: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 24

--- BUC: Example (Having count(*) > 5) …

5 1 0 1

1 0 8 9

1 1 1 8

5 1 0 1

1 0 8 9

1 1 1 8

Q1 Q2 Q3 Q4

I1

I2

I3

New York

Toronto

3 1 1 2

8 1 2 1

2 1 11 8

New-York

Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4

I1

I2

I3

Toronto

1

1

8

12

3

I1

I2

I3

Page 25: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

April 21, 2023 25

… --- BUC: Example (Having count(*) > 5)

77

20 5 23 29

All

B,P,Q

P,B Q,B Q,P

B P Q

1

2

3

7

46

5

8 2 1 3

9 1 10 10

3 2 12 16

Q1 Q2 Q3 Q4

Q1 Q2 Q3 Q4

All

Q

Q,P

I1

I2

I3

Page 26: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

Till Now

Aggregates simultaneously on multiple dimensions.

Multiple cuboids can be computed simultaneously in one pass. Dynamic structure with simultaneous aggregation.

Facilitates a-priori pruning.

During partitioning, each partition’s count is compared with min sup. The recursion stops if the count does not satisfy min sup.

April 21, 2023 26

Page 27: October 10, 20151 Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1)

Summary

Data Cube Materialization Full Materialization Partial Materialization: iceberg cubes, shell fragments

Data Cube Computation Methods Multiway array aggregation BUC for computing iceberg cubes

Next Class Star Cubing Shell Fragments for Fast High-Dimensional OLAP Exploration and Discovery in Multidimensional

Databases

April 21, 2023 27