october 10, 20151 efficient methods for data cube computation and data generalization chapter 4...
TRANSCRIPT
April 21, 2023 1
Efficient Methods for Data Cube Computation and Data
Generalization
Chapter 4 (4.1)
Data Generalization
It is a process of abstracting conceptual level knowledge from large set of task-relevant data.
Two types of analysis : Descriptive data mining: Describes data in a concise
manner, highlighting interesting general properties. Supports interest.
Predictive data mining: constructs a model and attempts to predict behavior of new data. ( classification, regression….)
April 21, 2023 2
April 21, 2023 3
A Data Cube (MOLAP)
• Fast on-line analytical processing takes minimum time if aggregates for all the cuboids are precomputed.
• Pre-computation of the full cube requires excessive amount of memory and depends on number of dimensions and cardinality of dimensions.
• For many cells in a cuboid the measure value is zero and cells are of little or no interest. Cuboids are often sparse.
Partial Materialization Precomputation of some of the cuboids in advance
leads to fast response time and avoids redundant computations during on-line analytical processing.
Data cube materialization/ pre-computation No materialization: Don’t precompute any of the
non-base cuboid. Leads to multidimensional aggregation on the fly and is slow.
Full materialization: Precompute all the cubes. Running queries will be very fast. Requires huge memory.
Partial Materialization: Selectively compute a proper subset of the cuboids, which contains only those cells that satisfy some user specified criterion.
April 21, 2023 4
April 21, 2023 5
Outline
Types of cells : Base cell, aggregate cell, cell relationship
Types of Cubes : Full cube, Iceberg Cube, Closed Cube, Shell
Cube
Efficient Computation of Data Cubes
Multiway Array Aggregation
BUC
Star Cubing
April 21, 2023 6
A Data Cube: sales
10 11 12 3 10 1
11 9 6 9 6 7
12 9 8 5 7 3
I1 I2 I3
New York
Chicago
Toronto
47
48
44
All
46 37 36 22 29 14 184All
Bra
nc
h
Product
Aggregate cell Base cell Apex CuboidProduct cuboid
Branch cuboidBase cuboid
Vancouver
I5 I6I4
5 8 1013 6 3 45
April 21, 2023 7
- Types of cells
Types of cells Base cell: a cell which belongs to a base cuboid Aggregate cell: a cell which belongs to a non-base cuboid Each aggregate dimension is indicated by a “*”
Ancestor-descendent relationship between cells: dimensions are (branch, product, year)
1-D cell c1 = (New York, *, *, 2000) is an ancestor of a
2-D cell c2 = (New York, I1, *, 400) and a 3-D cell c3 = (New York, I1, 2013, 111). c3 is a descendent of c1 and c2;
In an n-D data cube an i-D cell a=(a1,a2,…an,measure_a) is an ancestor of a j-D cell b=(b1,b2,…bn,measure_b) if
1) i<j and 2) for 1≤m≤n am=bm whenever am
3) if j=i+1 a is called parent of b or b is a child of a""
April 21, 2023 8
- Types of cubes
Full cube: All cells and cuboids are materialized. All possible combination of dimensions and values.
or
Iceberg cube: Partial materialization. Materializing only the cells in a cuboid whose measure value is above the minimum threshold. count(*) >= min support Iceberg Condition
Closed cube: No ancestor cell is created if its measure is equal to that of its descendent cell.
Shell cube: Only cuboids with limited number of dimensions are created.
n2 11
n
iiL
Two base cells {(a1,a2,….a100):10, (a1,a2,b1,…b100):10}
How many sub-patterns for first base cell
Total number of aggregate cells is Ignore all of the aggregate cells that can be obtained by
replacing some constants by “*” while keeping the same measure value.
Only 3 really offer new information. {(a1,a2,….a100):10, (a1,a2,b1,…b100):10, (a1,a2,*…,*):20}
April 21, 2023 9
62101
Example
Which are the closed cells?
Similarly we can also get
April 21, 2023 10
can be derived from the closed cell
11
Iceberg Cube, Closed Cube & Cube Shell
Is iceberg cube good enough? 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}
How many cells will the iceberg cube have if having count(*) >= 10? Hint: A huge but tricky number!
Close cube: Closed cell c: if there exists no cell d, s.t. d is a descendant of
c, and d has the same measure value as c. Closed cube: a cube consisting of only closed cells What is the closed cube of the above base cuboid? Hint: only
3 cells Cube Shell
Precompute only the cuboids involving a small # of dimensions, e.g., 3
More dimension combinations will need to be computed on the fly
For (A1, A2, … A10), how many combinations to compute?
April 21, 2023 12
Outline
Types of cells
Types of Cubes
Efficient Computation of Data Cubes
Multiway Array Aggregation
BUC
Star Cubing
April 21, 2023 13
- Efficient Computation of Data Cubes
Preliminary cube computation tricks
Computing full/iceberg cubes: 2 methodologies
Top-Down: Multi-Way array aggregation
Bottom-Up: Bottom-up computation: BUC Star-Cubing: Integrates top-down and bottom-up
April 21, 2023 14
-- Preliminary Cube Computation Tricks
Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples. (ROLAP)
Aggregates may be computed from previously computed aggregates, rather than from the base fact table
Cache-results: accumulating results of already computed cuboid to reduce disk I/Os. Higher-level aggregates are computed from lower-level aggregates rather than base facts.
Smallest-child: computing a cuboid from the smallest, previously computed cuboid. Cbranch C{ branch, year}, C{branch, item}
Amortize-scans: computing as many as possible cuboids at the same time to amortize disk reads
Share-sorts: sharing sorting costs cross multiple cuboids when sort-based method is used
Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
April 21, 2023 15
-- Multi-Way Array Aggregation …
Used for MOLAP and full cube computation
Array-based “bottom-up” algorithm
Using multi-dimensional chunks
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Cannot do Apriori pruning: No iceberg optimization
a ll
A B
A B
A B C
A C B C
C
April 21, 2023 16
… -- Multi-way Array Aggregation …
Partition arrays into chunks (a small subcube which fits in memory). Compressed sparse array addressing: (chunk_id, offset) Compute aggregates in “multiway” by visiting cube cells in the
order which minimizes the # of times to visit each cell, and reduces memory access and storage cost.
What is the best traversing order to do multi-way aggregation?
A
B
29 30 31 32
1 2 3 4
5
9
13 14 15 16
6463626148474645
a1a0
c3c2
c1c 0
b3
b2
b1
b0
a2 a3
C
B
4428 56
4024 52
3620
60
April 21, 2023 17
… -- Multi-way Array Aggregation …
A
B
29 30 31 32
1 2 3 4
5
9
13 14 15 16
6463626148474645
a1a0
c3c2
c1c 0
b3
b2
b1
b0
a2 a3
C
4428 56
4024 52
3620
60
B
April 21, 2023 18
… -- Multi-way Array Aggregation …
A
B
29 30 31 32
1 2 3 4
5
9
13 14 15 16
6463626148474645
a1a0
c3c2
c1c 0
b3
b2
b1
b0
a2 a3
C
4428 56
4024 52
3620
60
B
AB requires longest scan, i.e scanning of 49th chunk
April 21, 2023 19
… -- Multi-way Array Aggregation …
Assume the sizes of dimension, A, B, and C are 40, 400, 4000 respectively.
Therefore AB is the smallest and AC is the largest 2-D planes
If chunks are scanned as 1, 2, 3, … then 156,000 memory units are needed (40*400+40*1000+100*1000)
If chunks are scanned as 1, 17, 33, 49, 5, 21,37 …then 1,641,000 memory units are needed (aggregation ordering AB-AC-BC). Chunk memory units needed are (400*4000+40*1000+10*10*100)
April 21, 2023 20
… -- Multi-way Array Aggregation …
a ll
A B
A B
A B C
A C B C
C
ABC
All
A B C
BCACAB
Needs 156,000Memory units
Needs 1,641,000 Memory units
April 21, 2023 21
… -- Multi-way Array Aggregation
Method: the planes should be sorted and computed according to their size in ascending order
Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
Limitation of the method: computing well only for a small number of dimensions
If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored
April 21, 2023 22
-- Bottom-Up Computation (BUC) …
Bottom-up cube computation
(Note: top-down in our view!)
Divides dimensions into partitions and facilitates iceberg pruning
If a partition does not satisfy min_sup, its descendants can be pruned
If minsup = 1 compute full CUBE!
No simultaneous aggregation
a ll
A B C
A C B C
A B C A B D A C D B C D
A D B D C D
D
A B C D
A B
1 a ll
2 A 1 0 B 1 4 C
7 A C 1 1 B C
4 A B C 6 A B D 8 A C D 1 2 B C D
9 A D 1 3 B D 1 5 C D
1 6 D
5 A B C D
3 A B
2323
BUC: Partitioning
Usually, entire data set can’t fit in main memory
Sort distinct values partition into blocks that fit
Continue processing Optimizations
Partitioning External Sorting, Hashing, Counting
Sort Ordering dimensions to encourage
pruning Cardinality, Skew, Correlation Higher the cardinality-smaller the
partitions-greater pruning opportunity Collapsing duplicates
Can’t do holistic aggregates anymore!
Ideally the dimension with most discriminative, higher cardinality and having less skew is processed first.
April 21, 2023 24
--- BUC: Example (Having count(*) > 5) …
5 1 0 1
1 0 8 9
1 1 1 8
5 1 0 1
1 0 8 9
1 1 1 8
Q1 Q2 Q3 Q4
I1
I2
I3
New York
Toronto
3 1 1 2
8 1 2 1
2 1 11 8
New-York
Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4
I1
I2
I3
Toronto
1
1
8
12
3
I1
I2
I3
April 21, 2023 25
… --- BUC: Example (Having count(*) > 5)
77
20 5 23 29
All
B,P,Q
P,B Q,B Q,P
B P Q
1
2
3
7
46
5
8 2 1 3
9 1 10 10
3 2 12 16
Q1 Q2 Q3 Q4
Q1 Q2 Q3 Q4
All
Q
Q,P
I1
I2
I3
Till Now
Aggregates simultaneously on multiple dimensions.
Multiple cuboids can be computed simultaneously in one pass. Dynamic structure with simultaneous aggregation.
Facilitates a-priori pruning.
During partitioning, each partition’s count is compared with min sup. The recursion stops if the count does not satisfy min sup.
April 21, 2023 26
Summary
Data Cube Materialization Full Materialization Partial Materialization: iceberg cubes, shell fragments
Data Cube Computation Methods Multiway array aggregation BUC for computing iceberg cubes
Next Class Star Cubing Shell Fragments for Fast High-Dimensional OLAP Exploration and Discovery in Multidimensional
Databases
April 21, 2023 27