prefixcube: prefix-sharing condensed data cube

PrefixCube: Prefix-sharing Condensed Data Cube

Jianlin Feng, Qiong Fang, Hulin Ding

Huazhong Univ. of Sci. & Tech.

[email protected]

Nov 12, 2004

DOLAP 2004, Jianlin Feng

Outline

Introduction
Related Work
ODM: Ordered Datacube Model
BST-Condensed Cube
Prefix-sharing Condensed Cube
Comparisons
Conclusions


Introduction

Data Cube (ICDE’96)
– N-dimensional cube(A1, A2, …, AN)
– 2^N cuboids, i.e. GROUP-BYs
The Huge Size Problem
– When the source relation R is sparse, the size of a cuboid is possibly close to the size of R.
– The I/O cost of just storing the cube result tuples becomes dominant.
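To make the 2^N count concrete, the sketch below enumerates every GROUP-BY of a three-dimensional cube. The function name `cuboids` and the dimension names are illustrative only and do not come from the slides.

```python
from itertools import combinations

def cuboids(dims):
    """Return every subset of dims, from the grand total () to the base cuboid."""
    return [subset
            for k in range(len(dims) + 1)
            for subset in combinations(dims, k)]

dims = ("A", "B", "C")
all_cuboids = cuboids(dims)
print(all_cuboids)
print(len(all_cuboids))  # 2**3 = 8
```

Each extra dimension doubles the number of cuboids, which is why storage cost grows so quickly on sparse data.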


Related Work

Condensed Cube (ICDE’02)
Dwarf (SIGMOD’02)
Quotient Cube (VLDB’02)
QC-Tree (SIGMOD’03)
Basic idea: remove redundancies existing among cube tuples.
– prefix redundancy
– suffix redundancy


Prefix redundancy

Given an example cube(A, B, C)
– Each value of dimension A occurs in 4 cuboids: cuboid(A), (AB), (AC) and (ABC)
– Possibly many times in each cuboid except cuboid(A)
Inter-cuboid and Intra-cuboid prefix redundancy
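The counts above can be checked on a small relation. This is a minimal sketch: the three (A, B, C) rows are hypothetical, and the script simply counts how often the value A = 1 appears as a prefix in each cuboid that contains A.

```python
from itertools import combinations

rows = [(1, 2, 3), (1, 8, 1), (8, 1, 1)]   # hypothetical (A, B, C) rows
dims = "ABC"

occurrences = {}   # cuboid name -> number of its tuples that start with A = 1
for k in range(1, len(dims) + 1):
    for idx in combinations(range(len(dims)), k):
        if 0 not in idx:          # this cuboid does not contain dimension A
            continue
        groups = {tuple(r[i] for i in idx) for r in rows}
        name = "".join(dims[i] for i in idx)
        occurrences[name] = sum(1 for g in groups if g[0] == 1)

print(occurrences)  # {'A': 1, 'AB': 2, 'AC': 2, 'ABC': 2}
```

The value appears in four cuboids (inter-cuboid redundancy) and more than once inside AB, AC and ABC (intra-cuboid redundancy).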


Suffix Redundancy

Occurs when cube tuples belonging to different cuboids are actually aggregated from the same group of base relation tuples.
An extreme case
– Let the source relation R have only one single tuple r(a1, a2, …, an, m);
– 2^n cube tuples can be condensed into one physical tuple: (a1, a2, …, an, V), where V = aggr(r);
– together with some information indicating that it is a representative tuple.
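The extreme case can be reproduced with a naive SUM cube. This is only a sketch: `full_cube` is an illustrative helper, the tuple values are hypothetical, and "*" stands for ALL.

```python
from itertools import product

def full_cube(rows, n):
    """Naive SUM cube over rows of (a1, ..., an, m); '*' plays the role of ALL."""
    cube = {}
    for *key, m in rows:
        for mask in product((False, True), repeat=n):
            cell = tuple(v if keep else "*" for v, keep in zip(key, mask))
            cube[cell] = cube.get(cell, 0) + m
    return cube

r = (4, 7, 2, 10)              # hypothetical single tuple r(a1, a2, a3, m)
cube = full_cube([r], n=3)
print(len(cube))               # 2**3 = 8 cube tuples
print(set(cube.values()))      # {10}: every one carries the same aggregate
```

All eight cube tuples are derived from the same base tuple, so storing r once plus a marker is enough to restore them.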


Thinking…

Condensed cube
– It condenses the cube tuples aggregated from one single base tuple into one physical tuple in order to reduce the cube’s size.
Dwarf
– Besides suffix coalescing, i.e. multi-base-tuple condensing, it also realizes full prefix-sharing so as to reduce the cube’s size more effectively.


Motivation

HOW to further reduce the condensed cube’s size while taking into account the characteristics of the queries we intend to answer, i.e. range queries?
Augmenting BST-condensing with the removal of intra-cuboid prefix redundancy!


Ordered Datacube Model

Value ALL (or *) is encoded as 0.
A dimension D and its cardinality C
– each dimension value is mapped one-to-one to an integer between 1 and C inclusive.
N dimensions form an N-dimensional space.
The origin O(0, 0, …, 0) represents the grand total.
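A minimal sketch of the encoding follows. The slides only require a one-to-one mapping onto 1..C; assigning codes in sorted order is an assumption made here, and the dimension values and helper names are hypothetical.

```python
ALL = 0  # ODM encodes ALL (or *) as 0

def build_encoder(values):
    """Map each distinct dimension value to an integer in 1..C (C = cardinality)."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

city = build_encoder(["Wuhan", "Beijing", "Shanghai", "Wuhan"])  # hypothetical dimension
year = build_encoder([2003, 2004])                               # hypothetical dimension

def encode(cell, encoders):
    """Encode one cube cell; '*' becomes ALL."""
    return tuple(ALL if v == "*" else enc[v] for v, enc in zip(cell, encoders))

print(encode(("Wuhan", 2004), (city, year)))  # (3, 2)
print(encode(("Wuhan", "*"), (city, year)))   # (3, 0)
print(encode(("*", "*"), (city, year)))       # (0, 0): the origin, i.e. the grand total
```

With this encoding every cube tuple becomes a point in the N-dimensional integer space, and points with a 0 coordinate lie in the coarser cuboids.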


Ordered Datacube Model

Under ODM, a range query against a data cube can actually be reduced to a sub-query against only one particular cuboid in the cube or a union of such sub-queries.
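One plausible reading of this reduction is sketched below: a range that includes 0 on some dimension is split into its ALL slice and its slice over real values, so that each resulting sub-query addresses exactly one cuboid. The function names and the (lo, hi) query representation are assumptions for illustration, not taken from the slides.

```python
from itertools import product

def split_by_cuboid(ranges):
    """Split an ODM range query into sub-queries that each touch one cuboid.

    ranges: one inclusive (lo, hi) pair per dimension, lo <= hi, with 0 meaning ALL.
    """
    per_dim = []
    for lo, hi in ranges:
        parts = []
        if lo == 0:
            parts.append((0, 0))            # the ALL slice of this dimension
        if hi >= 1:
            parts.append((max(lo, 1), hi))  # the slice over real values
        per_dim.append(parts)
    return list(product(*per_dim))

def cuboid_of(sub_query, dims="ABC"):
    """Name the cuboid a sub-query addresses: the dimensions not fixed to ALL."""
    return "".join(d for d, (lo, hi) in zip(dims, sub_query) if hi >= 1) or "ALL"

query = [(0, 2), (0, 0), (1, 1)]   # A in 0..2, B = ALL, C = 1
for sub in split_by_cuboid(query):
    print(cuboid_of(sub), sub)
# C ((0, 0), (0, 0), (1, 1))
# AC ((1, 2), (0, 0), (1, 1))
```

A query whose ranges never include 0 yields a single sub-query, which matches the "only one particular cuboid" case.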


BST-Condensed Cube

Base Single Tuple (BST)
– t1 is a BST on SD {A} and {B}
– t2 is a BST on SD {B}
A unique minimal BST-Condensed Cube, MinCube, is obtained when full advantage is taken of each BST with all of its SDs.

     A  B  C  M
t1   8  1  1  100
t2   1  8  1  50
t3   1  2  3  60
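The BST claims can be checked mechanically. The sketch assumes the usual definition from the Condensed Cube work, namely that a tuple is a BST on SD when it is the only tuple in its partition of R on SD, and it only tests single-dimension SDs. The helper name is illustrative.

```python
from collections import Counter

R = {"t1": (8, 1, 1, 100), "t2": (1, 8, 1, 50), "t3": (1, 2, 3, 60)}
DIMS = "ABC"

def single_dimension_bsts(relation):
    """For each single dimension, list the tuples that are alone in their partition."""
    result = {}
    for i, d in enumerate(DIMS):
        counts = Counter(row[i] for row in relation.values())
        result[d] = [name for name, row in relation.items() if counts[row[i]] == 1]
    return result

print(single_dimension_bsts(R))
# {'A': ['t1'], 'B': ['t1', 't2', 't3'], 'C': ['t3']}
```

This agrees with the two examples on the slide (t1 on {A} and {B}, t2 on {B}); under the assumed definition t3 also qualifies on {B} and {C}.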


BU-BST Condensed Cube

BottomUpBST algorithm (ICDE’02)
Each BST corresponds to only one SD.
Compared with MinCube, it is easier to compute and easier to restore normal cube tuples from.
Note: BST Condensing is a special kind of Prefix-sharing!

 A  B  C  M
 8  *  *  10
 8  1  *  10
 8  *  1  10
 8  1  1  10

      A  B  C  M   SD
ct7   8  1  1  10  {A}

A group of cube tuples sharing a prefix is represented by one BST!
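Restoring the normal cube tuples from one condensed tuple can be sketched as follows. The rule used here is an assumption that matches the four rows shown above: the BST stands for every cuboid that contains its SD and uses only its non-* dimensions. The function name `expand` is illustrative.

```python
from itertools import combinations

DIMS = "ABC"

def expand(values, measure, sd):
    """Restore the cube tuples represented by one condensed BST tuple.

    Assumption: the BST covers every cuboid S with sd <= S <= non-* dimensions.
    """
    optional = [d for d, v in zip(DIMS, values) if v != "*" and d not in sd]
    restored = []
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            keep = set(sd) | set(extra)
            cell = tuple(v if d in keep else "*" for d, v in zip(DIMS, values))
            restored.append(cell + (measure,))
    return restored

for t in expand((8, 1, 1), 10, sd="A"):
    print(t)
# (8, '*', '*', 10)
# (8, 1, '*', 10)
# (8, '*', 1, 10)
# (8, 1, 1, 10)
```

One stored tuple plus its SD therefore replaces four cube tuples that share the prefix A = 8.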


A BU-BST Condensed Cube Example

     A  B  C  M
t1   8  1  1  100
t2   1  8  1  50
t3   1  2  3  60

       A  B  C  M    SID  CID
ct1    *  *  *  210       ALL
ct2    1  *  *  110       A
ct3    1  2  3  60   AB
ct4    1  8  1  50   AB
ct5    1  *  1  50        AC
ct6    1  *  3  60        AC
ct7    8  1  1  100  A
ct8    *  1  1  100  B
ct9    *  2  3  60   B
ct10   *  8  1  50   B
ct11   *  *  1  150       C
ct12   *  *  3  60        C

Note:
– Intra-cuboid prefix redundancy: ct3 and ct4
– Inter-cuboid prefix redundancy: ct2, ct3 and ct5
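As a sanity check on the example, the sketch below restores the twelve condensed tuples ct1 to ct12 and compares the result with a naively computed SUM cube of t1 to t3. It reads the last column as an SID for ct3, ct4 and ct7 to ct10 and as a CID for the rest, and it assumes a BST covers every cuboid between its SID and its non-* dimensions; both readings are assumptions, and the helper names are illustrative.

```python
from itertools import combinations, product

DIMS = "ABC"
R = [(8, 1, 1, 100), (1, 8, 1, 50), (1, 2, 3, 60)]

def full_cube(rows):
    """Naive SUM cube; '*' plays the role of ALL."""
    cube = {}
    for a, b, c, m in rows:
        for mask in product((False, True), repeat=3):
            cell = tuple(v if keep else "*" for v, keep in zip((a, b, c), mask))
            cube[cell] = cube.get(cell, 0) + m
    return cube

# (cell, M, SID) for ct1..ct12; SID is None for normal (CID) tuples.
CONDENSED = [
    (("*", "*", "*"), 210, None), ((1, "*", "*"), 110, None),
    ((1, 2, 3), 60, "AB"), ((1, 8, 1), 50, "AB"),
    ((1, "*", 1), 50, None), ((1, "*", 3), 60, None),
    ((8, 1, 1), 100, "A"),
    (("*", 1, 1), 100, "B"), (("*", 2, 3), 60, "B"), (("*", 8, 1), 50, "B"),
    (("*", "*", 1), 150, None), (("*", "*", 3), 60, None),
]

def restore(condensed):
    """Expand every BST over the cuboids it covers; copy normal tuples as they are."""
    cube = {}
    for cell, m, sid in condensed:
        if sid is None:
            cube[cell] = m
            continue
        optional = [d for d, v in zip(DIMS, cell) if v != "*" and d not in sid]
        for k in range(len(optional) + 1):
            for extra in combinations(optional, k):
                keep = set(sid) | set(extra)
                cube[tuple(v if d in keep else "*" for d, v in zip(DIMS, cell))] = m
    return cube

assert restore(CONDENSED) == full_cube(R)
print(len(CONDENSED), "stored tuples restore", len(full_cube(R)), "cube tuples")
# 12 stored tuples restore 20 cube tuples
```

Under these assumptions the condensed cube stores 12 tuples in place of the 20 tuples of the full cube.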


Prefix-sharing Condensed Cube - PrefixCube

BST Condensing + Intra-cuboid prefix-sharing
Prefix-sharing
PrefixCube


A PrefixCube Example

[Figure: the PrefixCube for the example relation. Stored tuples are grouped into trees labeled SID = A, SID = AB, SID = B and CID = ALL, CID = A, CID = AC; the diagram also labels V-Roots and N-Roots.]
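Intra-cuboid prefix-sharing can be illustrated with a small prefix tree over the tuples of one group. This is a sketch only: the nested-dictionary layout and the helper names are assumptions chosen for brevity and say nothing about the physical structure used by PrefixCube.

```python
def build_prefix_tree(tuples):
    """Insert (values, measure) pairs into a nested-dict trie; equal prefixes are stored once."""
    root = {}
    for values, measure in tuples:
        node = root
        for v in values[:-1]:
            node = node.setdefault(v, {})
        node[values[-1]] = measure
    return root

def count_cells(node):
    """Count stored dimension values (one per trie edge)."""
    return sum(1 + (count_cells(child) if isinstance(child, dict) else 0)
               for child in node.values())

sid_ab = [((1, 2, 3), 60), ((1, 8, 1), 50)]   # ct3 and ct4, the SID = AB group
tree = build_prefix_tree(sid_ab)
print(tree)               # {1: {2: {3: 60}, 8: {1: 50}}}
print(count_cells(tree))  # 5 dimension values stored instead of 6
```

The shared prefix A = 1 of ct3 and ct4 is stored once, which is the redundancy that BST condensing alone leaves in place.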


Corresponding Dwarf

[Figure: the Dwarf for the same example relation, with levels labeled A Dimension, B Dimension and C Dimension, and nodes labeled (node1) to (node4).]


PrefixCube vs. Dwarf

                             PrefixCube       Dwarf
Prefix-sharing               Intra-cuboid     Inter- and Intra-cuboid
Suffix Coalescing            BST Condensing   Multi-tuple Condensing
Compression Ratio            Lower            Higher
Saving extra value ALL?      No               Yes
Tuple clustered by cuboid?   Yes              No

PrefixCube does not blindly pursue the best compression ratio; it is intended to strike a good compromise among cube size reduction ratio, restoring and updating costs, and query characteristics!


Effectiveness of Size Reduction

Datasets
– synthetic datasets with uniform distribution
– # of tuples: 1,000,000

[Figure: Size Ratio (0% to 100%) vs. Number of Dimensions (2 to 9) for BU-BST and PrefixCube. Panels: (a) Cardinality = 100, (b) Cardinality = 1000.]


Effectiveness of Size Reduction

PrefixBUC
– Full Cube (computed by BUC)
– Prefix-sharing

[Figure: Size Ratio (0% to 100%) over an axis running from 2 to 9, with series C=100 and C=1000.]


Impact of Data Density

Datasets
– Uniform distribution
– # of dimensions: 6
– Cardinality of dimensions: 100
– # of tuples: range from 1,000 to 1,000,000

[Figure: Size Ratio (0% to 100%) vs. Number of Tuples (1.E+03 to 1.E+06) for BU-BST, PrefixCube and PrefixBUC.]


Impact of Data Skewness

Datasets
– Zipf distribution
– # of tuples: 1,000,000
– Cardinality of dimensions: ranging from 1,000 down to 500 in steps of 100
– Zipf factor: ranging from 0 to 0.8 in steps of 0.2

[Figure: Size Ratio (0% to 100%) vs. Zipf Factors (0 to 0.8).]



Real-world Dataset

Datasets
– Weather Datasets
– # of tuples: 1,015,367

[Figure: two panels over an axis running from 2 to 9. First panel: Time (sec.), 0 to 700, for BUC, BU-BST and PrefixCube. Second panel: Size Ratio, 0% to 100%.]



Conclusion

A new cube structure, PrefixCube, was proposed by augmenting BU-BST condensing with intra-cuboid prefix-sharing.
– It can greatly reduce the data cube’s size compared with the BU-BST condensed cube.
– It can also reduce the impact of data skew on BU-BST condensing.
– It achieves a quite stable size reduction on both dense and sparse datasets.


The End

Thank you!

Any questions?
