prefixcube: prefix-sharing condensed data cube

PrefixCube: Prefix-sharing Condensed Data Cube

Jianlin Feng, Qiong Fang, Hulin Ding
Huazhong Univ. of Sci. & Tech.

fengjl@mail.hust.edu.cn

Nov 12, 2004

DOLAP 2004 2 Jianlin Feng

Outline
Introduction
Related Work
ODM: Ordered Datacube Model
BST-Condensed Cube
Prefix-sharing Condensed Cube
Comparisons
Conclusions

Introduction
Data Cube (ICDE’96)
– N-dimensional cube(A1, A2, …, AN)
– 2^N cuboids, i.e. GROUP-BYs
The Huge Size Problem
– When the source relation R is sparse, the size of a cuboid is possibly close to the size of R.
– The I/O cost of merely storing the cube result tuples becomes dominant.
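To make the 2^N count concrete, the short sketch below enumerates every GROUP-BY of a three-dimensional cube. The function name `cuboids` and the dimension names are illustrative only and do not come from the slides.

```python
from itertools import combinations

def cuboids(dims):
    """Return every subset of dims, i.e. every GROUP-BY of the cube."""
    return [combo
            for k in range(len(dims) + 1)
            for combo in combinations(dims, k)]

all_cuboids = cuboids(("A", "B", "C"))
# The empty tuple () is the grand total (GROUP BY nothing).
print(len(all_cuboids))  # 2**3 cuboids
```

Each extra dimension doubles the number of cuboids, which is why storing all result tuples quickly dominates the cost on sparse data.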

Related Work
Condensed Cube (ICDE’02)
Dwarf (SIGMOD’02)
Quotient Cube (VLDB’02)
QC-Tree (SIGMOD’03)
Basic idea: remove redundancies existing among cube tuples.
– prefix redundancy
– suffix redundancy

Prefix redundancy
Given an example cube(A, B, C)
– Each value of dimension A occurs in 4 cuboids: cuboid(A), (AB), (AC) and (ABC)
– Possibly many times in each cuboid except cuboid(A)
Inter-cuboid and Intra-cuboid prefix redundancy
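As a rough illustration of both kinds of prefix redundancy, the sketch below counts how often one value of A appears in each cuboid that contains A. The three rows and the helper `occurrences` are illustrative assumptions, not part of the slides.

```python
from itertools import combinations

DIMS = ("A", "B", "C")
ROWS = [(8, 1, 1), (1, 8, 1), (1, 2, 3)]  # illustrative (A, B, C) values

def occurrences(value):
    """Count, per cuboid containing A, the cube tuples whose A equals value."""
    counts = {}
    rest = DIMS[1:]
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            cuboid = ("A",) + extra
            idx = [DIMS.index(d) for d in cuboid]
            # Distinct groups of this cuboid (the GROUP-BY keys).
            groups = {tuple(row[i] for i in idx) for row in ROWS}
            counts[cuboid] = sum(1 for g in groups if g[0] == value)
    return counts

counts = occurrences(1)
```

Here the value A = 1 appears in four cuboids (inter-cuboid redundancy) and twice inside each of AB, AC and ABC (intra-cuboid redundancy), but only once in cuboid A.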

Suffix Redundancy
Occurs when cube tuples belonging to different cuboids are actually aggregated from the same group of base relation tuples.
An extreme case
– Let the source relation R have only one single tuple r(a1, a2, …, an, m);
– the 2^n cube tuples can be condensed into one physical tuple: (a1, a2, …, an, V), where V = aggr(r);
– together with some information indicating that it is a representative tuple.
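The extreme case can be checked with a small sketch. It assumes SUM as the aggregate and uses `"*"` for ALL; the function `full_cube` and the sample tuple are hypothetical.

```python
from itertools import product

def full_cube(rows, n):
    """SUM cube over rows shaped (a1, ..., an, m); '*' stands for ALL."""
    cube = {}
    for *dims, m in rows:
        # Each dimension is either kept or replaced by ALL: 2**n combinations.
        for mask in product((False, True), repeat=n):
            key = tuple(d if keep else "*" for d, keep in zip(dims, mask))
            cube[key] = cube.get(key, 0) + m
    return cube

cube = full_cube([("a1", "a2", "a3", 7)], n=3)
print(len(cube), set(cube.values()))
```

All 2^3 cube tuples carry the same aggregate, so one physical tuple plus a marker is enough to restore them.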

Thinking…
Condensed cube
– It condenses the cube tuples that are aggregated from one single base tuple into one physical tuple in order to reduce the cube’s size.
Dwarf
– Besides suffix coalescing, i.e. multi-base-tuple condensing, it also realizes full prefix-sharing to achieve a high cube size reduction.

Motivation
How can we further reduce the condensed cube’s size while taking into account the characteristics of the queries we intend to answer, namely range queries?
By augmenting BST condensing with the removal of intra-cuboid prefix redundancy!

Ordered Datacube Model
Value ALL (or *) is encoded as 0.
A dimension D and its cardinality C
– each dimension value is mapped one-to-one to an integer between 1 and C inclusive.
N dimensions form an N-dimensional space.
The origin O(0, 0, …, 0) represents the grand total.
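A minimal sketch of this encoding follows. The slides only require a one-to-one mapping onto 1..C; ordering the values alphabetically, the helper `build_encoder` and the city names are assumptions made for the example.

```python
ALL = 0  # the value ALL (or *) is encoded as 0

def build_encoder(values):
    """Map each distinct dimension value to 1..C (sorted order is an assumption)."""
    return {v: code for code, v in enumerate(sorted(set(values)), start=1)}

city = build_encoder(["Wuhan", "Beijing", "Shanghai", "Wuhan"])
point = (city["Wuhan"], ALL)  # the cube tuple (Wuhan, *) as a point in 2-D space
origin = (ALL, ALL)           # the grand total
```

With every value an integer and ALL at 0, each cube tuple becomes a point and each cuboid a region of the N-dimensional space.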

Ordered Datacube Model

Under ODM, a range query against a data cube can actually be reduced to a sub-query against only one particular cuboid in the cube or a union of such sub-queries.
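One way to read this reduction is sketched below; it is an interpretation, not the authors' algorithm. Each dimension carries a code range [lo, hi] with 0 <= lo <= hi. A range that contains 0 yields an ALL branch, in which the dimension is absent from the target cuboid, in addition to its branch over real values. The function `split_range_query` is a hypothetical name.

```python
from itertools import product

def split_range_query(ranges, dims):
    """Split per-dimension [lo, hi] code ranges into one sub-query per cuboid."""
    options = []
    for lo, hi in ranges:
        branches = []
        if lo == 0:
            branches.append(None)              # dimension aggregated away (ALL)
        if hi >= 1:
            branches.append((max(lo, 1), hi))  # dimension kept, with a value range
        options.append(branches)
    subqueries = []
    for choice in product(*options):
        cuboid = tuple(d for d, r in zip(dims, choice) if r is not None)
        bounds = {d: r for d, r in zip(dims, choice) if r is not None}
        subqueries.append((cuboid, bounds))
    return subqueries

single = split_range_query([(1, 3), (0, 0), (2, 2)], ("A", "B", "C"))
union = split_range_query([(0, 2), (0, 0), (1, 1)], ("A", "B", "C"))
```

The first query touches only cuboid AC; the second becomes a union of one sub-query on cuboid C and one on cuboid AC.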

BST-Condensed Cube
Base Single Tuple (BST)
– t1 is a BST on SD {A} and {B}
– t2 is a BST on SD {B}
A unique minimal BST-condensed cube, MinCube, is obtained when each BST is fully exploited with all of its SDs.

     A  B  C  M
t1   8  1  1  100
t2   1  8  1  50
t3   1  2  3  60
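The BST claims for the example relation can be checked with a short sketch. It assumes the usual reading of the definition (a tuple is a BST on a dimension set SD when it is the only tuple in its partition on SD) and, like the slide, looks only at single-dimension SDs. The helper `single_dim_sds` is an illustrative name.

```python
from collections import Counter

DIMS = ("A", "B", "C")
R = {"t1": (8, 1, 1, 100), "t2": (1, 8, 1, 50), "t3": (1, 2, 3, 60)}

def single_dim_sds(name):
    """Dimensions on which tuple `name` is alone in its partition."""
    result = []
    for i, dim in enumerate(DIMS):
        partition_sizes = Counter(row[i] for row in R.values())
        if partition_sizes[R[name][i]] == 1:
            result.append(dim)
    return result

print(single_dim_sds("t1"), single_dim_sds("t2"))
```

The result agrees with the slide: t1 is alone in its partition on A and on B, while t2 shares A = 1 and C = 1 with other tuples and is alone only on B.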

BU-BST Condensed Cube
BottomUpBST algorithms (ICDE’02)
Each BST corresponds to only one SD.
Compared with MinCube, it is easier to compute and easier to restore normal cube tuples from.
Note: BST condensing is a special kind of prefix-sharing!

A  B  C  M
8  *  *  10
8  1  *  10
8  *  1  10
8  1  1  10

      A  B  C  M   SD
ct7   8  1  1  10  {A}

A group of cube tuples sharing a prefix is represented by one BST!
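Restoring the normal cube tuples from a BST can be sketched as follows, assuming an aggregate such as SUM for which the aggregate of a single tuple is its own measure. The function `expand_bst` is a hypothetical helper, and SD is given as a set of dimension positions.

```python
from itertools import product

def expand_bst(dim_values, sd, measure):
    """List the cube tuples a BST stands for.

    Dimensions in SD keep their value; every other dimension is either
    ALL ('*') or its own value.
    """
    choices = [(v,) if i in sd else ("*", v) for i, v in enumerate(dim_values)]
    return [combo + (measure,) for combo in product(*choices)]

restored = expand_bst((8, 1, 1), sd={0}, measure=10)  # ct7 with SD = {A}
for t in restored:
    print(t)
```

The four restored tuples all share the prefix A = 8, which is the sense in which BST condensing is a special kind of prefix-sharing.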

A BU-BST Condensed Cube Example

     A  B  C  M
t1   8  1  1  100
t2   1  8  1  50
t3   1  2  3  60

       A  B  C  M    SID  CID
ct1    *  *  *  210  -    ALL
ct2    1  *  *  110  -    A
ct3    1  2  3  60   AB   -
ct4    1  8  1  50   AB   -
ct5    1  *  1  50   -    AC
ct6    1  *  3  60   -    AC
ct7    8  1  1  100  A    -
ct8    *  1  1  100  B    -
ct9    *  2  3  60   B    -
ct10   *  8  1  50   B    -
ct11   *  *  1  150  -    C
ct12   *  *  3  60   -    C

Note:
Intra-cuboid prefix redundancy: ct3 and ct4
Inter-cuboid prefix redundancy: ct2, ct3 and ct5
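As a sanity check on the example, the sketch below expands the 12 condensed tuples and compares the result with the full SUM cube of the three base tuples. It assumes that ct3, ct4 and ct7 to ct10 are BSTs carrying an SID while the other tuples are normal cube tuples carrying a CID; the helper names are illustrative.

```python
from itertools import product

R = [(8, 1, 1, 100), (1, 8, 1, 50), (1, 2, 3, 60)]
# (A, B, C, M, SD) per condensed tuple; SD is None for a normal cube tuple.
CONDENSED = [
    ("*", "*", "*", 210, None), (1, "*", "*", 110, None),
    (1, 2, 3, 60, "AB"), (1, 8, 1, 50, "AB"),
    (1, "*", 1, 50, None), (1, "*", 3, 60, None),
    (8, 1, 1, 100, "A"), ("*", 1, 1, 100, "B"),
    ("*", 2, 3, 60, "B"), ("*", 8, 1, 50, "B"),
    ("*", "*", 1, 150, None), ("*", "*", 3, 60, None),
]

def full_cube(rows):
    """SUM cube; every dimension is either its value or '*' (ALL)."""
    cube = {}
    for a, b, c, m in rows:
        for key in product((a, "*"), (b, "*"), (c, "*")):
            cube[key] = cube.get(key, 0) + m
    return cube

def expand(a, b, c, m, sd):
    """Cube tuples represented by one condensed tuple."""
    if sd is None:
        return {(a, b, c): m}
    # '*' and SD dimensions stay fixed; the others may also collapse to '*'.
    choices = [(v,) if v == "*" or d in sd else (v, "*")
               for d, v in zip("ABC", (a, b, c))]
    return {key: m for key in product(*choices)}

restored = {}
for ct in CONDENSED:
    restored.update(expand(*ct))
print(len(CONDENSED), len(restored), restored == full_cube(R))
```

Under these assumptions the 12 stored tuples restore all 20 tuples of the full cube.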

Prefix-sharing Condensed Cube - PrefixCube
[Diagram labels: BST Condensing + Intra-cuboid prefix-sharing; Prefix-sharing; PrefixCube]
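Intra-cuboid prefix-sharing can be pictured as storing the tuples of one group in a prefix tree. The sketch below is only an illustration with nested dictionaries, not the physical layout used by PrefixCube; `build_trie` and `count_value_nodes` are hypothetical names, and the two tuples are the SID = AB tuples of the BU-BST example.

```python
def build_trie(tuples):
    """Nested dicts keyed by dimension value; the key 'M' holds the measure."""
    root = {}
    for *dims, m in tuples:
        node = root
        for v in dims:
            node = node.setdefault(v, {})  # reuse the node of a shared prefix
        node["M"] = m
    return root

def count_value_nodes(node):
    """Number of stored dimension values in the trie."""
    return sum(1 + count_value_nodes(child)
               for key, child in node.items() if key != "M")

group_ab = [(1, 2, 3, 60), (1, 8, 1, 50)]
trie = build_trie(group_ab)
print(count_value_nodes(trie), "values instead of", 3 * len(group_ab))
```

The shared prefix A = 1 is stored once, so the group needs 5 dimension values instead of 6; the saving grows with the number of tuples that share a prefix.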

A PrefixCube Example
[Figure: PrefixCube for the example relation. Labels: V-Roots, N-Roots; SID = A, SID = AB, SID = B; CID = ALL, CID = A, CID = AC]

Corresponding Dwarf
[Figure: the Dwarf for the same relation, with levels for the A, B and C dimensions and nodes labelled node1 to node4]

PrefixCube vs. Dwarf

                              PrefixCube       Dwarf
Prefix-sharing                Intra-cuboid     Inter- and Intra-cuboid
Suffix Coalescing             BST Condensing   Multi-tuple Condensing
Compression Ratio             Lower            Higher
Saving extra value ALL?       No               Yes
Tuples clustered by cuboid?   Yes              No

PrefixCube does not blindly pursue the highest compression ratio; it is intended to strike a good compromise among size reduction ratio, restoring and updating costs, and query characteristics!

Effectiveness of Size Reduction
Datasets
– synthetic datasets with uniform distribution
– # of tuples: 1,000,000
[Figure: Size Ratio (0% to 100%) vs. Number of Dimensions (2 to 9) for BU-BST and PrefixCube; panels (a) Cardinality = 100 and (b) Cardinality = 1000]

Effectiveness of Size Reduction
PrefixBUC
– Full Cube (computed by BUC)
– Prefix-sharing
[Figure: Size Ratio (0% to 100%) vs. Number of Dimensions (2 to 9) for C=100 and C=1000]

Impact of Data Density
Datasets
– Uniform distribution
– # of dimensions: 6
– Cardinality of dimensions: 100
– # of tuples: from 1,000 to 1,000,000
[Figure: Size Ratio (0% to 100%) vs. Number of Tuples (1.E+03 to 1.E+06) for BU-BST, PrefixCube and PrefixBUC]

Impact of Data Skewness
Datasets
– Zipf distribution
– # of tuples: 1,000,000
– Cardinality of dimensions: from 1,000 down to 500 in steps of 100
– Zipf factor: from 0 to 0.8 in steps of 0.2
[Figure: Size Ratio (0% to 100%) vs. Zipf Factors (0 to 0.8) for BU-BST, PrefixCube and PrefixBUC]
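The slides do not say how the skewed data were generated. As a minimal sketch, assuming the common convention that the value of rank i is drawn with probability proportional to 1 / i^theta, a dimension column could be sampled as below. The function names, the seed and the sample size are illustrative.

```python
import random

def zipf_weights(cardinality, theta):
    """Unnormalised Zipf weights; theta = 0 gives a uniform distribution."""
    return [1.0 / (rank ** theta) for rank in range(1, cardinality + 1)]

def sample_dimension(cardinality, theta, n, seed=0):
    """Draw n dimension codes in 1..cardinality with Zipf factor theta."""
    rng = random.Random(seed)
    codes = range(1, cardinality + 1)
    return rng.choices(codes, weights=zipf_weights(cardinality, theta), k=n)

column = sample_dimension(cardinality=1000, theta=0.8, n=10_000)
```

A larger Zipf factor concentrates the tuples on a few values, so fewer tuples are alone in their partitions.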

Real-world Dataset
Datasets
– Weather Datasets
– # of tuples: 1,015,367
[Figure: Time (sec., 0 to 700) vs. Number of Dimensions (2 to 9) for BUC, BU-BST and PrefixCube]
[Figure: Size Ratio (0% to 100%) vs. Number of Dimensions (2 to 9) for BU-BST, PrefixCube and PrefixBUC]

Conclusion
A new cube structure, PrefixCube, was proposed by augmenting BU-BST condensing with intra-cuboid prefix-sharing.
– It can greatly reduce the data cube’s size compared with the BU-BST condensed cube.
– It can also reduce the impact of data skew on BU-BST condensing.
– It achieves a quite stable size reduction on both dense and sparse datasets.

The End

Thank you!

Any questions?
