PrefixCube: Prefix-sharing Condensed Data Cube

Jianlin Feng, Qiong Fang, Hulin Ding
Huazhong Univ. of Sci. & Tech.
[email protected]
Nov 12, 2004

Page 1: PrefixCube: Prefix-sharing Condensed Data Cube

PrefixCube: Prefix-sharing Condensed Data Cube

Jianlin Feng Qiong Fang Hulin Ding Huazhong Univ. of Sci. & Tech.

[email protected]

Nov 12, 2004

Page 2: PrefixCube: Prefix-sharing Condensed Data Cube

DOLAP 2004 2 Jianlin Feng

Outline

Introduction
Related Work
ODM: Ordered Datacube Model
BST-Condensed Cube
Prefix-sharing Condensed Cube
Comparisons
Conclusions

Page 3: PrefixCube: Prefix-sharing Condensed Data Cube

Introduction

Data Cube (ICDE’96)
– N-dimensional cube(A1, A2, …, AN)
– 2^N cuboids, i.e. GROUP-BYs

The Huge Size Problem
– When the source relation R is sparse, the size of a cuboid is possibly close to the size of R.
– The I/O cost even for storing the cube result tuples becomes dominant.
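To make the 2^N count concrete, the following minimal sketch enumerates every cuboid (GROUP-BY) of a small cube. The function name `cuboids` and the dimension names are illustrative assumptions, not part of the original slides.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every subset of the dimensions, i.e. every GROUP-BY of the cube."""
    result = []
    for k in range(len(dimensions) + 1):
        result.extend(combinations(dimensions, k))
    return result

# Illustrative 3-dimensional cube(A, B, C)
all_cuboids = cuboids(("A", "B", "C"))
for c in all_cuboids:
    print(c if c else "ALL")
print(len(all_cuboids))
```

The empty subset is the grand total (ALL); each additional dimension doubles the number of cuboids.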

Page 4: PrefixCube: Prefix-sharing Condensed Data Cube

Related Work

Condensed Cube (ICDE’02)
Dwarf (SIGMOD’02)
Quotient Cube (VLDB’02)
QC-Tree (SIGMOD’03)

Basic idea: remove redundancies existing among cube tuples.
– prefix redundancy
– suffix redundancy

Page 5: PrefixCube: Prefix-sharing Condensed Data Cube

Prefix redundancy

Given an example cube(A, B, C):
– Each value of dimension A occurs in 4 cuboids: cuboid(A), (AB), (AC) and (ABC)
– Possibly many times in each cuboid except cuboid(A)

The result is both inter-cuboid and intra-cuboid prefix redundancy.
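As a rough illustration of both kinds of prefix redundancy, the sketch below counts how often one value of dimension A would be stored in a flat, uncompressed full cube. The three (A, B, C) tuples and the helper name `full_cube_cells` are assumptions chosen for illustration only.

```python
from itertools import combinations

def full_cube_cells(rows, n_dims):
    """Set of (cuboid, key) cells of the full cube; measures are ignored here."""
    cells = set()
    for k in range(n_dims + 1):
        for cuboid in combinations(range(n_dims), k):
            for row in rows:
                cells.add((cuboid, tuple(row[i] for i in cuboid)))
    return cells

rows = [(1, 2, 3), (1, 8, 1), (8, 1, 1)]  # hypothetical (A, B, C) tuples
cells = full_cube_cells(rows, 3)

# Count the stored copies of A = 1, per cuboid that contains A (index 0).
counts = {}
for cuboid, key in cells:
    if 0 in cuboid and key[0] == 1:
        counts[cuboid] = counts.get(cuboid, 0) + 1

print(len(counts))           # cuboids in which A = 1 occurs (inter-cuboid)
print(counts)                # copies inside each cuboid (intra-cuboid)
print(sum(counts.values()))  # total stored copies of A = 1
```

The value appears once in cuboid(A) but several times in each of (AB), (AC) and (ABC), which is the situation the slide describes.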

Page 6: PrefixCube: Prefix-sharing Condensed Data Cube

Suffix Redundancy

Occurs when cube tuples belonging to different cuboids are actually aggregated from the same group of base relation tuples.

An extreme case
– Let the source relation R have only one single tuple r(a1, a2, …, an, m);
– 2^n cube tuples can be condensed into one physical tuple: (a1, a2, …, an, V), where V = aggr(r);
– together with some information indicating that it is a representative tuple.
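The extreme case can be checked with a few lines of code. This sketch assumes SUM as the aggregate (so V equals m) and uses made-up values; `cube_of_single_tuple` is a hypothetical helper name.

```python
from itertools import combinations

ALL = "*"

def cube_of_single_tuple(dim_values, measure):
    """Full cube of a relation that holds exactly one tuple (SUM assumed)."""
    n = len(dim_values)
    tuples = []
    for k in range(n + 1):
        for kept in combinations(range(n), k):
            cell = tuple(dim_values[i] if i in kept else ALL for i in range(n))
            tuples.append((cell, measure))
    return tuples

cube = cube_of_single_tuple((7, 3, 9, 4), 100)  # hypothetical r(a1, ..., a4, m)
print(len(cube))                  # number of logical cube tuples
print({m for _, m in cube})       # distinct aggregate values
```

All 2^n logical tuples carry the same aggregate, so one physical tuple plus a marker is enough to represent them.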

Page 7: PrefixCube: Prefix-sharing Condensed Data Cube

Thinking…

Condensed cube
– It condenses those cube tuples that are aggregated from one single base tuple into one physical tuple in order to reduce the cube’s size.

Dwarf
– Besides suffix coalescing, i.e. multi-base-tuple condensing, it also realizes full prefix-sharing so as to achieve a highly effective cube size reduction.

Page 8: PrefixCube: Prefix-sharing Condensed Data Cube

Motivation

How can we further reduce the condensed cube’s size while taking into account the characteristics of the queries we intend to answer, i.e. range queries?

By augmenting BST condensing with the removal of intra-cuboid prefix redundancy!

Page 9: PrefixCube: Prefix-sharing Condensed Data Cube

Ordered Datacube Model

Value ALL (or *) is encoded as 0.

A dimension D and its cardinality C
– each dimension value is mapped one-to-one to an integer value between 1 and C inclusive.

N dimensions form an N-dimensional space.

The origin O(0, 0, …, 0) represents the grand total.
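A minimal sketch of this encoding follows. The dimension values are invented, `None` is used to stand for ALL in the input, and assigning codes in sorted order is an assumption; the slides only require a one-to-one mapping onto 1..C.

```python
ALL = 0  # ODM code for ALL (*)

def build_encoder(values):
    """Map the distinct values of one dimension to 1..C (sorted order assumed)."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

def encode_cell(cell, encoders):
    """Encode a cube tuple; None stands for ALL and becomes 0."""
    return tuple(ALL if v is None else enc[v] for v, enc in zip(cell, encoders))

# Hypothetical dimensions: city and product
cities = build_encoder(["Wuhan", "Beijing", "Wuhan", "Shanghai"])
products = build_encoder(["pen", "ink"])

print(encode_cell(("Wuhan", None), (cities, products)))
print(encode_cell((None, None), (cities, products)))  # the origin, i.e. the grand total
```

With this encoding every cube tuple, from any cuboid, becomes a point in the same N-dimensional integer space.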

Page 10: PrefixCube: Prefix-sharing Condensed Data Cube

Ordered Datacube Model

Under ODM, a range query against a data cube can actually be reduced to a sub-query against only one particular cuboid in the cube or a union of such sub-queries.
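One way to picture this reduction is sketched below. It assumes a range query is given as one (lo, hi) interval per dimension over the ODM codes, and that an interval starting at 0 mixes the ALL value with ordinary values and is therefore split. The function `split_range_query` and this splitting rule are illustrative assumptions, not the procedure from the paper.

```python
from itertools import product

def split_range_query(ranges):
    """Split a range query over ODM codes into sub-queries that each hit one cuboid.

    ranges: one (lo, hi) pair per dimension; code 0 means ALL.
    Returns a list of (cuboid, per-dimension ranges), where cuboid is the
    tuple of dimension indexes that are grouped (not ALL) in that sub-query.
    """
    per_dim = []
    for lo, hi in ranges:
        pieces = []
        if lo == 0:
            pieces.append((0, 0))       # the ALL value of this dimension
            if hi >= 1:
                pieces.append((1, hi))  # the ordinary values
        else:
            pieces.append((lo, hi))
        per_dim.append(pieces)

    subqueries = []
    for combo in product(*per_dim):
        cuboid = tuple(i for i, (lo, _) in enumerate(combo) if lo >= 1)
        subqueries.append((cuboid, combo))
    return subqueries

print(split_range_query([(2, 5), (0, 0), (1, 3)]))  # a single cuboid
print(split_range_query([(0, 3), (0, 0)]))          # a union of two sub-queries
```

The first query touches only the cuboid on dimensions 0 and 2; the second is the union of a grand-total lookup and a range over the cuboid on dimension 0.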

Page 11: PrefixCube: Prefix-sharing Condensed Data Cube

BST-Condensed Cube

     A  B  C  M
t1   8  1  1  100
t2   1  8  1  50
t3   1  2  3  60

Base Single Tuple (BST)
– t1 is a BST on SD {A} and {B}
– t2 is a BST on SD {B}

A unique minimal BST-condensed cube (MinCube) is obtained when every BST is fully exploited with all of its SDs.
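The BST examples above can be reproduced with a short check. The sketch assumes that a base tuple is a BST on a set of dimensions SD when it is the only tuple in its partition after partitioning the relation on SD; only single-dimension SDs are tested here, and `single_dimensions` is a hypothetical helper name.

```python
from collections import Counter

def single_dimensions(rows, dims):
    """For each base tuple, list the single dimensions on which it is alone
    in its partition (assumed meaning of 'BST on SD')."""
    counters = [Counter(row[i] for row in rows.values()) for i in range(len(dims))]
    result = {}
    for name, row in rows.items():
        result[name] = [dims[i] for i in range(len(dims)) if counters[i][row[i]] == 1]
    return result

# The example relation from the slide: (A, B, C, M)
R = {"t1": (8, 1, 1, 100), "t2": (1, 8, 1, 50), "t3": (1, 2, 3, 60)}
sds = single_dimensions(R, ("A", "B", "C"))
print(sds)
```

Under this assumption t1 is alone for A = 8 and B = 1, and t2 is alone for B = 8, matching the two bullets; t3 also qualifies on B and C.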

Page 12: PrefixCube: Prefix-sharing Condensed Data Cube

BU-BST Condensed Cube

BottomUpBST algorithm (ICDE’02)
Each BST corresponds to only one SD.
It is easier to compute, and to restore normal cube tuples from the condensed cube, compared with MinCube.

Note: BST condensing is a special kind of prefix-sharing!

A  B  C  M
8  *  *  10
8  1  *  10
8  *  1  10
8  1  1  10

     A  B  C  M   SD
ct7  8  1  1  10  {A}

A group of cube tuples sharing a prefix is represented by a BST!
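Restoring the normal cube tuples from a stored BST can be sketched as follows. The rule used here is an assumption consistent with the ct7 example: dimensions in the SD keep their value, every other non-ALL dimension is either kept or replaced by ALL. The name `expand_bst` is hypothetical.

```python
from itertools import combinations

ALL = "*"

def expand_bst(values, measure, sd, dims):
    """Cube tuples represented by one BST (assumed restore rule, see lead-in)."""
    n = len(dims)
    sd_idx = {dims.index(d) for d in sd}
    # Dimensions that may be either kept or rolled up to ALL.
    free = [i for i in range(n) if i not in sd_idx and values[i] != ALL]
    out = []
    for k in range(len(free) + 1):
        for kept in combinations(free, k):
            keep = sd_idx | set(kept)
            cell = tuple(values[i] if i in keep else ALL for i in range(n))
            out.append((cell, measure))
    return out

dims = ("A", "B", "C")
restored = expand_bst((8, 1, 1), 10, ("A",), dims)
for cell, m in restored:
    print(cell, m)
```

For ct7 this yields the four tuples (8, *, *), (8, 1, *), (8, *, 1) and (8, 1, 1), all with the same measure.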

Page 13: PrefixCube: Prefix-sharing Condensed Data Cube

A BU-BST Condensed Cube Example

     A  B  C  M
t1   8  1  1  100
t2   1  8  1  50
t3   1  2  3  60

      A  B  C  M    SID / CID
ct1   *  *  *  210  ALL
ct2   1  *  *  110  A
ct3   1  2  3  60   AB
ct4   1  8  1  50   AB
ct5   1  *  1  50   AC
ct6   1  *  3  60   AC
ct7   8  1  1  100  A
ct8   *  1  1  100  B
ct9   *  2  3  60   B
ct10  *  8  1  50   B
ct11  *  *  1  150  C
ct12  *  *  3  60   C

Note:
– Intra-cuboid prefix redundancy: ct3 and ct4
– Inter-cuboid prefix redundancy: ct2, ct3 and ct5
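The twelve tuples ct1 to ct12 can be reproduced by a BUC-style recursion that stops descending as soon as a partition holds a single base tuple. This is a simplified sketch in the spirit of BottomUpBST, not the published algorithm; SUM is assumed as the aggregate, and treating single-tuple partitions on the last dimension as normal tuples, as well as the "BST"/"normal" labels, are assumptions.

```python
from collections import defaultdict

ALL = "*"

def bu_bst(rows, dims, start=0, prefix=None, out=None):
    """Bottom-up cubing that emits one BST per single-tuple partition."""
    n = len(dims)
    if out is None:
        out = []
        prefix = (ALL,) * n
        out.append((prefix, sum(r[n] for r in rows), "normal", "ALL"))
    for d in range(start, n):
        parts = defaultdict(list)
        for r in rows:
            parts[r[d]].append(r)
        for value in sorted(parts):
            part = parts[value]
            cell = prefix[:d] + (value,) + prefix[d + 1:]
            name = "".join(dims[i] for i in range(n) if cell[i] != ALL)
            if len(part) == 1 and d < n - 1:
                # Single base tuple: store it once and stop recursing.
                base = part[0]
                out.append((cell[:d + 1] + tuple(base[d + 1:n]), base[n], "BST", name))
            else:
                out.append((cell, sum(r[n] for r in part), "normal", name))
                bu_bst(part, dims, d + 1, cell, out)
    return out

R = [(8, 1, 1, 100), (1, 8, 1, 50), (1, 2, 3, 60)]  # t1, t2, t3
condensed = bu_bst(R, ("A", "B", "C"))
for cell, measure, kind, name in condensed:
    print(cell, measure, kind, name)
print(len(condensed))
```

The full cube of this relation has 20 tuples (1 + 2 + 3 + 2 + 3 + 3 + 3 + 3 over the eight cuboids), so condensing to 12 already removes a good share of them.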

Page 14: PrefixCube: Prefix-sharing Condensed Data Cube

Prefix-sharing Condensed Cube - PrefixCube

BST Condensing + Intra-cuboid prefix-sharing

Prefix-sharing

PrefixCube

Page 15: PrefixCube: Prefix-sharing Condensed Data Cube

A PrefixCube Example

[Figure: PrefixCube of the example relation. Labels: V-Roots, N-Roots; SID = A, SID = AB, SID = B; CID = ALL, CID = AC, CID = A, CID = A.]
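Intra-cuboid prefix-sharing can be illustrated with a plain trie: tuples that belong to the same SID or CID group and share leading dimension values share the corresponding nodes. The nested-dict layout below is only an illustrative assumption and not the physical PrefixCube structure, which cannot be recovered from the figure.

```python
def build_trie(tuples):
    """Insert (key, measure) pairs into a nested-dict trie; keys that share
    a prefix share the nodes for that prefix."""
    root = {}
    for key, measure in tuples:
        node = root
        for value in key[:-1]:
            node = node.setdefault(value, {})
        node[key[-1]] = measure
    return root

def count_values(trie):
    """Number of dimension values stored in the trie."""
    total = 0
    for child in trie.values():
        total += 1
        if isinstance(child, dict):
            total += count_values(child)
    return total

# The two BSTs with SID = AB from the example: ct3 and ct4
group_ab = [((1, 2, 3), 60), ((1, 8, 1), 50)]
trie = build_trie(group_ab)
print(trie)
print(count_values(trie), "values in the trie vs", sum(len(k) for k, _ in group_ab), "stored flat")
```

The shared A value 1 is stored once instead of twice; on larger groups with long common prefixes the saving grows accordingly.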

Page 16: PrefixCube: Prefix-sharing Condensed Data Cube

Corresponding Dwarf

[Figure: Dwarf of the same example, with levels labeled A Dimension, B Dimension and C Dimension and nodes labeled (node1) to (node4).]

Page 17: PrefixCube: Prefix-sharing Condensed Data Cube

PrefixCube vs. Dwarf

                              PrefixCube       Dwarf
Prefix-sharing                Intra-cuboid     Inter- and intra-cuboid
Suffix coalescing             BST condensing   Multi-tuple condensing
Compression ratio             Lower            Higher
Saving extra value ALL?       No               Yes
Tuples clustered by cuboid?   Yes              No

PrefixCube does not blindly pursue the highest compression ratio; it is intended to strike a good compromise among cube size reduction ratio, restoring and updating costs, and query characteristics!

Page 18: PrefixCube: Prefix-sharing Condensed Data Cube

Effectiveness of Size Reduction

Datasets
– synthetic datasets with uniform distribution
– # of tuples: 1,000,000

[Figure: Size Ratio (0% to 100%) vs. Number of Dimensions (2 to 9) for BU-BST and PrefixCube; panels (a) Cardinality = 100 and (b) Cardinality = 1000.]

Page 19: PrefixCube: Prefix-sharing Condensed Data Cube

Effectiveness of Size Reduction

PrefixBUC
– Full Cube (computed by BUC)
– Prefix-sharing

[Figure: Size Ratio (0% to 100%) vs. Number of Dimensions (2 to 9) for C=100 and C=1000.]

Page 20: PrefixCube: Prefix-sharing Condensed Data Cube

Impact of Data Density

Datasets
– Uniform distribution
– # of dimensions: 6
– Cardinality of dimensions: 100
– # of tuples: range from 1,000 to 1,000,000

[Figure: Size Ratio (0% to 100%) vs. Number of Tuples (1.E+03 to 1.E+06) for BU-BST, PrefixCube and PrefixBUC.]

Page 21: PrefixCube: Prefix-sharing Condensed Data Cube

Impact of Data Skewness

Datasets
– Zipf distribution
– # of tuples: 1,000,000
– Cardinality of dimensions: ranging from 1,000 down to 500 in steps of 100
– Zipf factor: ranging from 0 to 0.8 in steps of 0.2

[Figure: Size Ratio (0% to 100%) vs. Zipf Factors (0 to 0.8) for BU-BST, PrefixCube and PrefixBUC.]

Page 22: PrefixCube: Prefix-sharing Condensed Data Cube

Real-world Dataset

Datasets
– Weather Datasets
– # of tuples: 1,015,367

[Figure: Time (sec.), 0 to 700, vs. Number of Dimensions (2 to 9) for BUC, BU-BST and PrefixCube.]

[Figure: Size Ratio (0% to 100%) vs. Number of Dimensions (2 to 9) for BU-BST, PrefixCube and PrefixBUC.]

Page 23: PrefixCube: Prefix-sharing Condensed Data Cube

Conclusion

A new cube structure, PrefixCube, was proposed by augmenting BU-BST condensing with intra-cuboid prefix-sharing.
– It can greatly reduce the data cube’s size compared with the BU-BST condensed cube.
– It can also reduce the impact of data skew on BU-BST condensing.
– It achieves a quite stable size reduction on both dense and sparse datasets.

Page 24: PrefixCube: Prefix-sharing Condensed Data Cube

The End

Thank you!

Any questions?