IBM Almaden Research Center

© 2006 IBM Corporation

Wringing a Table Dry: Using CSVZIP to Compress a Relation to its Entropy

Vijayshankar Raman & Garret Swart

Oxide is cheap, so why compress?

Make better use of memory

– Increase capacity of an in-memory database

– Increase effective cache size of an on-disk database

Make better use of bandwidth

– I/O and memory bandwidth are expensive to scale

– ALU operations are cheap and getting cheaper

Minimize storage and replication costs

Why compress relations?

Relations are important for structured information

Text, video, audio, and image compression are more advanced than relational compression

Statistical and structural properties of the relation can be exploited to improve compression

Relational data have special access patterns

– Don’t just “inflate.” Need to run selections, projections and aggregations

Our results

Near optimal compression of relational data

– Exploits data skew, column correlations and lack of ordering

– Theory: Compress m i.i.d. tuples to within 4.3 m bits of entropy (but theory doesn’t count dictionaries)

– Practice: Between 8 and 40x compression

Scanning compressed relational data

– Directly perform projections, equality and range selections, and joins on entropy compressed data

– Cache efficient dictionary usage

– Query short circuiting

CSVZIP Flow

[Figure: flow diagram with the boxes Raw Data, Analyze, Meta Data & Dictionaries, Compress, Compressed Data, Query, Results, Update, and New Raw Data; part of the diagram is labeled "This Talk".]

Analyze to determine compression plan

Compress to reduce size

Execute many queries over compressed data

Periodically update data and dictionaries

Sources of Redundancy in Relations

Column value space much smaller than domain
– |C| << |domain(C)|
– Type-specific transformations, dictionaries
– Example: {“Apple”, “Pear”, “Mango”} in CHAR(10)

Skew in value frequency
– H(C) << lg |C|
– Entropy encoding (e.g. Huffman codes)
– Example: 90% of fruits are “Apple”

Column correlations within a tuple
– H(C1, C2) << H(C1) + H(C2)
– Column co-coding
– Example: Mangos are mainly sold in August

Incidental tuple ordering
– H({T1, T2, …, Tm}) ~ H(T1, T2, …, Tm) – m lg m
– Sort and delta code

Tuple correlations
– If correlated tuples share common columns, sort first on those columns
– Example: Mango buyers also buy paper towels
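The skew and co-coding inequalities can be checked on a toy table. The sketch below is illustrative only: the rows, the `entropy` helper, and the choice of a (fruit, month) pair are assumptions made up for this example, not data from the talk.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical entropy, in bits, of a sequence of hashable values."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())

# Hypothetical (fruit, month) rows: "Apple" dominates, "Mango" sells only in August.
rows = ([("Apple", "Jan")] * 45 + [("Apple", "Aug")] * 45
        + [("Mango", "Aug")] * 8 + [("Pear", "Jan")] * 2)

fruit = [r[0] for r in rows]
month = [r[1] for r in rows]

h_fruit = entropy(fruit)            # skew: H(C) is well below lg |C|
h_sum = h_fruit + entropy(month)    # cost of coding the two columns separately
h_joint = entropy(rows)             # cost of co-coding the pair

print(f"H(fruit) = {h_fruit:.3f} bits, lg |C| = {log2(len(set(fruit))):.3f} bits")
print(f"H(fruit) + H(month) = {h_sum:.3f} bits, H(fruit, month) = {h_joint:.3f} bits")
```

The gap between `h_sum` and `h_joint` is the number of bits per tuple that co-coding can recover in this made-up table.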

Compression Process: Step 1

[Figure: the input tuple (Male, John, 08/10/06, Mango) flows through a co-code transform (Columns 1 & 2) and a type-specific transform (Column 3, split into Column 3.A and Column 3.B). Each resulting column code is Huffman encoded against its own dictionary, and the column codes are concatenated into the tuple code. Intermediate values shown: Male/John, Male/John/Sat, Sat 2006, w35, w35/Mango.]

Column codes:   101101011   001       01011101
Probabilities:  p = 1/512   p = 1/8   p = 1/512
Tuple code:     10110101100101011101

First-name frequencies:
Michael  4.2%
David    3.8%
James    3.6%
Robert   3.5%
John     3.5%
William  2.5%
Mark     2.4%
Richard  2.3%
Thomas   1.9%
Steven   1.5%

Day-of-week distribution by gender:
        Mon  Tue  Wed  Thu  Fri  Sat  Sun
Male    3%   4%   10%  6%   23%  42%  12%
Female  4%   5%   9%   15%  17%  28%  22%
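As a rough illustration of the Huffman step, the sketch below builds a Huffman code per column dictionary and concatenates the column codes into a tuple code. The dictionaries, probabilities, and the names `huffman_code` and `tuple_code` are assumptions for this example; the talk does not show its implementation. With probabilities that are powers of two, each code is lg(1/p) bits long, in the same way that the slide's p = 1/8 entry lines up with a 3-bit code.

```python
import heapq
from math import log2

def huffman_code(probs):
    """Return {symbol: bit string} for a dict {symbol: probability}."""
    # Heap entries are (probability, tie-breaker, {symbol: partial code}).
    # The unique tie-breaker keeps Python from ever comparing the dicts.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in probs}
    next_id = len(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next_id, merged))
        next_id += 1
    return heap[0][2]

# Hypothetical per-column dictionaries (probabilities are made up).
day_probs = {"Sat": 0.5, "Fri": 0.25, "Sun": 0.125, "Thu": 0.125}
fruit_probs = {"Apple": 0.5, "Pear": 0.25, "Mango": 0.25}

day_code = huffman_code(day_probs)
fruit_code = huffman_code(fruit_probs)

def tuple_code(day, fruit):
    """Concatenate the column codes; prefix-free codes need no delimiters."""
    return day_code[day] + fruit_code[fruit]

print(day_code, fruit_code, tuple_code("Sat", "Mango"))
```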

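Compression Process: Step 2

[Figure: tuple codes are sorted. The first tuple code is stored as is. Each later tuple code is differenced with the previous tuple code, the delta is Huffman encoded against a dictionary, and the resulting delta code is appended to the compression block. Sample tuple codes shown: 1011010111000011101, 10110101110001011101.]

Delta                   Delta code
0000000000000000001     000
00000000000000000001    010
0000000000000000101     1110

Compression block (first tuple code, then the delta codes): 10110101110001011101 000 010 1110

Look Ma, no delimiters! 101101011100010111010000101110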
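The sort-and-delta step can be sketched as follows. This is a minimal illustration that treats tuple codes as integers; the sample codes and the function names are assumptions, and the Huffman coding of the deltas is left out.

```python
def delta_encode(tuple_codes):
    """Sort the tuple codes and return (first_code, deltas between neighbours)."""
    ordered = sorted(tuple_codes)
    deltas = [b - a for a, b in zip(ordered, ordered[1:])]
    return ordered[0], deltas

def delta_decode(first, deltas):
    """Rebuild the sorted tuple codes by running sums over the deltas."""
    codes = [first]
    for d in deltas:
        codes.append(codes[-1] + d)
    return codes

# Hypothetical 20-bit tuple codes; a relation is a multi-set, so order is free
# and duplicates simply produce a delta of zero.
tuple_codes = [
    int("10110101110001100010", 2),
    int("10110101110001011101", 2),
    int("10110101110001011110", 2),
    int("10110101110001011101", 2),
]
first, deltas = delta_encode(tuple_codes)
print(format(first, "020b"), deltas)
```

Because sorting clusters similar tuples, the deltas are small numbers with many leading zeros, which is what makes them cheap to entropy code.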

Compression Results

P1 – P6: Various projections of TPC-H tables

P7: SAP SEOCOMPODF

P8: TPC-E Customer

Huffman Code Scan operations

SELECT SUM(price) FROM Sale WHERE week(saleDate) = 23 AND fruit = 'Mango' AND year(saleDate) BETWEEN 1997 AND 2005

Scan this: 1011010110010101110101001

– Skip Over first column: Need length

– Range Compare on 2nd column: year in 1997 to 2005

– Equality Compare 3rd column: Week = 23, fruit = Mango

– Decode 4th column for aggregation

Segregated Coding: Faster operations, same compression

– Assign Huffman codes in order of length
  • |code(v)| < |code(w)| ⇒ code(v) < code(w)

– Sort codes within a length
  • |code(v)| = |code(w)| ⇒ (v < w ⇒ code(v) < code(w))

Year  Code
2003  000
2006  001
1997  010000
1999  010001
2000  010010
2004  010011
1998  0101000
2001  0101001
2002  0101010
2005  0101011
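The two segregated-coding rules can be sketched as a canonical-style code assignment: visit values by (code length, value) and give each the next available code. The function name `segregated_codes` is an assumption, and the code lengths are read off the Year table rather than derived from frequencies. Under those assumptions the sketch reproduces the table's codes.

```python
def segregated_codes(lengths):
    """Map {value: code length in bits} to {value: bit string}.

    Shorter codes come first, and within one length codes follow value order.
    """
    codes = {}
    code = 0
    prev_len = None
    for value, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        if prev_len is not None:
            # Step to the next code, then pad with zeros if the length grew.
            code = (code + 1) << (length - prev_len)
        codes[value] = format(code, "0{}b".format(length))
        prev_len = length
    return codes

# Code lengths taken from the Year table.
YEAR_LENGTHS = {2003: 3, 2006: 3,
                1997: 6, 1999: 6, 2000: 6, 2004: 6,
                1998: 7, 2001: 7, 2002: 7, 2005: 7}

codes = segregated_codes(YEAR_LENGTHS)
for year, code in sorted(codes.items(), key=lambda kv: (len(kv[1]), kv[0])):
    print(year, code)
```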

Segregated Coding: Computing Code Length

One code length ⇒ constant function

– #define codeLen(w) 6

Second largest code length << lg L1 cache size ⇒ use lookup table

– #define codeLen(w) codeTable[(w) >> 26]

Otherwise compare input with max code of each length

– #define codeLen(w) \
    ((w) <= 0b00111111… ? 3 : \
     (w) <= 0b01001111… ? 6 : \
     (w) <= 0b01010111… ? 7 : … )
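The third strategy (compare the input against the largest code of each length) can be sketched as below. The 32-bit word width, the helper names, and the left-aligned layout are assumptions for illustration; the codes are those of the Year table. Each threshold is the largest code of a given length, left-aligned and padded with ones, which matches the shape of the constants in the macro.

```python
WORD_BITS = 32  # assumed width of the left-aligned input word

YEAR_CODES = ["000", "001", "010000", "010001", "010010", "010011",
              "0101000", "0101001", "0101010", "0101011"]

def max_code_thresholds(codes):
    """Return [(threshold, length)] sorted by length.

    threshold = largest code of that length, left-aligned, padded with 1 bits.
    """
    max_by_len = {}
    for c in codes:
        n = len(c)
        max_by_len[n] = max(max_by_len.get(n, 0), int(c, 2))
    thresholds = []
    for n in sorted(max_by_len):
        pad = WORD_BITS - n
        thresholds.append(((max_by_len[n] << pad) | ((1 << pad) - 1), n))
    return thresholds

def code_len(w, thresholds):
    """Length of the code at the front of word w, without a dictionary lookup."""
    for threshold, n in thresholds:
        if w <= threshold:
            return n
    raise ValueError("word does not start with a valid code")

THRESHOLDS = max_code_thresholds(YEAR_CODES)
word = int("0101001" + "1" * 25, 2)   # code for 2001 followed by other bits
print(code_len(word, THRESHOLDS))
```

This only works because segregated coding makes shorter codes numerically smaller than longer ones when left-aligned.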

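Segregated Coding: Range Query

SELECT * WHERE col BETWEEN 112 AND 302

Value  Code
100    000
200    001
300    010
98     0110
180    0111
220    1000
322    1001
87     10100
111    10101
190    10110
232    10111
256    11000
278    11001
298    11010
302    11011
333    11100

/* w is a 16-bit, left-aligned code word, matching the 16-bit constants below */
switch (codeLen(w)) {
case 3: return w >> 13 != 0;   /* 200 and 300 qualify; only 100 (code 000) does not */
case 4: return w >= 0b0111000000000000 && w <= 0b1000111111111111;   /* 180 to 220 */
case 5: return w >= 0b1011000000000000 && w <= 0b1101111111111111;   /* 190 to 302 */
}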
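The per-length bounds used in the switch can be derived mechanically from the dictionary once, when the query is compiled. The sketch below assumes the value/code table of this slide and uses hypothetical helper names; it works on bit strings rather than machine words to stay short.

```python
CODES = {100: "000", 200: "001", 300: "010",
         98: "0110", 180: "0111", 220: "1000", 322: "1001",
         87: "10100", 111: "10101", 190: "10110", 232: "10111",
         256: "11000", 278: "11001", 298: "11010", 302: "11011",
         333: "11100"}

def range_bounds(codes, lo, hi):
    """For each code length, the (min code, max code) over values in [lo, hi]."""
    bounds = {}
    for value, code in codes.items():
        if lo <= value <= hi:
            n, c = len(code), int(code, 2)
            low, high = bounds.get(n, (c, c))
            bounds[n] = (min(low, c), max(high, c))
    return bounds

def in_range(code, bounds):
    """Evaluate the range predicate on a code, with no dictionary access."""
    b = bounds.get(len(code))
    return b is not None and b[0] <= int(code, 2) <= b[1]

BOUNDS = range_bounds(CODES, 112, 302)
matches = sorted(v for v, c in CODES.items() if in_range(c, BOUNDS))
print(BOUNDS, matches)
```

One interval per length is enough because codes of the same length are assigned in value order, so the qualifying values of each length form a contiguous run of codes.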

Advantages of Segregated Coding

Find code length quickly

– No access to dictionary

Fast Range query

– No access to dictionary for constant ranges

Cache Locality

– Because values are sorted by code length, commonly used values are clustered near the beginning of the array

– The beginning of the array is most likely to be in cache, improving the cache hit ratio

Query Short Circuiting

Reuse predicates and values that depend on unchanged columns

Sorting causes many unchanged columns

Previous Tuple:    1011010111000011000
Delta Value:     + 0000000000000000101
Next Tuple:        1011010111000011101

Common Bits: 1011010111000011

Unchanged Columns: Gender/FName, Year

Reused predicates: Sex = Male, Name = John, Year ≥ 2005

Reduces instructions but adds a branch!
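A minimal sketch of the short-circuit test: XOR the previous and next tuple codes to find the shared prefix, then count the leading columns that lie wholly inside it. The column widths (9, 3, and 7 bits) and the function names are assumptions chosen for illustration; the talk does not give the actual column boundaries.

```python
def common_prefix_bits(prev, nxt, width):
    """Number of leading bits shared by two width-bit tuple codes."""
    return width - (prev ^ nxt).bit_length()

def unchanged_columns(prev, nxt, column_widths):
    """Count the leading columns whose codes lie entirely in the common prefix."""
    shared = common_prefix_bits(prev, nxt, sum(column_widths))
    count, end = 0, 0
    for width in column_widths:
        end += width
        if end > shared:
            break
        count += 1
    return count

prev = int("1011010111000011000", 2)
nxt = prev + 0b101                     # apply the delta from the slide
widths = [9, 3, 7]                     # assumed: Gender/FName, Year, remainder

print(format(nxt, "019b"),
      common_prefix_bits(prev, nxt, 19),
      unchanged_columns(prev, nxt, widths))
```

Predicates on the unchanged leading columns can reuse their results from the previous tuple; only the remaining columns need to be decoded and tested again.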

Selected Prior Work

Entropy coding
– Shannon (1948), Huffman (1952)
– Arithmetic coding: Abramson (1963), Pasco, Rissanen (1976)

Row or page coding
– Compress each row or page independently. Decompress on page load or row touch. Compression code is localized. [Oracle, DB2, IMS]

Column-wise coding
– Each column value gets a fixed-length code from a per-column dictionary. [Sybase IQ, CStore, MonetDB]
– Pack multiple short values into 16-bit quantities and decode them as a unit to save CPU [Abadi/Madden/Ferreira]

Delta coding
– Sort and difference, or remove common prefix from adjacent codes [Inverted Indices, B-trees, CStore]

Text coding
– “gzip” style coding using n-grams, Huffman codes, and sliding dictionaries [Ziv, Lempel, Welch, Katz]

Order preserving codes
– Allows range queries at a cost in compression [Hu/Tucker, Antoshenkov/Murray/Lomet, Zandi/Iyer/Langdon]

Lossy coding
– Model based lossy compression: SPARTAN, Vector quantization

Work in Progress

Analysis to find best:

– Dictionaries that fit in L2 cache size

– Set of columns to co-code

– Column ordering for sort

Generate code for efficient queries on x86-64, Power5 and Cell

– Don’t interpret meta-data at run time

– Utilize architecture features

Update

– Incremental update of dictionaries. Background merge of new rows.

Release of CSVZIP utilities

Observations

Entropy decoding uses less I/O, but more ALU ops than conventional decoding

– Our technique removes the cache as a problem

– Have to squeeze every ALU op: Trends in favor

Variable-length codes make vectorization and out-of-order execution hard

– Exploit compression block parallelism instead

These techniques can be exploited in a column store

Back up

Entropy Encoding on a Column Store

Don’t build tuple code: Treat tuple as vector of column codes and sort lexicographically

Columns early in the sort: Run length encoded deltas

Columns in the middle of the sort: Entropy encoded deltas

Columns late in the sort: Concatenated column codes

Independently break columns into compression blocks

Make dictionaries bigger because only using one at a time

Entropy: A measure of information content

Entropy of a random variable R

– The expected number of bits needed to represent the outcome of R

– $H(R) = \sum_{r \in \mathrm{domain}(R)} \Pr(R = r) \, \lg \frac{1}{\Pr(R = r)}$

Conditional entropy of R given S

– The expected number of bits needed to represent the outcome of R given we already know the outcome of S.

– $H(R \mid S) = \sum_{s \in \mathrm{domain}(S)} \sum_{r \in \mathrm{domain}(R)} \Pr(R = r \wedge S = s) \, \lg \frac{1}{\Pr(R = r \wedge S = s)} - H(S)$

If R is a random relation of size n, then R is a multi-set of random variables {T1, …, Tn} where each random tuple Ti is a cross product of random attributes C1i … Cki
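The conditional entropy definition can be evaluated directly from a joint distribution as the joint entropy minus H(S). The sketch below is an illustration with made-up distributions and hypothetical function names.

```python
from math import log2

def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H(R | S) for a joint distribution {(r, s): probability}."""
    marginal_s = {}
    for (_, s), p in joint.items():
        marginal_s[s] = marginal_s.get(s, 0.0) + p
    return entropy(joint) - entropy(marginal_s)

# R fully determined by S: knowing S leaves nothing to encode.
determined = {("a", "x"): 0.5, ("b", "y"): 0.5}
# R independent of S: knowing S does not help, so H(R | S) = H(R) = 1 bit.
independent = {(r, s): 0.25 for r in "ab" for s in "xy"}

print(conditional_entropy(determined), conditional_entropy(independent))
```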

The Entropy of a Relation

We define a random relation R of size m over D as a random variable whose outcomes are multi-sets of size m where each element is chosen identically and independently from an arbitrary tuple distribution D. The results are dependent on H(D) and thus on the optimal encoding of tuples chosen from D.

– If we do a good job of co-coding and Huffman coding, then the tuple codes are entropy coded: They are random bit strings whose length depends on the distribution of the column values but whose entropy is equal to their length

Lemma 2: The Entropy of random relation R of size m over a distribution D is at least m H(D) – lg m!

Theorem 3: The Algorithm presented compresses a random relation R of size m to within H(R) + 4.3 m bits, if m > 100
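To get a feel for the lg m! term in Lemma 2, the sketch below evaluates it numerically. The relation size m = 1,000,000 is an arbitrary assumption. By Stirling's approximation, lg m! is close to m (lg m – lg e), so ignoring tuple order can save at most roughly lg m – 1.44 bits per tuple.

```python
from math import lgamma, log, log2, e

def lg_factorial(m):
    """lg(m!) computed through the log-gamma function to avoid huge integers."""
    return lgamma(m + 1) / log(2)

m = 1_000_000                       # assumed relation size, for illustration
saving_per_tuple = lg_factorial(m) / m
stirling_estimate = log2(m) - log2(e)

print(f"lg m! / m = {saving_per_tuple:.4f} bits per tuple")
print(f"lg m - lg e = {stirling_estimate:.4f} bits per tuple")
```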

Proof of Lemma 2

Let R be a random vector of m tuples i.i.d. over distribution D whose outcomes are sequences of m tuples, t1, …, tm.

Obviously H(R) is m H(D).

Consider an augmentation of R that adds an index to each tuple so that ti has the value i appended. Define R1 as a set consisting of exactly those values. H(R1) = m H(D) as there is a bijection between R1 and R

But the random multi-set R is a projection of the set R1 and there are exactly m! equal probability sets R1 that each project to each outcome of R so H(R1) ≤ H(R) + lg m! and thus H(R) ≥ m H(D) – lg m!

Proof sketch of Theorem 3

Lemma 1 says: If R is random multi-set of m values over the uniform distribution 1..m and m > 100, then H(delta(sort(R))) < 2.67 m.

But we have values from an arbitrary distribution, so work by cases

– For values longer than lg m bits, truncate, getting a uniform distribution in the range.

– For values shorter than lg m bits, append random bits, also getting a uniform distribution.
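Lemma 1 can be sanity-checked empirically, though not proved, by sampling. The sketch below draws m values uniformly from 1..m, sorts them, and measures the order-0 empirical entropy of the deltas. That figure is only a rough proxy for H(delta(sort(R))) / m. The sample size, seed, and function name are assumptions.

```python
import random
from collections import Counter
from math import log2

def delta_entropy_per_value(m, seed=0):
    """Order-0 empirical entropy (bits) of deltas between sorted uniform values."""
    rng = random.Random(seed)
    values = sorted(rng.randint(1, m) for _ in range(m))
    deltas = [b - a for a, b in zip(values, values[1:])]
    counts = Counter(deltas)
    n = len(deltas)
    return sum((c / n) * log2(n / c) for c in counts.values())

bits = delta_entropy_per_value(10_000)
print(f"about {bits:.2f} bits per delta, compared with the 2.67 bound per value")
```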