
DATABASE SYSTEM IMPLEMENTATION

GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ

LECTURE #4: SYSTEM CATALOGS & DATABASE COMPRESSION


OFFICE HOURS

Prashanth Dintyala
→ Office Hours: Mon, 1:30-2:30 PM
→ Location: Near KACB 3324
→ Email: [email protected]

Sonia Matthew
→ Office Hours: Wed, 11 AM-12 PM
→ Location: Near KACB 3324
→ Email: [email protected]


HOMEWORK #1

We have posted clarifications on Piazza
→ More clarifications will be provided over time

Separately submit the design document on Gradescope
→ Homework 1 - Design Doc


TODAY’S AGENDA

System Catalogs

Compression Background
Naïve Compression
OLAP Columnar Compression
OLTP Index Compression


SYSTEM CATALOGS

A group of tables and views that describe the structure of the database.

Each system catalog table contains information about specific elements in the database.
→ Statistics related to queries
→ Statistics related to data distribution
→ List of indexes
→ List of user tables


SYSTEM CATALOGS

Almost every DBMS stores a database's catalog in itself (i.e., using its storage manager).
→ Wrap object abstraction around tuples to avoid invoking SQL queries to retrieve catalog data.
→ Specialized code for "bootstrapping" catalog tables.

The entire DBMS should be aware of transactions in order to automatically provide ACID guarantees for DDL (i.e., schema change) commands and concurrent txns.
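As one concrete illustration of a DBMS storing its catalog in itself, SQLite exposes its schema through an ordinary table named sqlite_master that can be read with plain SQL. The sketch below uses Python's standard sqlite3 module; the table and index names (users, idx_users_name) are made up for the example and do not come from the lecture.

```python
import sqlite3

def list_catalog(conn):
    """Return (type, name, tbl_name) rows from SQLite's built-in catalog table."""
    cur = conn.execute(
        "SELECT type, name, tbl_name FROM sqlite_master ORDER BY type, name"
    )
    return cur.fetchall()

# Hypothetical schema, used only to populate the catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.execute("CREATE INDEX idx_users_name ON users (name)")
catalog_rows = list_catalog(conn)
```

Each row of catalog_rows describes one schema object, which is the "list of user tables" and "list of indexes" role described on the earlier slide.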


SYSTEM CATALOGS

MySQL
→ Special hard-coded scripts to alter the catalog
→ Non-transactional catalog changes

PostgreSQL
→ Uses SQL commands to alter the catalog


SCHEMA CHANGES

ADD COLUMN:
→ NSM: Copy tuples into new region in memory.
→ DSM: Just create the new column segment.

DROP COLUMN:
→ NSM #1: Copy tuples into new region of memory.
→ NSM #2: Mark column as "deprecated", clean up later.
→ DSM: Just drop the column and free memory.

CHANGE COLUMN:
→ Check whether the conversion is allowed to happen. Depends on default values.
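The difference between the two storage models for ADD COLUMN can be sketched with plain Python containers. This is only a toy model: a row store (NSM) is a list of tuples, a column store (DSM) is a dict of lists, and the function names are hypothetical.

```python
def add_column_nsm(rows, default):
    """Row store (NSM): every tuple is copied into a new, wider tuple."""
    return [row + (default,) for row in rows]

def add_column_dsm(columns, name, default, num_rows):
    """Column store (DSM): existing columns are untouched; a new segment is added."""
    columns[name] = [default] * num_rows
    return columns

nsm_table = [(1, "Andy"), (2, "Prashanth")]
nsm_table = add_column_nsm(nsm_table, 0)

dsm_table = {"id": [1, 2], "name": ["Andy", "Prashanth"]}
dsm_table = add_column_dsm(dsm_table, "salary", 0, 2)
```

In the NSM case the work grows with the width of every existing tuple, while the DSM case only allocates the new segment.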


SCHEMA CHANGES

Relevant comment on HackerNews:
→ Schema changes fail to complete on databases > 2 TB.
→ Operation requires double the amount of disk storage for copying. Takes close to a month to perform such an operation on large tables.
→ The big reason that DDL is slow is because these systems haven't tried to make it fast. So, blame the DB designers!
→ Wish it was on my day job list of things I can work on.


INDEXES

CREATE INDEX:
→ Scan the entire table and populate the index.
→ Must not block all other txns during index construction.
→ Have to record changes made by txns that modified the table while another txn was building the index.
→ When the scan completes, lock the table and resolve changes that were missed after the scan started.

DROP INDEX:
→ Just drop the index logically from the catalog.
→ It only becomes "invisible" when the txn that dropped it commits. All existing txns will still have to update it.


OBSERVATION

I/O is the main bottleneck if the DBMS has to fetch data from disk.
→ CPU cost for decompressing data < I/O cost for fetching un-compressed data. Compression always helps.

In-memory DBMSs are more complicated.
→ Compressing the database reduces DRAM requirements and processing.

Key trade-off is speed vs. compression ratio.
→ In-memory DBMSs (always?) choose speed.


REAL-WORLD DATA CHARACTERISTICS

Data sets tend to have highly skewed distributions for attribute values.
→ Example: Zipfian distribution of the Brown Corpus
→ Words like “the”, “a” occur very frequently in books

Data sets tend to have high correlation between attributes of the same tuple.
→ Example: Order Date to Ship Date (few days)
→ (June 5, +5) instead of (June 5, June 10)


DATABASE COMPRESSION

Goal #1: Must produce fixed-length values. Allows us to be efficient while accessing tuples.

Goal #2: Allow the DBMS to postpone decompression as long as possible during query execution. Operate directly on compressed data.

Goal #3: Must be a lossless scheme. No data should be lost during this transformation.


LOSSLESS VS. LOSSY COMPRESSION

When a DBMS uses compression, it is always lossless because people don’t like losing data.

Any kind of lossy compression has to be performed at the application level.
→ Example: Sensor data. Readings are taken every second, but we may only store the average across every minute.
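The sensor example can be sketched as application-side code that collapses per-second readings into one average per minute before anything is written to the DBMS. The function name and the (second, value) input shape are assumptions made for this illustration.

```python
from statistics import mean

def per_minute_average(readings):
    """Collapse (second, value) readings into one average per minute (lossy)."""
    buckets = {}
    for second, value in readings:
        buckets.setdefault(second // 60, []).append(value)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}

readings = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 23.0)]
averages = per_minute_average(readings)
```

The individual readings cannot be recovered from averages, which is why this kind of reduction is left to the application rather than the DBMS.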

Some new DBMSs support approximate queries.
→ Example: BlinkDB, SnappyData, XDB, Oracle (2017)


ZONE MAPS

Pre-computed aggregates for blocks of data.
The DBMS can check the zone map first to decide whether it wants to access the block.

Original Data
val
100
200
300
400
400

Zone Map
type   val
MIN    100
MAX    400
AVG    280
SUM    1400
COUNT  5

SELECT * FROM table
WHERE val > 600
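A minimal sketch of how the zone map lets the query above skip the block: since MAX is 400, no value can satisfy val > 600, so the block is never read. The function names and the list-of-lists block layout are assumptions for illustration.

```python
def build_zone_map(block):
    """Pre-compute simple aggregates for one block of values."""
    return {
        "MIN": min(block),
        "MAX": max(block),
        "AVG": sum(block) / len(block),
        "SUM": sum(block),
        "COUNT": len(block),
    }

def scan_greater_than(blocks, zone_maps, threshold):
    """Return values > threshold, skipping blocks whose MAX rules them out."""
    result, blocks_read = [], 0
    for block, zone_map in zip(blocks, zone_maps):
        if zone_map["MAX"] <= threshold:
            continue  # no value in this block can qualify
        blocks_read += 1
        result.extend(v for v in block if v > threshold)
    return result, blocks_read

blocks = [[100, 200, 300, 400, 400]]
zone_maps = [build_zone_map(b) for b in blocks]
matches, blocks_read = scan_greater_than(blocks, zone_maps, 600)
```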


COMPRESSION GRANULARITY

Choice #1: Block-level
→ Compress a block of tuples for the same table.

Choice #2: Tuple-level
→ Compress the contents of the entire tuple (NSM-only).

Choice #3: Attribute-level
→ Compress a single attribute value within one tuple.
→ Can target multiple attributes for the same tuple.

Choice #4: Column-level
→ Compress multiple values for one or more attributes stored for multiple tuples (DSM-only).


NAÏVE COMPRESSION

Compress data using a general-purpose algorithm. Scope of compression is only based on the data provided as input.
→ LZO (1996), LZ4 (2011), Snappy (2011), Zstd (2015)

Considerations
→ Computational overhead (gzip is super slow)
→ Compress vs. decompress speed.


NAÏVE COMPRESSION

Choice #1: Entropy Encoding
→ More common sequences use fewer bits to encode, less common sequences use more bits to encode.

Choice #2: Dictionary Encoding
→ Build a lookup table that maps logical identifiers to data chunks. Example: 1 ~ “the”
→ Replace those values in the original data with logical identifiers which can be later uncompressed using the lookup table. Example: <1, 5, 3> ~ <“the lookup table”>
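The dictionary encoding idea can be sketched in a few lines: build a lookup table from words to small integers, replace the words with their identifiers, and reverse the mapping to decompress. The sample sentence and the first-seen code assignment are choices made for this illustration, so the codes differ from the <1, 5, 3> example on the slide.

```python
def build_dictionary(words):
    """Assign each distinct word an integer code, in first-seen order from 1."""
    dictionary = {}
    for word in words:
        if word not in dictionary:
            dictionary[word] = len(dictionary) + 1
    return dictionary

def encode(words, dictionary):
    """Replace every word with its logical identifier."""
    return [dictionary[w] for w in words]

def decode(codes, dictionary):
    """Invert the lookup table to recover the original words."""
    reverse = {code: word for word, code in dictionary.items()}
    return [reverse[c] for c in codes]

text = "the lookup table maps the codes".split()
dictionary = build_dictionary(text)
codes = encode(text, dictionary)
```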


MYSQL INNODB COMPRESSION

Source: MySQL 5.7 Documentation

[Figure: Disk pages hold compressed page0, page1, and page2, each with a modification log ([1,2,4,8] KB). The buffer pool holds compressed page0 with its modification log, which receives updates, and an uncompressed copy of page0 (16 KB).]


NAÏVE COMPRESSION

The data has to be decompressed first before it can be read and (potentially) modified.
→ This limits the “scope” of the compression scheme.

These schemes also do not consider the high-level meaning or semantics of the data.
→ Example: Relationship between order and shipping dates
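The first limitation can be seen with any general-purpose compressor. The sketch below uses zlib only because it ships with Python; it stands in for the LZ4/Snappy/Zstd family and is not one of the algorithms the lecture lists. The fixed 9-byte record layout is an assumption for the example.

```python
import zlib

def read_record(compressed_block, index, record_size):
    """Decompress the whole block just to read one fixed-size record."""
    raw = zlib.decompress(compressed_block)
    start = index * record_size
    return raw[start:start + record_size]

# Three 9-byte records packed into one block and compressed as opaque bytes.
records = [b"Andy     ", b"Prashanth", b"Andy     "]
block = zlib.compress(b"".join(records))
second = read_record(block, 1, 9)
```

The compressor sees only bytes, so it cannot answer a predicate on a single record without first inflating the entire block.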


OBSERVATION

We can perform exact-match comparisons and natural joins on compressed data if predicates and data are compressed the same way.
→ Range predicates are more tricky…

Original Data
NAME       SALARY
Andy       99999
Prashanth  88888

SELECT * FROM users
WHERE name = 'Andy'

Compressed Data
NAME  SALARY
XX    AA
YY    BB

SELECT * FROM users
WHERE name = XX
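The rewrite from name = 'Andy' to name = XX can be sketched with a dictionary-encoded column: the literal in the predicate is translated to its code once, and the scan then compares integers only. The helper names are hypothetical, and the codes 0 and 1 play the role of XX and YY.

```python
def compress_column(values):
    """Dictionary-encode a column; returns (codes, dictionary)."""
    dictionary = {}
    codes = []
    for v in values:
        # setdefault returns the existing code, or assigns the next free one.
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return codes, dictionary

def select_equal(codes, dictionary, literal):
    """Evaluate `column = literal` by comparing codes only."""
    code = dictionary.get(literal)
    if code is None:
        return []  # the literal never appears in the column
    return [i for i, c in enumerate(codes) if c == code]

names = ["Andy", "Prashanth", "Andy"]
name_codes, name_dict = compress_column(names)
matching_rows = select_equal(name_codes, name_dict, "Andy")
```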


COLUMNAR COMPRESSION

Compression Schemes
→ Run-length Encoding
→ Bitmap Encoding
→ Delta Encoding
→ Incremental Encoding
→ Mostly Encoding
→ Dictionary Encoding


RUN-LENGTH ENCODING

Compress runs of the same value in a single column into triplets:
→ Example: [‘Atlanta’, ‘Atlanta’, ‘Atlanta’] ~ [‘Atlanta’, 3]
→ The value of the attribute.
→ The # of elements in the run.
→ The start position in the column segment.

Requires the columns to be sorted intelligently to maximize compression opportunities.

DATABASE COMPRESSION, SIGMOD RECORD 1993


RUN-LENGTH ENCODING

Original Data
id  sex
1   M
2   M
3   M
4   F
6   M
7   F
8   M
9   M

Compressed Data
id:  1, 2, 3, 4, 6, 7, 8, 9
sex: (M,0,3) (F,3,1) (M,4,1) (F,5,1) (M,6,2)

RLE Triplet
- Value
- Offset
- Length


RUN-LENGTH ENCODING

Sorted Data
id  sex
1   M
2   M
3   M
6   M
8   M
9   M
4   F
7   F

Compressed Data
id:  1, 2, 3, 6, 8, 9, 4, 7
sex: (M,0,6) (F,6,2)

RLE Triplet
- Value
- Offset
- Length
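A minimal sketch of run-length encoding into (value, offset, length) triplets, with offsets counted from zero as in the triplets above. It also shows why sorting helps: the same eight values collapse from five runs to two. The function names are assumptions for this example.

```python
def rle_encode(column):
    """Encode a column as (value, offset, length) triplets, offsets from 0."""
    triplets = []
    for offset, value in enumerate(column):
        if triplets and triplets[-1][0] == value:
            v, start, length = triplets[-1]
            triplets[-1] = (v, start, length + 1)  # extend the current run
        else:
            triplets.append((value, offset, 1))    # start a new run
    return triplets

def rle_decode(triplets):
    """Expand triplets back into the original column."""
    column = []
    for value, _offset, length in triplets:
        column.extend([value] * length)
    return column

unsorted_sex = list("MMMFMFMM")
sorted_sex = sorted(unsorted_sex, reverse=True)  # all M first, then F
unsorted_runs = rle_encode(unsorted_sex)
sorted_runs = rle_encode(sorted_sex)
```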


BITMAP ENCODING

Store a separate Bitmap for each unique value for a particular attribute where an offset in the vector corresponds to a tuple.
→ The ith position in the Bitmap corresponds to the ith tuple in the table.
→ Typically segmented into chunks to avoid allocating large blocks of contiguous memory.

Only practical if the value cardinality is low.

MODEL 204 ARCHITECTURE AND PERFORMANCE, High Performance Transaction Systems 1987


BITMAP ENCODING

Original Data
id  sex
1   M
2   M
3   M
4   F
6   M
7   F
8   M
9   M

Compressed Data
     sex
id   M  F
1    1  0
2    1  0
3    1  0
4    0  1
6    1  0
7    0  1
8    1  0
9    1  0
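A minimal sketch of bitmap encoding for the sex column above: one bitmap per distinct value, with the ith bit set when the ith tuple has that value. Bitmaps are plain Python lists of 0/1 here for readability; a real system would pack them into machine words.

```python
def bitmap_encode(column):
    """Build one bitmap (list of 0/1) per distinct value in the column."""
    bitmaps = {value: [0] * len(column) for value in set(column)}
    for position, value in enumerate(column):
        bitmaps[value][position] = 1
    return bitmaps

sex = list("MMMFMFMM")
bitmaps = bitmap_encode(sex)
```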


BITMAP ENCODING: EXAMPLE

CREATE TABLE customer_dim (
  id INT PRIMARY KEY,
  name VARCHAR(32),
  email VARCHAR(64),
  address VARCHAR(64),
  zip_code INT
);

Assume we have 10 million tuples and 43,000 zip codes in the US.
→ Uncompressed zip_code column: 10,000,000 × 32 bits = 40 MB
→ Bitmaps: 10,000,000 × 43,000 bits = 53.75 GB

Every time a txn inserts a new tuple, we have to extend 43,000 different bitmaps.
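The two sizes follow from simple arithmetic, sketched here assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and one bit per tuple in each bitmap. The function names are made up for the example.

```python
def raw_size_bytes(num_tuples, bits_per_value):
    """Size of the uncompressed column."""
    return num_tuples * bits_per_value // 8

def bitmap_size_bytes(num_tuples, num_distinct_values):
    """One bit per tuple in each of the per-value bitmaps."""
    return num_tuples * num_distinct_values // 8

raw = raw_size_bytes(10_000_000, 32)             # 32-bit zip_code per tuple
bitmaps = bitmap_size_bytes(10_000_000, 43_000)  # one bitmap per zip code
raw_mb = raw / 10**6
bitmaps_gb = bitmaps / 10**9
```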


DELTA ENCODING

Recording the difference between values that follow each other in the same column.
→ The base value can be stored in-line or in a separate look-up table.
→ Can be combined with RLE to get even better compression ratios.

Original Data
time   temp
12:00  99.5
12:01  99.4
12:02  99.5
12:03  99.6
12:04  99.4

Compressed Data
time   temp
12:00  99.5
+1     -0.1
+1     +0.1
+1     +0.1
+1     -0.2

Compressed Data (delta + RLE on time)
time    temp
12:00   99.5
(+1,4)  -0.1
        +0.1
        +0.1
        -0.2
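A minimal sketch of delta encoding followed by run-length encoding of the deltas. To keep the arithmetic exact, the example assumes times are stored as minutes since midnight and temperatures as tenths of a degree; those representations and the function names are choices made for illustration.

```python
def delta_encode(values):
    """Return (base, deltas): the first value plus successive differences."""
    base = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return base, deltas

def delta_decode(base, deltas):
    """Rebuild the original values by accumulating the deltas."""
    values = [base]
    for d in deltas:
        values.append(values[-1] + d)
    return values

def run_lengths(deltas):
    """Collapse repeated deltas into (delta, count) pairs."""
    runs = []
    for d in deltas:
        if runs and runs[-1][0] == d:
            runs[-1] = (d, runs[-1][1] + 1)
        else:
            runs.append((d, 1))
    return runs

times = [720, 721, 722, 723, 724]  # 12:00 .. 12:04 as minutes since midnight
temps = [995, 994, 995, 996, 994]  # 99.5, 99.4, ... in tenths of a degree
time_base, time_deltas = delta_encode(times)
temp_base, temp_deltas = delta_encode(temps)
time_runs = run_lengths(time_deltas)
```

The time column shrinks to a base value plus a single (+1, 4) run, while the temperature deltas vary too much for RLE to help.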


INCREMENTAL ENCODING

Type of delta encoding whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated.
This works best with sorted data.

Original Data  Common Prefix  Prefix Length  Suffix
rob            -              0              rob
robbed         rob            3              bed
robbing        robb           4              ing
robot          rob            3              ot

The compressed data stores only the Prefix Length and Suffix columns.
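A minimal sketch of incremental (prefix) encoding: each word is stored as the length of the prefix it shares with the previous word plus the remaining suffix. The function names are assumptions for this example.

```python
def incremental_encode(sorted_words):
    """Encode each word as (shared prefix length with previous word, suffix)."""
    encoded, previous = [], ""
    for word in sorted_words:
        n = 0
        while n < min(len(word), len(previous)) and word[n] == previous[n]:
            n += 1
        encoded.append((n, word[n:]))
        previous = word
    return encoded

def incremental_decode(encoded):
    """Rebuild each word from the previous word's prefix plus the stored suffix."""
    words, previous = [], ""
    for prefix_len, suffix in encoded:
        previous = previous[:prefix_len] + suffix
        words.append(previous)
    return words

words = ["rob", "robbed", "robbing", "robot"]
encoded = incremental_encode(words)
```

Decoding has to walk the list from the start, since every entry depends on the word before it.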


MOSTLY ENCODING

When the values for an attribute are “mostly” less than the largest size, you can store them as a smaller data type.
→ The remaining values that cannot be compressed are stored in their raw form.

Source: Redshift Documentation

Original Data (int64)
2
4
99999999
6
8

Compressed Data (mostly8)
2
4
XXX
6
8

Exceptions
offset  value
3       99999999
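The idea can be sketched as follows: values that fit in 8 bits stay in the narrow column, and anything larger goes to a side table keyed by its position. This is only an illustration of the concept, not Redshift's actual storage layout; the 1-based offsets are assumed so that the exception lands at offset 3 as on the slide, and None stands in for the XXX marker.

```python
INT8_MIN, INT8_MAX = -128, 127

def mostly8_encode(values):
    """Store values that fit in 8 bits inline; keep the rest in an exception table."""
    small, exceptions = [], {}
    for offset, value in enumerate(values, start=1):  # 1-based offsets (assumed)
        if INT8_MIN <= value <= INT8_MAX:
            small.append(value)
        else:
            small.append(None)  # placeholder; the raw value lives in `exceptions`
            exceptions[offset] = value
    return small, exceptions

def mostly8_decode(small, exceptions):
    """Merge the narrow column and the exception table back together."""
    return [exceptions[i] if v is None else v
            for i, v in enumerate(small, start=1)]

small, exceptions = mostly8_encode([2, 4, 99999999, 6, 8])
```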


DICTIONARY COMPRESSION

Most pervasive compression scheme in DBMSs.
Replace frequent patterns with smaller codes.

Need to support fast encoding and decoding.
Need to also support range queries.
→ Example: SALARY > 100

DICTIONARY-BASED ORDER-PRESERVING STRING COMPRESSION FOR MAIN MEMORY COLUMN STORES, SIGMOD 2009


DICTIONARY COMPRESSION

When to construct the dictionary?
What is the scope of the dictionary?
How do we allow for range queries?
How do we enable fast encoding/decoding?


DICTIONARY CONSTRUCTION

Choice #1: All At Once
→ Compute the dictionary for all the tuples at a given point of time.
→ New tuples must use a separate dictionary or all the tuples must be recomputed.

Choice #2: Incremental
→ Merge new tuples in with an existing dictionary.
→ Likely requires re-encoding of existing tuples.


DICTIONARY SCOPE

Choice #1: Block-level
→ Only include a subset of tuples within a single table.
→ Potentially lower compression ratio, but can add new tuples more easily.
→ Impact of dictionary data corruption is localized.

Choice #2: Table-level
→ Construct a dictionary for the entire table.
→ Better compression ratio, but expensive to update.

Choice #3: Multi-Table
→ Can be either subset or entire tables.
→ Sometimes helps with joins and set operations.
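A rough sketch of the block-level choice, under the assumption that a block is simply a fixed-size slice of one column. The helpers `encode_block` and `encode_blocks` and the sample values are made up for illustration.

```python
def encode_block(values):
    """Build a dictionary that covers only this block and encode the block with it."""
    dictionary = {}   # value -> code, local to this block
    codes = []
    for value in values:
        # A new value receives the next unused code in this block's dictionary.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def encode_blocks(column, block_size):
    """Encode a column as a list of independently compressed blocks."""
    return [encode_block(column[start:start + block_size])
            for start in range(0, len(column), block_size)]


column = ["GA", "NY", "GA", "NY", "CA", "NY"]   # made-up sample data
blocks = encode_blocks(column, block_size=3)
print(blocks[0])  # ({'GA': 0, 'NY': 1}, [0, 1, 0])
print(blocks[1])  # ({'NY': 0, 'CA': 1}, [0, 1, 0])

# Adding new tuples only creates a new block; existing blocks are untouched.
blocks.append(encode_block(["TX", "TX"]))
```

Note that "NY" is stored in both dictionaries, with a different code in each. That duplication is the lower compression ratio mentioned above, and it also means codes from different blocks cannot be compared directly.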


MULTI-ATTRIBUTE ENCODING

Instead of storing a single value per dictionary entry, store entries that span attributes.
→ I’m not sure any DBMS actually implements this.


Original Data
val1   val2
A      202
B      101
A      202
C      101
B      101

Compressed Data
val1+val2
XX
YY
XX
ZZ
YY

Dictionary
val1   val2   code
A      202    XX
B      101    YY
C      101    ZZ
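A small sketch of the idea with the slide's five rows. Integer codes 0, 1, 2 stand in for the slide's XX, YY, ZZ, and the function names are hypothetical.

```python
def multi_attribute_encode(rows):
    """Replace each (val1, val2) pair with one code from a shared dictionary."""
    dictionary = {}   # (val1, val2) -> code
    codes = []
    for row in rows:
        code = dictionary.setdefault(tuple(row), len(dictionary))
        codes.append(code)
    return dictionary, codes


def multi_attribute_decode(dictionary, codes):
    """Codes are insertion positions, so the ordered key list maps them back."""
    pairs = list(dictionary)   # dicts preserve insertion order (Python 3.7+)
    return [pairs[code] for code in codes]


rows = [("A", 202), ("B", 101), ("A", 202), ("C", 101), ("B", 101)]
dictionary, codes = multi_attribute_encode(rows)
print(dictionary)  # {('A', 202): 0, ('B', 101): 1, ('C', 101): 2}
print(codes)       # [0, 1, 0, 2, 1]
```

Five two-attribute rows shrink to five codes plus three dictionary entries. The benefit depends on the attributes being correlated, so that few distinct combinations occur.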


ENCODING / DECODING

A dictionary needs to support two operations:
→ Encode: For a given uncompressed value, convert it into its compressed form.
→ Decode: For a given compressed value, convert it back into its original form.

No magic hash function will do this for us.
We need two hash tables to support operations in both directions.
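The two-directional requirement might look like the sketch below, where Python dicts play the role of the two hash tables. The class name `Dictionary` and its attribute names are assumptions made for illustration, and codes are handed out in arrival order, so this version is not order-preserving.

```python
class Dictionary:
    """Dictionary codec backed by one hash table per direction."""

    def __init__(self):
        self._code_of = {}    # value -> code (encode direction)
        self._value_of = {}   # code -> value (decode direction)

    def encode(self, value):
        """Return the code for value, assigning a fresh one on first sight."""
        code = self._code_of.get(value)
        if code is None:
            code = len(self._code_of)
            self._code_of[value] = code
            self._value_of[code] = value
        return code

    def decode(self, code):
        """Return the original value for a previously assigned code."""
        return self._value_of[code]


codec = Dictionary()
codes = [codec.encode(name) for name in ["Andrea", "Prashanth", "Andy", "Andrea"]]
print(codes)                              # [0, 1, 2, 0]
print([codec.decode(c) for c in codes])   # ['Andrea', 'Prashanth', 'Andy', 'Andrea']
```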


ORDER-PRESERVING COMPRESSION

The encoded values need to support sorting in the same order as original values.


Original query:
SELECT * FROM users WHERE name LIKE 'And%'

Rewritten query over the compressed data:
SELECT * FROM users WHERE name BETWEEN 10 AND 20

Original Data
name
Andrea
Prashanth
Andy
Dana

Compressed Data
name
10
40
20
30

value       code
Andrea      10
Andy        20
Dana        30
Prashanth   40
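The rewrite from a prefix predicate to a code range can be sketched like this. The helper names and the `step=10` spacing are assumptions chosen so that the codes match the slide.

```python
def build_order_preserving_dictionary(values, step=10):
    """Assign codes in sorted order so that comparing codes mirrors comparing values."""
    return {value: (rank + 1) * step
            for rank, value in enumerate(sorted(set(values)))}


def prefix_to_code_range(dictionary, prefix):
    """Translate LIKE 'prefix%' into an inclusive (low, high) code range, or None."""
    matching = [code for value, code in dictionary.items()
                if value.startswith(prefix)]
    return (min(matching), max(matching)) if matching else None


names = ["Andrea", "Prashanth", "Andy", "Dana"]
dictionary = build_order_preserving_dictionary(names)
column = [dictionary[name] for name in names]
print(column)  # [10, 40, 20, 30]

low, high = prefix_to_code_range(dictionary, "And")
print(low, high)  # 10 20

# Evaluate "name BETWEEN 10 AND 20" directly on the codes, without decoding.
matching_rows = [row for row, code in enumerate(column) if low <= code <= high]
print(matching_rows)  # [0, 2]
```

The range is exact because strings sharing a prefix are contiguous in sorted order, so no non-matching code can fall between the lowest and highest match.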


ORDER-PRESERVING COMPRESSION


SELECT name FROM users WHERE name LIKE 'And%'
→ Have to perform seq scan

SELECT DISTINCT name FROM users WHERE name LIKE 'And%'
→ Only need to access dictionary
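A sketch of why the two queries touch different structures. The dictionary is the one from the slide; the extra fifth row in `column` is a hypothetical addition so that the duplicate becomes visible, and the DISTINCT shortcut assumes every dictionary entry is used by at least one tuple.

```python
dictionary = {"Andrea": 10, "Andy": 20, "Dana": 30, "Prashanth": 40}
value_of = {code: value for value, code in dictionary.items()}
column = [10, 40, 20, 30, 10]   # slide data plus one hypothetical duplicate row


def select_name_like(prefix):
    """SELECT name: one output row per matching tuple, so the column is scanned."""
    wanted = {code for value, code in dictionary.items()
              if value.startswith(prefix)}
    return [value_of[code] for code in column if code in wanted]


def select_distinct_name_like(prefix):
    """SELECT DISTINCT name: each value appears once in the dictionary already."""
    return sorted(value for value in dictionary if value.startswith(prefix))


print(select_name_like("And"))           # ['Andrea', 'Andy', 'Andrea']
print(select_distinct_name_like("And"))  # ['Andrea', 'Andy']
```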


DICTIONARY IMPLEMENTATIONS

Hash Table:
→ Fast and compact.
→ Unable to support range and prefix queries.

B+Tree:
→ Slower than a hash table and takes more memory.
→ Can support range and prefix queries.
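To illustrate the trade-off, the sketch below uses a sorted array searched with `bisect` as a simple stand-in for a B+Tree (it is not a B+Tree implementation). The class name `SortedDictionary` is hypothetical, and codes are just positions in the sorted key list.

```python
import bisect


class SortedDictionary:
    """Ordered dictionary: keys kept sorted, code = position in sorted order."""

    def __init__(self, values):
        self._keys = sorted(set(values))

    def encode(self, value):
        """Exact-match lookup by binary search."""
        index = bisect.bisect_left(self._keys, value)
        if index == len(self._keys) or self._keys[index] != value:
            raise KeyError(value)
        return index

    def prefix_range(self, prefix):
        """Return (low, high) code bounds, high exclusive, for keys with this prefix."""
        low = bisect.bisect_left(self._keys, prefix)
        high = low
        while high < len(self._keys) and self._keys[high].startswith(prefix):
            high += 1
        return low, high


names = ["Andrea", "Prashanth", "Andy", "Dana"]
ordered = SortedDictionary(names)
print(ordered.encode("Dana"))       # 2
print(ordered.prefix_range("And"))  # (0, 2)

# A hash table answers exact matches only; a prefix has no entry of its own.
hashed = {name: code for code, name in enumerate(names)}
print("And" in hashed)              # False
```

With the hash table, answering the prefix query would mean inspecting every key, whereas the ordered structure jumps to the first candidate and stops at the first non-match.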


PARTING THOUGHTS

Dictionary encoding is probably the most useful compression scheme because it does not require pre-sorting.

The DBMS can combine different approaches for even better compression.

It is important to wait as long as possible during query execution to decompress data.


NEXT CLASS

Physical vs. Logical Logging
