introduction to modern analytical db

Topics on Modern Analytical DBs - Including Column-DB Topics! -

2012.11.21

Column-Oriented DB Study Group

Takeshi Yamamuro@maropu


Introduction

• Name

– Takeshi Yamamuro (山室健)

• Interests

– Data engineering / database-related technologies

• PostgreSQL

• Search, compression, sorting

– Improving algorithm performance with modern hardware (CPU/GPU)

• List of today's references – https://docs.google.com/spreadsheet/ccc?key=0AnhMe3L1c3Z5dFZ0TklsQUo5MlhDWHhReEM5NmpDQWc#gid=0

Twitter: maropu

Notice

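• Scope of the topics covered today

– Papers published around C-Store and MonetDB/X100

– Within the DB field, Column-DBs are classified under HW-conscious optimization, so some topics here are not directly related to "columnar layout"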

Long-Term Direction of Analytical DBs

• Stonebraker, “What Does ‘Big Data’ Mean and Who Will Win?”, XLDB, 2012

Introduction to Modern Analytical DB

History of Related DB Research

[Figure: timeline of related DB research, spanning ～1980's, ～2000's, ～2012]

• Columnar-storage for statistical applications (1970's)

• DSM [Geo85]

• DSM in Bubba, highly-parallel DB [Geo85]

• Bottleneck shifts in databases [Ail99, Bon99, Rao99]

• Int'l Workshop on Data Management on New Hardware (DaMoN), 2005

• VLDB 10 years best paper [Bon09]

• MonetDB (around 1996) → MonetDB/X100 → Vectorwise (2008)

• C-Store (2005) → Vertica (spin-off in 2005)

• Architecture shifts [Sto07]

• Asilomar Report [Ber98]

• Claremont Report [Agr09]

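[Background] Processing Flow in a DBMS

[Figure: SQL → Parser → Query → Planner → Plan → Executor, with Catalog, Statistics, and Storage alongside; the legend distinguishes data structures from processing steps]

• Parser: lexical analysis / syntactic analysis

• Planner (plan optimization): preprocessing and cost-model-based plan selection

• Executor (execution engine): data access, filtering, joins

• Catalog (metadata): table information, type information

• Statistics: value distributions

• Storage: the database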

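Topics around Modern Analytical DBs

[Figure: the same SQL → Parser → Query → Planner → Plan → Executor diagram with Catalog, Statistics, and Storage, annotated with the scope of related technologies]

• Metadata: column compression information, column replication information, sort information

• Storage: BAT (DSM), PAX, data morphing, fractured mirrors, Clotho, MV (ROS/WOS)

• Plan optimization: compression cost model, parallel cost model, hierarchical-memory cost model, preprocessing specific to analytical workloads (accounting for vector processing)

• Execution engine: tuples-at-a-time pipeline; compression-aware plan-node (PN) processing; NSM/DSM switching; HW-aware PN processing (sort/join/agg./scan, cache-aware algorithms, parallel/vectorized execution)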

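Column-Stores vs. Row-Stores [Dan08]

• What is the essential difference that causes the performance gap?

• The I/O reduction from a column-oriented layout is not the essential reason; plan-level optimizations such as compression, late materialization, and join optimization are the key to the performance gain

Quoted from Figure 5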

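Storage Layout and Plan Optimization

• Key considerations when adopting a column-oriented (DSM) layout

– Handling the extra join cost that arises

– Position-filter processing in plan nodes (PNs)

[Figure: NSM and DSM layouts]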

Handling the Extra Join Cost

• The join needed to reconstruct tuples

Column A (sorted by value)
pid  Name
1    Alice
5    Bill
2    Bob
4    Jill
3    Steve

Join condition: A.pid = B.pid

Column B (sorted by value)
pid  Age
4    24
5    24
1    28
3    29
2    32

SELECT * FROM xxx;

Table xxx
Name   Age
Alice  28
Bill   24
Bob    32
Jill   24
Steve  29

* pid corresponds to the oid in a MonetDB BAT [Bon09]
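The reconstruction join on this slide can be sketched as follows. This is only a minimal illustration using the slide's data; the names `column_a`, `column_b`, and `reconstruct` are hypothetical, and a real column store would not use Python dictionaries for this.

```python
# Two DSM columns stored as (pid, value) pairs, each sorted by value,
# as in the figure on this slide.
column_a = [(1, "Alice"), (5, "Bill"), (2, "Bob"), (4, "Jill"), (3, "Steve")]
column_b = [(4, 24), (5, 24), (1, 28), (3, 29), (2, 32)]


def reconstruct(col_a, col_b):
    """Rebuild (Name, Age) tuples with a hash join on A.pid = B.pid."""
    age_by_pid = dict(col_b)            # build side: pid -> Age
    return [(name, age_by_pid[pid])     # probe side: scan column A in order
            for pid, name in col_a]


rows = reconstruct(column_a, column_b)
for name, age in rows:
    print(name, age)
```

Even SELECT * needs this join once the columns are stored separately, which is the extra cost the slide refers to.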

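Storage Layout and Plan Optimization

• Techniques for reducing the join cost

– 1. Optimize through a storage layout that preserves the original ordering, as in fractured mirrors [Rav02] or super projections [And12]

– 2. Use a join index [Val89] (figure below), and additionally speed up the join itself

column A (sorted values)   join index   column B (sorted values)
Name                                    Age
Alice                      3            24
Bill                       2            24
Bob                        5            28
Jill                       1            29
Steve                      4            32

(Each join index entry gives the position in column B that matches the column A value on the same row.)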
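A join index lookup over the figure's data can be sketched like this. The function name `reconstruct_with_join_index` and the 1-based position convention are assumptions made for illustration, chosen to match the numbers shown in the figure.

```python
# Column values only, each sorted by value; no pid is stored with them.
names = ["Alice", "Bill", "Bob", "Jill", "Steve"]
ages = [24, 24, 28, 29, 32]

# join_index[i] is the 1-based position in `ages` that belongs to names[i].
join_index = [3, 2, 5, 1, 4]


def reconstruct_with_join_index(col_a, col_b, index):
    """Follow the join index from each column A entry into column B."""
    return [(a, col_b[pos - 1]) for a, pos in zip(col_a, index)]


rows = reconstruct_with_join_index(names, ages, join_index)
print(rows)
```

Each lookup is a random access into column B, which is one reason the slide pairs the join index with further join acceleration.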

Papers on Join Acceleration

• Martina-Cezara Albutiu et al., Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems, VLDB, 2012

• S. Blanas et al., Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs, SIGMOD, 2011

• C. Kim et al., Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs, VLDB, 2009

• Mehul A. Shah et al., Fast Scans and Joins using Flash Drives, DaMoN, 2008

• S. Chen et al., Improving Hash Join Performance through Prefetching, ACM TODS, 2007

• S. Manegold et al., Optimizing Main-Memory Join on Modern Hardware, IEEE TKDE, 2002

• S. Manegold et al., What Happens During a Join? Dissecting CPU and Memory Optimization Effects, VLDB, 2000

• P. A. Boncz et al., Database Architecture Optimized for the New Bottleneck: Memory Access, VLDB, 1999


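Position-Filter Processing in PNs [Mig09]

• Filtering of columns other than the one the select node targets is done using pids*

• Pushing select nodes down turns the bottom of the plan into a directed graph

• The copy node builds a bitmap from the incoming pid set and forwards it to the scan nodes

SELECT * FROM xxx WHERE Age > 25;

Name column (sorted values)
pid  Name
1    Alice
5    Bill
2    Bob
4    Jill
3    Steve

Age column (sorted values)
pid  Age
4    24
5    24
1    28
3    29
2    32

Plan nodes shown in the figure: σ Age > 25, Copy Node, Scan Node (position filtered), join on A.pid = B.pid

* See the C-Store source code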
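The three bullets on this slide can be read as the following pipeline. This is a simplified sketch, not C-Store's actual code; `select_pids`, `copy_to_bitmap`, and `position_filtered_scan` are hypothetical names, and the bitmap is a plain Python list.

```python
name_col = [(1, "Alice"), (5, "Bill"), (2, "Bob"), (4, "Jill"), (3, "Steve")]
age_col = [(4, 24), (5, 24), (1, 28), (3, 29), (2, 32)]


def select_pids(column, predicate):
    """Select node: return the pids whose value satisfies the predicate."""
    return [pid for pid, value in column if predicate(value)]


def copy_to_bitmap(pids, max_pid):
    """Copy node: turn a pid set into a bitmap (index = pid, slot 0 unused)."""
    bitmap = [False] * (max_pid + 1)
    for pid in pids:
        bitmap[pid] = True
    return bitmap


def position_filtered_scan(column, bitmap):
    """Scan node: emit only the entries whose pid is set in the bitmap."""
    return [(pid, value) for pid, value in column if bitmap[pid]]


# SELECT * FROM xxx WHERE Age > 25;
pids = select_pids(age_col, lambda age: age > 25)
bitmap = copy_to_bitmap(pids, max_pid=5)
names = dict(position_filtered_scan(name_col, bitmap))
ages = dict(position_filtered_scan(age_col, bitmap))
result = sorted((names[pid], ages[pid]) for pid in names)
print(result)
```

The predicate is evaluated only on the Age column; the Name column is never compared against anything except the bitmap.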

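Vectorwise & Vertica

Design Overview of Vectorwise and Vertica

• Vectorwise (formerly MonetDB/X100) [Mar12]

Storage layout

– The layout can range from NSM/PAX to DSM/PAX

• It can be specified through DDL, but automatic optimization is also available

– Data compression uses the PFOR family [Mar06]

• CPU-efficient compression algorithms are adopted on the assumption of a disk array (~1 GiB/s)

– Fast updates through positional delta trees [Hem10]

• Data structures for updates and for reads are kept separate and merged later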


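• Vectorwise (formerly MonetDB/X100) [Mar12]

Plan optimization / execution engine

– Adopts the tuples-at-a-time processing model [Bon05]

– Switches between NSM and DSM during plan execution [Min04][Mar08]

• The optimal tuple layout differs from one execution operator to another

– Uses the CPU's vector (SIMD) instructions

• Fast string processing using Intel SSE4.2 [Vec09]

• Select-node push-up for vectorization [Mar08]

– Parallelizes plans with exchange operators [Ani10]

• Volcano-style parallelism

– Speed-up through JIT compilation [Som11j, Som11v] (future work)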


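• Picks from the Vectorwise design 1/3

CPU-efficient compression algorithms [Mar06]

– Vectorwise's basic design aims to maximize performance while balancing I/O transfer rate against CPU processing speed

– It adopts light-weight compression schemes whose decompression speed is GiB/s or higher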

[Figure: recent integer compression methods. Decompression speed (GiB/s, axis 0.0 to 10.0) and compression ratio (axis 0.0% to 30.0%) for delta, varbyte, bintpltv, optp4delta, vseblocks, and vsesimple]

Library used: http://integerencoding.isti.cnr.it/
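To give a feel for why such light-weight schemes decode quickly, here is a simplified sketch in the spirit of the PFOR family (patched frame-of-reference). It is an assumption-laden illustration, not the algorithm as implemented in Vectorwise: the codes are kept in a Python list instead of being bit-packed, exceptions are stored as (position, value) pairs, and the names `pfor_encode` and `pfor_decode` are hypothetical.

```python
def pfor_encode(values, bits):
    """Encode one block: offsets that fit in `bits` bits become codes,
    everything else is stored uncompressed as an exception."""
    base = min(values)
    limit = 1 << bits
    codes, exceptions = [], []
    for pos, v in enumerate(values):
        offset = v - base
        if offset < limit:
            codes.append(offset)
        else:
            codes.append(0)               # placeholder slot
            exceptions.append((pos, v))   # patched in after decoding
    return base, codes, exceptions


def pfor_decode(base, codes, exceptions):
    """Decode in two simple passes: add the base, then patch exceptions."""
    out = [base + c for c in codes]
    for pos, v in exceptions:
        out[pos] = v
    return out


block = [100, 101, 103, 102, 100, 9000, 104, 101]
base, codes, exceptions = pfor_encode(block, bits=3)
decoded = pfor_decode(base, codes, exceptions)
print(base, codes, exceptions)
```

The main decode loop has no data-dependent branch; the rare outliers are handled in a separate short pass, which is the property that keeps decompression cheap on the CPU.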




[Figure: string compression methods (unit: MiB/s). Source: http://code.google.com/p/lz4/]



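Adopting the tuples-at-a-time processing model [Bon05]

– An execution model in which a single next() call in the execution pipeline processes multiple tuples

– CPU efficiency is improved by processing independent operations together

• instruction-level parallelism, SIMD optimization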
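The next()-per-vector idea can be sketched with two toy operators. The classes `Scan` and `Select` and the `vector_size` parameter are hypothetical; Python list comprehensions only stand in for the tight loops a compiled engine would run.

```python
class Scan:
    """Leaf operator: hands out the column in vectors of `vector_size` values."""

    def __init__(self, column, vector_size):
        self.column = column
        self.vector_size = vector_size
        self.pos = 0

    def next(self):
        if self.pos >= len(self.column):
            return None                   # end of stream
        vec = self.column[self.pos:self.pos + self.vector_size]
        self.pos += self.vector_size
        return vec


class Select:
    """Filter operator: one next() call filters a whole vector."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        vec = self.child.next()
        if vec is None:
            return None
        # Loop over independent values; a compiled engine could map this
        # onto SIMD or instruction-level-parallel code.
        return [v for v in vec if self.predicate(v)]


plan = Select(Scan([24, 24, 28, 29, 32], vector_size=2), lambda age: age > 25)
calls, out = 0, []
while True:
    vec = plan.next()
    if vec is None:
        break
    calls += 1
    out.extend(vec)
print(calls, out)
```

With five values and a vector size of 2, the pipeline is driven by three next() calls that return data instead of five, so the per-call overhead is amortized over each vector.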

Quoted from Figure 10



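Select-node push-up for vectorization [Mar08]

– When selectivity is high (most tuples pass the filter), there are cases where prioritizing vectorized processing is better, so the select node is pushed up

Quoted from the paper

When selectivity is high, process with SIMD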
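One possible reading of the push-up, sketched with hypothetical column names (`prices`, `qtys`) and a hypothetical expression `price * qty`: both plans return the same rows, but the second runs the arithmetic as a dense loop over the whole vector and filters afterwards.

```python
def select_then_compute(prices, qtys, threshold):
    """Select below the arithmetic: test each tuple first, then compute."""
    out = []
    for p, q in zip(prices, qtys):
        if q > threshold:                 # data-dependent branch per tuple
            out.append(p * q)
    return out


def compute_then_select(prices, qtys, threshold):
    """Select pushed up: compute on the full vector, then filter."""
    # Dense loop with no branch; this is the shape that suits SIMD
    # in a compiled engine.
    products = [p * q for p, q in zip(prices, qtys)]
    return [x for x, q in zip(products, qtys) if q > threshold]


prices = [10, 20, 30, 40]
qtys = [1, 5, 6, 7]
plan_a = select_then_compute(prices, qtys, threshold=2)
plan_b = compute_then_select(prices, qtys, threshold=2)
print(plan_a, plan_b)
```

The pushed-up plan does some wasted work on tuples that are later discarded, so it only pays off when few tuples are filtered out, which matches the slide's condition.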


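• Vertica (formerly C-Store) [And12]

Storage layout

– Composed of super projections and non-super projections

• A projection is a restricted form of materialized view

• A super projection explicitly preserves all of the original ordering relationships

– Read- and write-optimized stores

• Data structures for updates and for reads are kept separate and merged later

– The plan's scan operator supports six compression types

• run-length encoding, block dictionary, …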
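The split between a write-optimized and a read-optimized store, with a later merge, can be sketched as below. The class `ColumnStore` and its methods are hypothetical and ignore durability, deletes, and compression; the sketch only shows the division of labour.

```python
class ColumnStore:
    """Toy single-column store with separate write and read structures."""

    def __init__(self):
        self.ros = []   # read-optimized store: kept sorted
        self.wos = []   # write-optimized store: append-only, unsorted

    def insert(self, value):
        self.wos.append(value)            # cheap write, no reordering

    def scan(self):
        """Reads must see both stores."""
        return sorted(self.ros + self.wos)

    def merge(self):
        """Move buffered writes into the sorted store in one batch."""
        self.ros = sorted(self.ros + self.wos)
        self.wos = []


store = ColumnStore()
for v in (29, 24, 32):
    store.insert(v)
before = store.scan()
store.merge()
print(before, store.ros, store.wos)
```

Writes stay cheap because they never touch the sorted structure directly; the sorting cost is paid once per batch at merge time.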



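• Vertica (formerly C-Store) [And12]

Plan optimization / execution engine

– Execution engine with multi-threading and pipelining support

– tuples-at-a-time processing model

– Efficient use of compression [Mig09]

• Execution operators that can process compressed data directly

• Cost modeling of reading compressed data

– Distributed plan execution through Send/Recv operators (Volcano-style exchange operators)

Quoted from Figure 3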
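As an illustration of operators that work directly on compressed data, the sketch below aggregates over run-length-encoded runs without expanding them. The function names are hypothetical and the code is not taken from Vertica or C-Store.

```python
def rle_encode(values):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_sum(runs):
    """SUM computed on the runs, without decompressing."""
    return sum(value * length for value, length in runs)


def rle_count_where(runs, predicate):
    """COUNT(*) WHERE predicate(value): one predicate test per run."""
    return sum(length for value, length in runs if predicate(value))


ages = [24, 24, 28, 29, 32]               # a sorted column compresses well
runs = rle_encode(ages)
total = rle_sum(runs)
older = rle_count_where(runs, lambda age: age > 25)
print(runs, total, older)
```

The work is proportional to the number of runs, not the number of rows, which is why sorted columns and run-length encoding pair well.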


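• Picks from the Vertica design 1/2

Composition from super projections and non-super projections

– Early C-Store [Sto05] used a DSM layout organized per projection (a logical group of columns), but the join cost (through join indexes) could not be offset, so the layout was changed in the move to Vertica

Quoted from Table 1 and Example 1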


• Picks from the Vertica design 2/2

Column-DB-related patents held by Vertica

– http://worldwide.espacenet.com/searchResults?compact=false&ST=singleline&query=Vertica&locale=en_EP&DB=worldwide.espacenet.com

– DATABASE DESIGNER, US8290931

– QUERY OPTIMIZER, US2008033914

– MODULAR QUERY OPTIMIZER, US8312027

– DATABASE STORAGE ARCHITECTURE, US2011016157

– QUERY OPTIMIZER WITH SCHEMA CONVERSION, US8086598

– AUTOMATIC VERTICAL-DATABASE DESIGN, US2008040348





C-Store Source Code Reading

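• Currently the only Column-DB source code available to read

– For MonetDB, the X100 part is not public

– http://db.csail.mit.edu/projects/cstore

– I am currently going through the code and would like to report on it if there is a next session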


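In Closing...

• Recommended "must-read" papers

– If you think "Column-DB? That's the thing that serializes data column by column, right?", read this → [Dan08]

– If you want to know the latest Vectorwise/Vertica designs, see [Mar12][And12]

• Following the papers they cite covers most of the related work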
