20150318-sfpug-meetup-pgstrom
PG-Strom Query Acceleration Engine of PostgreSQL
Powered by GPGPU
NEC OSS Promotion Center
The PG-Strom Project
KaiGai Kohei <[email protected]>
Self Introduction
▌Name: KaiGai Kohei
▌Company: NEC
▌Mission: Software architect & Intrapreneur
▌Background:
Linux kernel development (2003~?)
PostgreSQL development (2006~)
SAP alliance (2011~2013)
PG-Strom development & productization (2012~)
▌PG-Strom Project:
In-company startup of NEC
Also, an open source software project
PG-Strom - Query Acceleration Engine of PostgreSQL Powered by GPGPU - P.2
What is PG-Strom
▌An Extension of PostgreSQL
▌Off-loads CPU intensive SQL workloads to GPU processors
▌Major Features
① Automatic and just-in-time GPU code generation from SQL
② Asynchronous and concurrent query executor
[Diagram: PG-Strom query frontend. Labels: SQL command; Query Planner with Custom Planner; Query Executor with Custom Executor; GPU code generated on the fly; asynchronous execution; database]
Concept
▌No Pain: To applications it looks like a traditional PostgreSQL database, so existing tools, drivers, and applications can be used as they are.
▌No Tuning: The massive computing capability of GPGPU removes the need for manual database tuning. It lets engineers focus on the tasks only humans can do.
▌No Complexity: No need to export large data sets from the RDBMS to external tools, because its computing performance is sufficient to run the workloads close to the data.
RDBMS and bottleneck (1/2)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQL Page. 5
[Diagram: three panels of Processor, RAM, Storage, and Data]
Data Size > RAM: Processor, RAM, Storage
Data Size < RAM: Processor, RAM, Storage
In the future?: Processor, Wide Band RAM, Non-volatile RAM
World of the current CPU/memory bottleneck
Join, Aggregation, Sort, Projection, ...
[strategy]
• burstable access pattern
• parallel algorithm
World of the traditional disk I/O bottleneck
SeqScan, IndexScan, ...
[strategy]
• reduction of I/O (size, count)
• distribution of disk (RAID)
RDBMS and bottleneck (2/2)
[Diagram: Processor, RAM, Storage]
Processor <-> RAM bandwidth: hundreds of GB/s
RAM <-> Storage bandwidth: several GB/s
Background (1/4) – Semiconductor Trend
▌Movement to CPU/GPU integrated architecture rather than multicore CPU
▌The free lunch that hardware evolution gave software will end soon
Unless software is designed to utilize GPU capability, it cannot draw out the full capability of the hardware.
SOURCE: THE HEART OF AMD INNOVATION, Lisa Su, at AMD Developer Summit 2013
Background (2/4) – Features of GPU
▌Characteristics
Larger percentage of ALUs on chip
Relatively smaller percentage of cache and control logic
Well suited to simple calculations in parallel, but not to complicated logic
Many more cores for the price
• GTX 750 Ti (640 cores) for $150
                               GPU                  CPU
Model                          Nvidia Tesla K20X    Intel Xeon E5-2670 v3
Architecture                   Kepler               Haswell
Launch                         Nov-2012             Sep-2014
# of transistors               7.1 billion          3.84 billion
# of cores                     2688 (simple)        12 (functional)
Core clock                     732 MHz              2.6 GHz, up to 3.5 GHz
Peak FLOPS (single precision)  3.95 TFLOPS          998.4 GFLOPS (with AVX2)
DRAM size                      6 GB, GDDR5          768 GB/socket, DDR4
Memory bandwidth               250 GB/s             68 GB/s
Power consumption              235 W                135 W
Price                          $3,000               $2,094
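The peak single-precision figures in the comparison can be roughly reproduced with a back-of-the-envelope calculation. The sketch below rests on two assumptions that are not stated on the slide: each GPU core retires one fused multiply-add (2 FLOPs) per cycle, and each CPU core with AVX2 retires two 8-wide fused multiply-adds (32 FLOPs) per cycle at the base clock. The function name `peak_gflops` is only illustrative.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock (GHz) x FLOPs per core per cycle."""
    return cores * clock_ghz * flops_per_cycle


# Core counts and clocks are taken from the comparison table;
# the FLOPs-per-cycle factors are assumptions (see the lead-in).
gpu_peak = peak_gflops(cores=2688, clock_ghz=0.732, flops_per_cycle=2)
cpu_peak = peak_gflops(cores=12, clock_ghz=2.6, flops_per_cycle=32)

print(f"GPU: {gpu_peak / 1000:.2f} TFLOPS")
print(f"CPU: {cpu_peak:.1f} GFLOPS")
```

Under these assumptions the CPU figure matches the 998.4 GFLOPS in the table, and the GPU figure comes out slightly below the quoted 3.95 TFLOPS.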
SOURCE: CUDA C Programming Guide (v6.5)
Background (3/4) – How GPU works
Computing the sum of an array, $\sum_{i=0}^{N-1} item[i]$, with N cores of a GPU
[Diagram: item[0] through item[15] are added pairwise across step.1, step.2, step.3, and step.4]
Total sum of items[] in $\log_2 N$ steps
Inter-core synchronization by hardware support
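A minimal sketch of the pairwise reduction this slide describes, written as sequential Python so the step count is easy to check. On a GPU, each iteration of the inner loop would run on its own core at the same time, with a hardware barrier between steps. The function name `tree_sum` is hypothetical and not part of PG-Strom.

```python
def tree_sum(items):
    """Sum a sequence by pairwise reduction; returns (total, number_of_steps)."""
    buf = list(items)
    n = len(buf)
    stride = 1
    steps = 0
    while stride < n:
        # One reduction step: all additions below are independent of each
        # other, so N cores could perform them simultaneously.
        for i in range(0, n - stride, 2 * stride):
            buf[i] += buf[i + stride]
        # A GPU would synchronize its cores here before the next step.
        stride *= 2
        steps += 1
    return (buf[0] if buf else 0), steps


total, steps = tree_sum(range(16))
print(total, steps)
```

For 16 items the loop finishes in 4 steps, which is the log2(N) behaviour named on the slide. A length that is not a power of two needs the next whole number of steps.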
Background (4/4) – Custom-Plan Interface
SELECT cat, avg(x) FROM t1, t2
WHERE t1.id = t2.id AND y > 100
GROUP BY cat;
[Plan tree: Aggregate (key: cat) on top of Join (t1.id = t2.id) on top of Scan on t1 and Scan on t2]
Join candidates: • Hash Join • Merge Join • Nested Loop • Custom Join (e.g. “GpuHashJoin”)
Scan candidates: • Seq Scan • Index Scan • Index-Only Scan • Tid Scan • Custom Scan (e.g. “BulkLoad” on t1)
Example scan: IndexScan on t1 with y > 100
PG-Strom Features
▌Logics
GpuScan ... Parallel evaluation of scan qualifiers
GpuHashJoin ... Parallel multi-relational join
GpuPreAgg ... Two phase aggregation
GpuSort ... GPU + CPU Hybrid Sorting
GpuNestedLoop (in development)
▌Data Types
Integer, Float, Date/Time, Numeric, Text
▌Functions and Operators
Equality and comparison operators
Arithmetic operators and mathematical functions
Aggregates: count, min/max, sum, avg, std, var, corr, regr
Automatic GPU code generation
postgres=# SET pg_strom.show_device_kernel = on;
postgres=# EXPLAIN VERBOSE SELECT * FROM t0 WHERE sqrt(x+y) < 10;
QUERY PLAN
--------------------------------------------------------------------------------
Custom Scan (GpuScan) on public.t0 (cost=500.00..357569.35 rows=6666683 width=77)
Output: id, cat, aid, bid, cid, did, eid, x, y, z
Device Filter: (sqrt((t0.x + t0.y)) < 10::double precision)
Features: likely-tuple-slot
 Kernel Source: #include "opencl_common.h"
   :
 static pg_bool_t
 gpuscan_qual_eval(__private cl_int *errcode,
                   __global kern_parambuf *kparams,
                   __global kern_data_store *kds,
                   __global kern_data_store *ktoast,
                   size_t kds_index)
 {
   pg_float8_t KPARAM_0 = pg_float8_param(kparams,errcode,0);
   pg_float8_t KVAR_8 = pg_float8_vref(kds,ktoast,errcode,7,kds_index);
   pg_float8_t KVAR_9 = pg_float8_vref(kds,ktoast,errcode,8,kds_index);
   return pgfn_float8lt(errcode,
            pgfn_dsqrt(errcode, pgfn_float8pl(errcode, KVAR_8, KVAR_9)), KPARAM_0);
 }
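To illustrate how a qualifier such as sqrt(x+y) < 10 could be turned into the return expression in the listing above, here is a toy generator. It is a sketch only, not PG-Strom's actual code generator: the tuple-based expression tree, the `DEVICE_FUNCS` table, and the `emit` function are all hypothetical. Only the three `pgfn_*` names and the `KVAR_n` / `KPARAM_n` naming are copied from the listing.

```python
# Hypothetical operator-to-device-function table; the three function names
# are copied from the kernel listing, the table itself is illustrative.
DEVICE_FUNCS = {"<": "pgfn_float8lt", "+": "pgfn_float8pl", "sqrt": "pgfn_dsqrt"}


def emit(expr, columns):
    """Recursively turn an expression tree into a device-code expression."""
    kind = expr[0]
    if kind == "col":
        # Column variables are numbered by 1-based position in the table.
        return f"KVAR_{columns.index(expr[1]) + 1}"
    if kind == "param":
        # Constants are passed to the kernel as numbered parameters.
        return f"KPARAM_{expr[1]}"
    args = ", ".join(emit(arg, columns) for arg in expr[1:])
    return f"{DEVICE_FUNCS[kind]}(errcode, {args})"


columns = ["id", "cat", "aid", "bid", "cid", "did", "eid", "x", "y", "z"]
qual = ("<", ("sqrt", ("+", ("col", "x"), ("col", "y"))), ("param", 0))
print(emit(qual, columns))
```

With the column order shown in the EXPLAIN output, x and y become KVAR_8 and KVAR_9, and the emitted string has the same nesting as the return statement in the listing.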
Implementation (1/3) – GpuScan
[Diagram: the Table is read as an Input Stream of Chunks (16~64MB each). Each chunk goes through DMA Send, execution of the auto-generated GPU code, and DMA Recv; the diagram shows three such Send/Recv pairs. Results leave as the Result Output Stream.]
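The chunk-by-chunk flow can be sketched as follows. This is an analogy rather than PG-Strom's implementation: a thread pool stands in for asynchronous DMA transfers and kernel launches, chunks are counted in rows rather than megabytes, and the names `split_into_chunks`, `run_kernel`, and `gpuscan_like` are hypothetical. The filter reuses the sqrt(x+y) < 10 qualifier from the code-generation slide.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def split_into_chunks(rows, chunk_size):
    """Cut the table into fixed-size chunks (the input stream)."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def run_kernel(chunk):
    """Stand-in for 'DMA send, run GPU code, DMA recv' on one chunk."""
    return [(x, y) for (x, y) in chunk if sqrt(x + y) < 10]


def gpuscan_like(rows, chunk_size=4, max_in_flight=3):
    """Process up to max_in_flight chunks concurrently, emit rows in order."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() submits every chunk up front and yields results in input order.
        for result in pool.map(run_kernel, split_into_chunks(rows, chunk_size)):
            yield from result


rows = [(i, i) for i in range(100)]
matched = list(gpuscan_like(rows, chunk_size=8))
print(len(matched))
```

Because each chunk is filtered independently, the next chunk can be transferred while the previous one is still being evaluated.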
Software Architecture (1/2) – current version
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQL
[Diagram: PostgreSQL side and PG-Strom side, connected by the Custom-Plan APIs]
PostgreSQL:
• SQL Query
• Query Parser: breaks down the query to a parse tree
• Query Optimizer: makes the query execution plan
• Query Executor: runs the query
• Storage Manager, Shared Buffer, Storage
PG-Strom:
• GPU Code Generator
• GpuScan, GpuHashJoin, GpuPreAgg, GpuSort
• GPU Program Manager
• Message Queue
• PG-Strom OpenCL Server
Software Architecture (2/2) – upcoming version
[Diagram: PostgreSQL side and PG-Strom side, connected by the Custom-Plan APIs]
PostgreSQL:
• SQL Query
• Query Parser: breaks down the query to a parse tree
• Query Optimizer: makes the query execution plan
• Query Executor: runs the query
• Storage Manager, Shared Buffer, Storage
PG-Strom:
• GPU Code Generator
• GpuScan, GpuHashJoin, GpuPreAgg, GpuSort
• CUDA controller: just-in-time compile with NVRTC, DMA copy, kernel launch
• DMA Buffer
Implementation (2/3) – GpuHashJoin
PG-Strom Preview Feb-2015 Page. 16
vanilla Hash-Join:
• Inner relation -> Hash Table
• Outer relation -> Hash-Table Search by CPU -> Sequential Materialization by CPU -> Next stage
GpuHashJoin:
• Inner relation -> Hash Table
• Outer relation -> Parallel Hash-Table Search -> Parallel Materialization -> Next stage
• CPU just references materialized results
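A minimal hash-join sketch showing why the probe side parallelizes well. It is plain sequential Python with hypothetical names (`build_hash_table`, `probe`, `hash_join`), and rows are assumed to be dictionaries. The point is that each call to `probe` depends only on its own outer row and the shared read-only hash table, so a GPU could run one probe per thread.

```python
from collections import defaultdict


def build_hash_table(inner_rows, key):
    """Build phase: index the (small) inner relation by join key."""
    table = defaultdict(list)
    for row in inner_rows:
        table[row[key]].append(row)
    return table


def probe(outer_row, table, key):
    """Search the hash table and materialize joined rows for one outer row."""
    return [{**outer_row, **inner} for inner in table.get(outer_row[key], [])]


def hash_join(outer_rows, inner_rows, key):
    table = build_hash_table(inner_rows, key)
    joined = []
    for outer_row in outer_rows:  # independent iterations: conceptually parallel
        joined.extend(probe(outer_row, table, key))
    return joined


t0 = [{"id": 1, "x": 10}, {"id": 2, "x": 20}, {"id": 3, "x": 30}]
t1 = [{"id": 1, "cat": "a"}, {"id": 3, "cat": "b"}]
print(hash_join(t0, t1, "id"))
```

In the vanilla case the loop body runs one row at a time on the CPU. In the GPU case both the search and the materialization inside `probe` would happen in parallel.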
Benchmark result (1/2) – simple tables join
▌Benchmark Query: SELECT * FROM t0 NATURAL JOIN t1 [NATURAL JOIN ....];
▌Environment:
t0 has 100 million rows (13GB), t1-t9 have 40,000 rows each, all data pre-loaded
CPU: Xeon E5-2670v3 (12C, 2.3GHz) x2, RAM: 384GB, GPU: Tesla K20c x1
Simple Tables Join Benchmark: query response time [sec] by number of tables joined
# of tables joined   PG-Strom   PostgreSQL
2                    18.19      64.27
3                    19.45      87.73
4                    21.04      109.73
5                    23.66      132.21
6                    26.69      155.10
7                    37.64      179.62
8                    43.22      207.85
9                    49.57      233.31
10                   56.38      263.51
Implementation (3/3) – GpuPreAgg
[Diagram: Table -> Chunk (16~64MB) -> 1st Stage Reduction -> 2nd Stage Reduction]
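Two-stage aggregation can be sketched for a query like SELECT cat, avg(x) ... GROUP BY cat. The slide only names the two stages, so the details here are assumptions: the first stage reduces each chunk to per-group (count, sum) pairs, and the second stage merges those pairs and computes the final average. The function names are hypothetical.

```python
def first_stage(chunk):
    """Reduce one chunk of (cat, x) rows to {cat: (count, sum)}."""
    partial = {}
    for cat, x in chunk:
        count, total = partial.get(cat, (0, 0.0))
        partial[cat] = (count + 1, total + x)
    return partial


def second_stage(partials):
    """Merge per-chunk partial states and finalize avg(x) per group."""
    merged = {}
    for partial in partials:
        for cat, (count, total) in partial.items():
            m_count, m_total = merged.get(cat, (0, 0.0))
            merged[cat] = (m_count + count, m_total + total)
    return {cat: total / count for cat, (count, total) in merged.items()}


chunks = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0)],
    [("b", 6.0), ("a", 5.0)],
]
print(second_stage(first_stage(chunk) for chunk in chunks))
```

The first stage shrinks each chunk to at most one row per group, so the second stage has far fewer rows to process than the original table.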
Benchmark result (2/2) – Star Schema Model
▌40 typical reporting queries
▌100GB of retail / star-schema data, all pre-loaded
▌Environment
CPU: Xeon E5-2670v3(12C, 2.3GHz) x2, RAM: 384GB, GPU: Tesla K20c x1
[Chart: Typical Reporting Queries on Retail / Star-Schema Data. Query response time [sec], axis from 0 to 2000, for Q.01 through Q.40. Series: PG-Strom, PostgreSQL]
Expected Scenario – Reduction of ETL
▌ETL: its design is a human-centric task
▌Replication: a far more automated task
[Diagram: ERP, CRM, and SCM feed an OLTP database holding the Master / Fact Tables, optimized for transaction workloads]
• Conventional path: ETL into an OLAP database (OLAP Cubes), optimized for analytic workloads and queried by BI
• With PG-Strom: Replication into a Replica of the Master / Fact Tables, sufficient for analytic workloads as well and queried by BI
Development Plan
Current version: PG-Strom β + PostgreSQL v9.5devel
Short term target: PostgreSQL v9.5 timeline (2015)
• Migration of OpenCL to CUDA
• Add support for GpuNestedLoop
• Add support for multi-functional kernels
• Standardization of custom-join interface
Middle term target: PostgreSQL v9.6 timeline (2016)
• Integration with funnel executor
• Investigation of SSD/NVRAM utilization
• Custom-sort/aggregate interface
• Add support for spatial data types (?)
...and more...?
Enhancement Idea (1/3) – GpuNestedLoop
[Diagram: a 2-dimensional grid of GPU threads. One axis (blockDim.x) covers the Nx items of the Outer-Relation (relatively large); the other axis (blockDim.y) covers the Ny items of the Inner-Relation (relatively small). One thread is highlighted at (X=2, Y=3).]
Each GPU thread evaluates a non-equijoin condition in parallel
2-dimensional GPU kernel launch
(Nx×Ny threads at once)
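The 2-dimensional launch can be pictured as a double loop in which every (x, y) pair is one GPU thread. The sketch below is sequential Python, not a GPU kernel; `nested_loop_join` and the example condition are hypothetical and only show that each pair is evaluated independently, which is what lets Nx×Ny threads run at once.

```python
def nested_loop_join(outer, inner, condition):
    """Evaluate an arbitrary join condition on every (outer, inner) pair."""
    matches = []
    for x in range(len(outer)):      # thread index along X (outer relation)
        for y in range(len(inner)):  # thread index along Y (inner relation)
            # On a GPU, the thread at (x, y) would run just this test.
            if condition(outer[x], inner[y]):
                matches.append((outer[x], inner[y]))
    return matches


outer = [1, 5, 9]
inner = [4, 8]
# A non-equijoin condition: outer value strictly less than inner value.
print(nested_loop_join(outer, inner, lambda o, i: o < i))
```

Unlike a hash join, this works for any condition, at the cost of Nx×Ny evaluations.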
Enhancement Idea (2/3) – Multi-functional GPU kernel
current query execution:
• magnetic storage -> Load -> on-memory cache -> Data Chunk
• GpuScan -> GpuHashJoin -> GpuPreAgg -> Final result
• execution on GPU device: GpuScan Kernel, GpuHashJoin Kernel, GpuPreAgg Kernel
• Interaction between CPU and GPU multiple times
execution in newer version:
• magnetic storage -> on-memory cache -> Data Chunk
• GpuMultiOps (GpuScan, GpuHashJoin, GpuPreAgg) -> Final result
• GpuMultiOps Kernel: • Scan • HashJoin • PreAgg
• Interaction between CPU and GPU only once
Enhancement Idea (3/3) – Funnel executor integration
current query execution:
• magnetic storage -> on-memory cache -> Scan -> Join -> Aggregate -> Final result (Query Execution Plan)
• PG-Strom may replace these nodes with their GPU versions, but the host system still runs with a single thread.
execution in PostgreSQL v9.6:
• magnetic storage / SSD device -> on-memory cache
• Three workers, each running Partial Scan -> Join -> Partial Aggregate
• Funnel Executor -> Combined Aggregate -> Final result
• The funnel executor assigns parts of the query execution task to worker processes
• It combines multiple partial aggregates to generate the final result
Let’s try – Deployment on AWS
Search for “strom”!
AWS GPU Instance (g2.2xlarge)
CPU      Xeon E5-2670 (8 vCPU)
RAM      15GB
GPU      NVIDIA GRID K2 (1536 cores)
Storage  60GB of SSD
Price    $0.898/hour, $646.56/mon (*)
(*) Price for an on-demand instance in the Tokyo region as of Nov-2014
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Your involvement is welcome
▌How to be involved?
as a user
as a developer
as a business partner
▌Source code
https://github.com/pg-strom/devel
▌Contact us
e-mail: [email protected]
twitter: @kkaigai
check it out!