20150318-sfpug-meetup-pgstrom

PG-Strom Query Acceleration Engine of PostgreSQL

Powered by GPGPU

NEC OSS Promotion Center

The PG-Strom Project

KaiGai Kohei <[email protected]>

Self Introduction

▌Name: KaiGai Kohei

▌Company: NEC

▌Mission: Software architect & Intrapreneur

▌Background:

Linux kernel development (2003~?)

PostgreSQL development (2006~)

SAP alliance (2011~2013)

PG-Strom development & productization (2012~)

▌PG-Strom Project:

In-company startup of NEC

Also, an open source software project


What is PG-Strom

▌An Extension of PostgreSQL

▌Off-loads CPU intensive SQL workloads to GPU processors

▌Major Features

① Automatic and just-in-time GPU code generation from SQL

② Asynchronous and concurrent query executor


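[Figure: PG-Strom overview. An SQL command enters the Query Frontend. Next to PostgreSQL's Query Planner and Query Executor, PG-Strom provides a Custom Planner and a Custom Executor, with GPU code generated on the fly and asynchronous execution against the database.]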

Concept

▌No Pain: Looks like a traditional PostgreSQL database from the standpoint of applications, so existing tools, drivers, and applications can be used as they are.

▌No Tuning: The massive computing capability of GPGPU removes much of the need for manual database tuning. It lets engineers focus on the tasks only humans can do.

▌No Complexity: No need to export large data from the RDBMS to external tools, because its computing performance is sufficient to run the workloads near the data.


RDBMS and bottleneck (1/2)

DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQL

▌Data Size > RAM: world of the traditional disk-I/O bottleneck

SeqScan, IndexScan, ...

[strategy]

• reduction of I/O (size, count)

• distribution of disk (RAID)

▌Data Size < RAM: world of the current CPU/memory bottleneck

Join, Aggregation, Sort, Projection, ...

[strategy]

• burstable access pattern

• parallel algorithm

▌In the future?

[Figure: Processor, Wide Band RAM, Non-volatile RAM (Data)]

RDBMS and bottleneck (2/2)


[Figure: Processor, RAM, and Storage stacked, with the bandwidth of each link]

Processor to RAM bandwidth: several hundred GB/s

RAM to Storage bandwidth: several GB/s

Background (1/4) – Semiconductor Trend

▌Shift toward CPU/GPU integrated architectures rather than more CPU cores

▌The free lunch that hardware evolution gave software will end soon

Unless software is designed to utilize GPU capability, it cannot draw out the full capability of the hardware.


SOURCE: THE HEART OF AMD INNOVATION, Lisa Su, at AMD Developer Summit 2013

http://www.slideshare.net/DevCentralAMD/keynote-lisa-su

Background (2/4) – Features of GPU

▌Characteristics

Larger percentage of ALUs on chip

Relatively smaller percentage of cache and control logic

Well suited to simple calculations in parallel, but not to complicated logic

Many more cores for the price

• GTX750Ti (640 cores) for $150

                                GPU                  CPU
Model                           Nvidia Tesla K20X    Intel Xeon E5-2670 v3
Architecture                    Kepler               Haswell
Launch                          Nov-2012             Sep-2014
# of transistors                7.1 billion          3.84 billion
# of cores                      2688 (simple)        12 (functional)
Core clock                      732MHz               2.6GHz, up to 3.5GHz
Peak FLOPS (single precision)   3.95TFLOPS           998.4GFLOPS (with AVX2)
DRAM size                       6GB, GDDR5           768GB/socket, DDR4
Memory bandwidth                250GB/s              68GB/s
Power consumption               235W                 135W
Price                           $3,000               $2,094


SOURCE: CUDA C Programming Guide (v6.5)
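
The peak single-precision figures in the comparison can be roughly reproduced from core count and clock. The sketch below assumes 2 FLOPs per GPU core per cycle (one fused multiply-add) and 32 FLOPs per CPU core per cycle (two AVX2 FMA units, 8 single-precision lanes each); these per-cycle factors are assumptions not stated on the slide. Under them the CPU figure matches 998.4 GFLOPS, and the GPU figure lands within about half a percent of the quoted 3.95 TFLOPS.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock x FLOPs issued per core per cycle."""
    return cores * clock_ghz * flops_per_cycle


# Tesla K20X: 2688 cores at 0.732 GHz; assumed 2 FLOPs/cycle (one FMA).
gpu_gflops = peak_gflops(2688, 0.732, 2)

# Xeon from the table: 12 cores at 2.6 GHz; assumed 32 FLOPs/cycle
# (2 FMA units x 8 single-precision lanes x 2 FLOPs per FMA).
cpu_gflops = peak_gflops(12, 2.6, 32)

print(f"GPU: {gpu_gflops / 1000:.2f} TFLOPS")
print(f"CPU: {cpu_gflops:.1f} GFLOPS")
print(f"GPU/CPU ratio: {gpu_gflops / cpu_gflops:.1f}x")
```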

Background (3/4) – How GPU works

Computing the sum of an array, $\sum_{i=0}^{N-1} item[i]$, with N cores of GPU

[Figure: parallel reduction over item[0] … item[15] in four steps (step.1 to step.4); partial sums are combined pairwise at each step.]

Total sum of item[] in log2(N) steps

Inter-core synchronization by HW support
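
The reduction above can be modeled on the CPU. This is a minimal sketch, not PG-Strom code: the inner `for` loop stands in for the GPU threads that would all run at once, and each pass of the `while` loop is one parallel step followed by a barrier.

```python
def tree_reduce_sum(items):
    """Sum a sequence by pairwise addition and count the parallel steps.

    Returns (total, steps). On a GPU every addition inside one pass of the
    while-loop would be executed by a different thread at the same time.
    """
    buf = list(items)
    n = len(buf)
    if n == 0:
        return 0, 0
    stride, steps = 1, 0
    while stride < n:
        # "Threads" at indices 0, 2*stride, 4*stride, ... add their neighbor.
        for i in range(0, n - stride, 2 * stride):
            buf[i] += buf[i + stride]
        stride *= 2
        steps += 1
    return buf[0], steps


total, steps = tree_reduce_sum(range(16))
print(total, steps)
```

With 16 items the loop finishes in 4 passes, matching the four steps in the figure.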

Background (4/4) – Custom-Plan Interface


SELECT cat, avg(x) FROM t1, t2
 WHERE t1.id = t2.id AND y > 100
 GROUP BY cat;

[Figure: plan tree for the query. Aggregate (key: cat) on top of Join (t1.id = t2.id) on top of Scan on t1 and Scan on t2.]

Join candidates:

• Hash Join • Merge Join • Nested Loop • Custom Join (e.g. "GpuHashJoin" on t1.id = t2.id)

Scan candidates:

• Seq Scan • Index Scan • Index-Only Scan • Tid Scan • Custom Scan (e.g. "BulkLoad" on t1 instead of IndexScan on t1 with y > 100)
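
The idea of the custom-plan interface can be sketched as follows: an extension proposes extra candidate paths, and the planner picks whichever candidate is cheapest. All names and cost numbers below (`Path`, `join_path_hooks`, `plan_join`, the costs) are hypothetical and only illustrate the selection logic; they are not PostgreSQL's actual C API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Path:
    name: str
    total_cost: float


# Hooks registered by extensions; each may propose extra paths for a join.
join_path_hooks: List[Callable[[str, str], List[Path]]] = []


def builtin_join_paths(outer: str, inner: str) -> List[Path]:
    # Made-up costs, for illustration only.
    return [
        Path("Hash Join", 120.0),
        Path("Merge Join", 150.0),
        Path("Nested Loop", 900.0),
    ]


def plan_join(outer: str, inner: str) -> Path:
    """Collect built-in and extension-provided paths, then keep the cheapest."""
    candidates = builtin_join_paths(outer, inner)
    for hook in join_path_hooks:
        candidates.extend(hook(outer, inner))
    return min(candidates, key=lambda p: p.total_cost)


def gpu_hash_join_hook(outer: str, inner: str) -> List[Path]:
    # An extension offering its own join implementation with its own estimate.
    return [Path("Custom Join (GpuHashJoin)", 80.0)]
```

Before the hook is appended to `join_path_hooks`, `plan_join("t1", "t2")` returns the built-in hash join; afterwards the custom join wins only because its estimated cost is lower, so the built-in paths remain a fallback.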

PG-Strom Features

▌Logics

GpuScan ... Parallel evaluation of scan qualifiers

GpuHashJoin ... Parallel multi-relational join

GpuPreAgg ... Two phase aggregation

GpuSort ... GPU + CPU Hybrid Sorting

GpuNestedLoop (in development)

▌Data Types

Integer, Float, Date/Time, Numeric, Text

▌Functions and Operators

Equality and comparison operators

Arithmetic operators and mathematical functions

Aggregates: count, min/max, sum, avg, std, var, corr, regr


Automatic GPU code generation


postgres=# SET pg_strom.show_device_kernel = on;
postgres=# EXPLAIN VERBOSE SELECT * FROM t0 WHERE sqrt(x+y) < 10;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Custom Scan (GpuScan) on public.t0  (cost=500.00..357569.35 rows=6666683 width=77)
   Output: id, cat, aid, bid, cid, did, eid, x, y, z
   Device Filter: (sqrt((t0.x + t0.y)) < 10::double precision)
   Features: likely-tuple-slot
   Kernel Source: #include "opencl_common.h"
     :
   static pg_bool_t
   gpuscan_qual_eval(__private cl_int *errcode,
                     __global kern_parambuf *kparams,
                     __global kern_data_store *kds,
                     __global kern_data_store *ktoast,
                     size_t kds_index)
   {
     pg_float8_t KPARAM_0 = pg_float8_param(kparams,errcode,0);
     pg_float8_t KVAR_8 = pg_float8_vref(kds,ktoast,errcode,7,kds_index);
     pg_float8_t KVAR_9 = pg_float8_vref(kds,ktoast,errcode,8,kds_index);

     return pgfn_float8lt(errcode,
              pgfn_dsqrt(errcode, pgfn_float8pl(errcode, KVAR_8, KVAR_9)), KPARAM_0);
   }
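
How an expression such as `sqrt(x+y) < 10` could turn into the nested device-function calls in the kernel source above can be sketched with a small recursive emitter. This is a toy model, not PG-Strom's generator: the tuple-based expression tree and the `DEVICE_FUNCS` table are assumptions, and only the `pgfn_*` names are taken from the output shown.

```python
# Hypothetical mapping from float8 operators/functions to the device
# function names that appear in the kernel source above.
DEVICE_FUNCS = {
    "<": "pgfn_float8lt",
    "+": "pgfn_float8pl",
    "sqrt": "pgfn_dsqrt",
}


def gen_expr(node):
    """Emit a C expression for an expression tree.

    A node is either a string (an already-bound kernel variable such as
    "KVAR_8") or a tuple (operator, arg, ...).
    """
    if isinstance(node, str):
        return node
    op, *args = node
    rendered = ", ".join(gen_expr(arg) for arg in args)
    return f"{DEVICE_FUNCS[op]}(errcode, {rendered})"


# sqrt(x + y) < 10, with x and y bound to KVAR_8 and KVAR_9 and the
# constant 10 passed as kernel parameter KPARAM_0.
qual = ("<", ("sqrt", ("+", "KVAR_8", "KVAR_9")), "KPARAM_0")
print("return " + gen_expr(qual) + ";")
```

The printed statement has the same call nesting as the `return` in `gpuscan_qual_eval`.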

Implementation (1/3) – GpuScan


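[Figure: GpuScan pipeline. On the PostgreSQL side, the table is read as an input stream of chunks (16~64MB). On the PG-Strom side, each chunk goes through DMA Send, execution of the auto-generated GPU code, and DMA Recv, with several chunks in flight at once. The results form the output stream.]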
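
The chunked, asynchronous flow can be imitated on the CPU with a thread pool. This is only a sketch under assumptions: `process_chunk` is a stand-in for DMA send, kernel execution, and DMA receive of one chunk, and the chunk size is a row count rather than 16~64MB.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def make_chunks(rows, chunk_size):
    """Split the table into fixed-size chunks."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def process_chunk(chunk):
    """Stand-in for DMA send -> kernel execution -> DMA receive.

    The filter mirrors the earlier example qualifier: sqrt(x + y) < 10.
    """
    return [(x, y) for (x, y) in chunk if math.sqrt(x + y) < 10]


def gpuscan_like(rows, chunk_size=4, workers=3):
    chunks = make_chunks(rows, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # All chunks are submitted up front, so several are in flight at
        # once; map() still yields the results in input order.
        results = pool.map(process_chunk, chunks)
        return [row for part in results for row in part]


rows = [(i, i) for i in range(100)]
print(len(gpuscan_like(rows)))
```

Because each chunk is independent, the scan can keep reading the next chunk while earlier ones are still being processed.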

Software Architecture (1/2) – current version


[Figure: software architecture, current version]

PostgreSQL:

• SQL Query

• Query Parser: breaks down the query into a parse tree

• Query Optimizer: makes the query execution plan

• Query Executor: runs the query

• Storage Manager, Shared Buffer, Storage

• Custom-Plan APIs

PG-Strom:

• GPU Code Generator

• GpuScan, GpuHashJoin, GpuPreAgg, GpuSort

• GPU Program Manager

• Message Queue

• PG-Strom OpenCL Server

Software Architecture (2/2) – upcoming version


[Figure: software architecture, upcoming version]

PostgreSQL:

• SQL Query

• Query Parser: breaks down the query into a parse tree

• Query Optimizer: makes the query execution plan

• Query Executor: runs the query

• Storage Manager, Shared Buffer, Storage

• Custom-Plan APIs

PG-Strom:

• GPU Code Generator, with just-in-time compile by NVRTC

• GpuScan, GpuHashJoin, GpuPreAgg, GpuSort

• CUDA controller

• DMA Buffer: DMA copy and kernel launch

Implementation (2/3) – GpuHashJoin

PG-Strom Preview Feb-2015

▌vanilla Hash-Join

• Inner relation is loaded into a hash table; outer relation is scanned

• Hash-table search by CPU

• Sequential materialization by CPU

• Results go to the next stage

▌GpuHashJoin

• Inner relation is loaded into a hash table; outer relation is scanned

• Parallel hash-table search

• Parallel materialization

• CPU just references the materialized results and passes them to the next stage
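
The two phases compared above, building a hash table from the inner relation and then probing it with outer rows, look roughly like this. It is a CPU-only sketch with hypothetical function names; the point is that each outer row is probed and materialized independently, which is what allows one GPU thread per row.

```python
from collections import defaultdict


def build_hash_table(inner_rows, key):
    """Build phase: index the (small) inner relation by join key."""
    table = defaultdict(list)
    for row in inner_rows:
        table[row[key]].append(row)
    return table


def probe_chunk(hash_table, outer_chunk, key):
    """Probe and materialize joined rows for one chunk of the outer relation.

    A plain loop stands in for the per-row GPU threads.
    """
    out = []
    for outer_row in outer_chunk:
        for inner_row in hash_table.get(outer_row[key], ()):
            out.append({**inner_row, **outer_row})
    return out


def hash_join(inner_rows, outer_rows, key, chunk_size=2):
    table = build_hash_table(inner_rows, key)
    joined = []
    for start in range(0, len(outer_rows), chunk_size):
        chunk = outer_rows[start:start + chunk_size]
        joined.extend(probe_chunk(table, chunk, key))
    return joined
```

For example, joining `[{"id": 1, "cat": "a"}]` as the inner side with `[{"id": 1, "x": 10}, {"id": 3, "x": 30}]` as the outer side on `"id"` keeps only the first outer row, extended with `cat`.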

Benchmark result (1/2) – simple tables join

▌Benchmark Query: SELECT * FROM t0 NATURAL JOIN t1 [NATURAL JOIN ....];

▌Environment:

t0 has 100 million rows (13GB); t1-t9 have 40,000 rows each; all data pre-loaded

CPU: Xeon E5-2670v3 (12C, 2.3GHz) x2, RAM: 384GB, GPU: Tesla K20c x1


[Chart: Simple Tables Join Benchmark; query response time [sec] by number of tables joined]

number of tables joined   PG-Strom [sec]   PostgreSQL [sec]
 2                         18.19            64.27
 3                         19.45            87.73
 4                         21.04           109.73
 5                         23.66           132.21
 6                         26.69           155.10
 7                         37.64           179.62
 8                         43.22           207.85
 9                         49.57           233.31
10                         56.38           263.51
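
The chart values can be turned into speedup ratios. The sketch assumes that the first nine values of the chart belong to PG-Strom and the remaining nine to PostgreSQL, following the legend order; the slide text itself does not label the series.

```python
tables = [2, 3, 4, 5, 6, 7, 8, 9, 10]

# Response times in seconds, read from the chart (series order assumed).
pg_strom = [18.19, 19.45, 21.04, 23.66, 26.69, 37.64, 43.22, 49.57, 56.38]
postgresql = [64.27, 87.73, 109.73, 132.21, 155.10, 179.62, 207.85, 233.31, 263.51]

speedups = [slow / fast for fast, slow in zip(pg_strom, postgresql)]

for n, ratio in zip(tables, speedups):
    print(f"{n:2d} tables joined: {ratio:.2f}x")
```

Under that assumption the ratio is about 3.5x for two tables and about 4.7x for ten.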

Implementation (3/3) – GpuPreAgg


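[Figure: GpuPreAgg. The table is read in chunks (16~64MB); each chunk goes through a 1st stage reduction, and the partial results are merged in a 2nd stage reduction.]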
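
Two-phase aggregation can be sketched for `avg(x) ... GROUP BY cat`: each chunk is first reduced to a small partial state, and the partial states are then merged. Keeping `(count, sum)` per group is an assumption made for this sketch; the slides do not state which state GpuPreAgg keeps.

```python
from collections import defaultdict


def partial_agg(chunk):
    """1st stage: reduce one chunk to {cat: (count, sum)}."""
    state = defaultdict(lambda: (0, 0.0))
    for cat, x in chunk:
        n, s = state[cat]
        state[cat] = (n + 1, s + x)
    return dict(state)


def final_agg(partials):
    """2nd stage: merge the partial states and finish avg(x) per group."""
    merged = defaultdict(lambda: (0, 0.0))
    for part in partials:
        for cat, (n, s) in part.items():
            mn, ms = merged[cat]
            merged[cat] = (mn + n, ms + s)
    return {cat: s / n for cat, (n, s) in merged.items()}


def avg_by_cat(rows, chunk_size=3):
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    return final_agg(partial_agg(chunk) for chunk in chunks)
```

The average cannot be merged directly from per-chunk averages, which is why the first stage carries count and sum separately.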

Benchmark result (2/2) – Star Schema Model

▌40 typical reporting queries

▌100GB of retail / star-schema data, all pre-loaded

▌Environment

CPU: Xeon E5-2670v3(12C, 2.3GHz) x2, RAM: 384GB, GPU: Tesla K20c x1


[Chart: Typical Reporting Queries on Retail / Star-Schema Data. Query response time [sec], axis from 0 to 2000, for queries Q.01 to Q.40; series: PG-Strom and PostgreSQL.]

Expected Scenario – Reduction of ETL

▌ETL – its design is a human-centric task

▌Replication – a largely automated task


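[Figure: ERP, CRM, and SCM applications write to an OLTP database that is optimized for transaction workloads. In the conventional setup, ETL loads Master / Fact Tables and OLAP Cubes into an OLAP database that is optimized for analytic workloads and serves BI. With PG-Strom, replication maintains a replica of the Master / Fact Tables that is sufficient for analytic workloads as well and serves BI directly.]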

Direction of PG-Strom


Development Plan

Current version: PG-Strom β + PostgreSQL v9.5devel

Short term target: PostgreSQL v9.5 timeline (2015)

• Migration of OpenCL to CUDA

• Add support of GpuNestedLoop

• Add support of multi-functional kernel

• Standardization of custom-join interface

• ...and more...?

Middle term target: PostgreSQL v9.6 timeline (2016)

• Integration with funnel executor

• Investigation of SSD/NvRAM utilization

• Custom-sort/aggregate interface

• Add support for spatial data types (?)

Enhancement Idea (1/3) – GpuNestedLoop


[Figure: 2-dimensional grid of GPU threads (blockDim.x by blockDim.y) laid over the Nx items of the outer relation (Nx: relatively large) and the Ny items of the inner relation (Ny: relatively small); one example thread is marked as Thread (X=2, Y=3).]

Each GPU thread evaluates the non-equi join condition in parallel

2-dimensional GPU kernel launch (Nx×Ny threads at once)
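
The 2-dimensional launch can be pictured as evaluating the join condition once per (x, y) cell. In this CPU sketch each cell plays the role of one GPU thread; the range-containment condition is a hypothetical example of a non-equi join, not taken from the slides.

```python
from itertools import product


def nested_loop_join(outer, inner, cond):
    """Evaluate cond(o, i) for every (x, y) cell of an Nx-by-Ny grid.

    On a device all cells would be evaluated at once rather than in this
    sequential loop.
    """
    nx, ny = len(outer), len(inner)
    return [
        (outer[x], inner[y])
        for x, y in product(range(nx), range(ny))
        if cond(outer[x], inner[y])
    ]


def in_range(value, bounds):
    # Hypothetical non-equi condition: lo <= value < hi.
    lo, hi = bounds
    return lo <= value < hi


print(nested_loop_join([5, 15, 25, 35], [(0, 10), (10, 30)], in_range))
```

Since no hash key exists for such a condition, every pair has to be checked, and the work grows with Nx×Ny.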

Enhancement Idea (2/3) – Multi-functional GPU kernel


▌Current query execution

[Figure: a data chunk is loaded from magnetic storage or the on-memory cache, then the GpuScan kernel, GpuHashJoin kernel, and GpuPreAgg kernel are executed on the GPU device one after another to produce the final result.]

Interaction between CPU and GPU happens multiple times

▌Execution in newer version

[Figure: GpuScan, GpuHashJoin, and GpuPreAgg are combined into GpuMultiOps; a single GpuMultiOps kernel (Scan, HashJoin, PreAgg) processes each data chunk to produce the final result.]

Interaction between CPU and GPU happens only once
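
The difference between launching one kernel per operation and one combined kernel can be sketched by counting host-device round trips. Everything here is a hypothetical model: the three stage functions only imitate scan, join, and pre-aggregation on a list of dicts, and `CATS` stands in for the inner relation.

```python
CATS = {1: "a", 2: "b"}  # hypothetical inner relation: id -> cat


def scan(rows):
    # Filter stage: keep rows with y > 100.
    return [r for r in rows if r["y"] > 100]


def hash_join(rows):
    # Join stage: attach cat, dropping rows without a match.
    return [{**r, "cat": CATS[r["id"]]} for r in rows if r["id"] in CATS]


def pre_agg(rows):
    # Aggregation stage: sum(x) per cat.
    totals = {}
    for r in rows:
        totals[r["cat"]] = totals.get(r["cat"], 0) + r["x"]
    return totals


def run_separately(chunk, stages):
    """One launch per stage: the intermediate result returns to the host each time."""
    data, round_trips = chunk, 0
    for stage in stages:
        data = stage(data)  # send -> kernel -> receive
        round_trips += 1
    return data, round_trips


def run_fused(chunk, stages):
    """All stages composed into one function, standing in for a single kernel."""
    data = chunk
    for stage in stages:
        data = stage(data)
    return data, 1
```

Both variants return the same result for a chunk; only the number of round trips differs (three versus one for the stages above).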

Enhancement Idea (3/3) – Funnel executor integration


▌Current query execution

[Figure: following the query execution plan, data from magnetic storage and the on-memory cache passes through Scan, Join, and Aggregate to produce the final result.]

PG-Strom may replace these nodes with GPU versions, but the host system runs with a single thread.

▌Execution in PostgreSQL v9.6

[Figure: under the Funnel Executor, several workers each run Partial Scan, Join, and Partial Aggregate over magnetic storage, SSD device, and on-memory cache; a Combined Aggregate produces the final result.]

The funnel executor assigns a part of the query execution task to worker processes.

The combined aggregate merges multiple partial aggregates to generate the final result.

Let’s try – Deployment on AWS


Search for "strom"!

AWS GPU Instance (g2.2xlarge)

CPU      Xeon E5-2670 (8 vCPU)
RAM      15GB
GPU      NVIDIA GRID K2 (1536 cores)
Storage  60GB of SSD
Price    $0.898/hour, $646.56/mon (*)

(*) Price for an on-demand instance in the Tokyo region as of Nov-2014

The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL

Your involvement is welcome

▌How to be involved?

as a user

as a developer

as a business partner

▌Source code

https://github.com/pg-strom/devel

▌Contact us

e-mail: [email protected]

twitter: @kkaigai


check it out!
