Low-Latency Transaction Execution on Graphics Processors: Dream or Reality?

Iya Arefyeva, Gabriel Campero Durand, Marcus Pinnecke, David Broneske, Gunter Saake

Workgroup Databases and Software Engineering, University of Magdeburg


Page 1:

Low-Latency Transaction Execution on Graphics Processors: Dream or Reality?

Iya Arefyeva, Gabriel Campero Durand, Marcus Pinnecke, David Broneske, Gunter Saake

Workgroup Databases and Software Engineering, University of Magdeburg

Page 2:

Motivation: Context

GPGPUs are becoming essential for accelerating computation

- 3 of the top 5 systems on the TOP500 list (June 2018) are powered by GPUs
- 56% of the flops on the list come from GPU acceleration [10]

Summit Supercomputer, Oak Ridge

Page 3:

Motivation: Context

GPUs are also important for accelerating database workloads:

Online analytical processing (OLAP):

- few long-running tasks performed on big chunks of data
- easy to exploit data parallelism → good for GPUs

GPU-accelerated systems for OLAP: GDB [1], HyPE [2], CoGaDB [3], Ocelot [4], Caldera [5], MapD [8]

Online transaction processing (OLTP):

- thousands of short-lived transactions within a short period of time
- data should be processed as soon as possible due to user interaction
- comparably less studied on GPUs

GPU-accelerated systems for OLTP: GPUTx [6]

Page 4:

Motivation: Context

Hybrid transactional and analytical processing (HTAP):

- real-time analytics on data that is ingested and modified in the transactional database engine

- challenging due to conflicting requirements in workloads

GPU accelerated systems for HTAP: Caldera [5]*

*However, in Caldera GPUs don't process OLTP workloads → possible underutilization

Caldera Architecture [5]

Page 5:

Motivation: GPUs for OLTP

Intrinsic GPU challenges:

1. SPMD processing
2. Coalesced memory access
3. Branch divergence overheads (illustrated in the sketch below)
4. Communication bottleneck: data needs to be transferred from RAM to the GPU and back over the PCIe bus
5. Bandwidth bottleneck: the bandwidth of the PCIe bus is lower than the bandwidth of a GPU
6. Limited memory

SM structure of Nvidia's Pascal GP100 [9]
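To make challenge 3 concrete, below is an illustrative OpenCL kernel (embedded as a C++ source string, the way an OpenCL host program would carry it; not code from the paper) in which neighbouring work-items branch on their data, so the diverging paths are serialized within a SIMD group (warp/wavefront):

// Illustrative only, not code from the paper. Mixing operation types in one
// kernel launch makes neighbouring work-items take different branches;
// work-items that diverge within the same SIMD group execute both paths
// one after the other.
static const char* kDivergentKernelSrc = R"CLC(
__kernel void process_ops(__global const int   *op_type,   // 0 = read, 1 = write
                          __global const ulong *keys,
                          __global long        *table,     // GPU-resident values
                          __global long        *values)    // in: new value, out: read result
{
    size_t i = get_global_id(0);            // one work-item per operation
    if (op_type[i] == 0)
        values[i] = table[keys[i]];         // read path
    else
        table[keys[i]] = values[i];         // write path
}
)CLC";

Collecting reads and writes into separate batches, as the prototype described later does, keeps the operation type uniform within a batch and sidesteps this particular source of divergence.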

Page 6:

Motivation: GPUs for OLTP

OLTP Challenges:

- Managing isolation and consistency with massive parallelism
- Previous research (GPUTx [6]) proposed a bulk execution model and K-set transaction handling

Experiments with GPUTx [6]

Page 7:

Our contributions

In this early work, we

1. Evaluate a simplified version of the K-set execution model from GPUTx, assuming single-key operations and massive point queries.

2. Test on a CRUD benchmark, reporting the impact of batch sizes and bounded staleness.

3. Suggest 2 possible characteristics that could aid in the adoption of GPUs for OLTP, as we seek to adopt them in the design of a GPU OLTP query processor.

Page 8:

Prototype Design

Page 9:

Implementation

- Storage engine is implemented in C++
- OpenCL for GPU programming (an illustrative kernel sketch follows the figure below)
- The table is stored on the GPU (when the GPU is used); only the necessary data is transferred
- Client requests are handled in a single thread
- In order to support operator-based K-sets, several cases need to be considered; these cases determine our transaction manager (the cases are shown on the following slides, and a consolidated sketch follows Case 4)

[Diagram: clients 1..N send requests to the server; requests go through batch collection, then batch processing, then replying to clients.]
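As a concrete illustration of "only the necessary data is transferred": for an update batch, only the batched keys and new values are copied to the device, while the table itself stays resident in GPU memory. A minimal kernel sketch is shown below, assuming keys have already been mapped to record positions on the host and the table is kept as one contiguous array per field; names are illustrative, not the authors' code.

// Hypothetical sketch of the GPU side of batch processing for single-field
// updates. The host copies only `keys` and `new_values` to the device and
// enqueues the kernel with global size = batch size; `field` is the
// GPU-resident column being updated.
static const char* kBatchUpdateKernelSrc = R"CLC(
__kernel void batch_update(__global uchar       *field,       // one column, GPU-resident
                           __global const ulong *keys,        // record positions of the updates
                           __global const uchar *new_values,  // one value per update, packed
                           const uint            field_size)  // bytes per field (100 here)
{
    size_t i = get_global_id(0);                      // one work-item per batched update
    __global uchar       *dst = field      + keys[i] * field_size;
    __global const uchar *src = new_values + i       * field_size;
    for (uint b = 0; b < field_size; ++b)             // copy the new field value into place
        dst[b] = src[b];
}
)CLC";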

Page 10:

Implementation

[Diagram: a collected batch for writes (writes 1-7 on keys 22, 4, 19, 8, 10, 1, 56); a new request (write 8, key 5) arrives.]

Case 0: If a batch is not completely filled, the server waits for K seconds after receiving the last request and then executes everything (K = 0.1 in our experiments).

Case 1: Reads or independent writes are simply appended to the corresponding batch.

Page 11:

Implementation

[Diagram: the new request (write 8, key 5) is independent, so it is appended to the collected batch for writes.]

Case 0: If a batch is not completely filled, the server waits for K seconds after receiving the last request and then executes everything (K = 0.1 in our experiments).

Case 1: Reads or independent writes are simply appended to the corresponding batch.

Page 12:

Implementation

[Diagram: the collected batch for writes is now full, so batch processing is triggered.]

Case 0: If a batch is not completely filled, the server waits for K seconds after receiving the last request and then executes everything (K = 0.1 in our experiments).

Case 1: Reads or independent writes are simply appended to the corresponding batch.

Page 13:

Implementation

Case 2: Write after Write

[Diagram: a new request (write 8, key 4) arrives while the collected batch for writes (writes 1-6 on keys 22, 4, 19, 8, 10, 1) already contains a write to key 4 → conflict.]

Page 14:

Implementation

Case 2: Write after Write

[Diagram: the conflict triggers a flush — the collected batch for writes is processed (batch processing) before the new request is accepted.]

Page 15:

Implementation

Case 2: Write after Write

[Diagram: after the flush, write 8 (key 4) starts a new collected batch for writes.]

Page 16:

Implementation

Case 3: Read after Write

[Diagram: a new request (read 5, key 4) conflicts with a write to key 4 in the collected batch for writes → the writes are flushed (batch processing), and the read is appended to the collected batch for reads (reads 1-4 on keys 7, 13, 32, 25).]

Page 17:

Implementation

Case 4: Write after Read

[Diagram: a new request (write 5, key 4) conflicts with a read of key 4 in the collected batch for reads → the reads are flushed (batch processing), and the write is appended to the collected batch for writes.]
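Putting Cases 0-4 together, the transaction manager reduces to a small dispatch loop. The sketch below is illustrative only: Request, Batch, receive_request and execute_and_reply are assumed names, not the authors' code; execute_and_reply stands for running the whole collected batch on the GPU (or CPU), replying to the waiting clients, and clearing the batch.

// Illustrative sketch only: Request, Batch, receive_request and execute_and_reply
// are assumed names, not the authors' API.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

enum class OpType { Read, Write };

struct Request {
    OpType        type;
    std::uint64_t key;
    // value payload and client handle for replying omitted
};

struct Batch {
    std::vector<Request> ops;
    bool contains_key(std::uint64_t key) const {
        for (const auto& r : ops)
            if (r.key == key) return true;
        return false;
    }
};

// Assumed to return std::nullopt once no request has arrived for `timeout`
// seconds (K = 0.1 s in the experiments).
std::optional<Request> receive_request(std::chrono::duration<double> timeout);

// Runs the collected batch on the GPU (OpenCL kernel) or CPU, replies to the
// waiting clients, and clears the batch.
void execute_and_reply(Batch& b) {
    // ... launch the kernel / CPU path over b.ops, send replies ...
    b.ops.clear();
}

void server_loop(std::size_t max_batch, std::chrono::duration<double> timeout) {
    Batch writes, reads;
    for (;;) {
        std::optional<Request> req = receive_request(timeout);
        if (!req) {                                   // Case 0: timeout expired, execute everything collected
            execute_and_reply(writes);
            execute_and_reply(reads);
            continue;
        }
        if (req->type == OpType::Write) {
            if (writes.contains_key(req->key)) execute_and_reply(writes);  // Case 2: write after write
            if (reads.contains_key(req->key))  execute_and_reply(reads);   // Case 4: write after read
            writes.ops.push_back(*req);                                    // Case 1: independent write
            if (writes.ops.size() == max_batch) execute_and_reply(writes); // batch full
        } else {
            if (writes.contains_key(req->key)) execute_and_reply(writes);  // Case 3: read after write
            reads.ops.push_back(*req);                                     // reads never conflict with reads
            if (reads.ops.size() == max_batch) execute_and_reply(reads);   // batch full
        }
    }
}

A single conflicting key forces an early flush, which is why the mixed-workload results later discuss how batch size interacts with conflicting operations.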

Page 18:

Evaluation

Page 19:

YCSB (Yahoo! Cloud Serving Benchmark)

YCSB client architecture [7]

Page 20:

Workloads

Read-only workload R: 100k read operations; all fields of a tuple are read; Zipfian distribution of requests.

Write-only workload W: 1 million update operations; only one field is updated; Zipfian distribution of requests.

Mixed workload M: 100k read/update operations (50% reads and 50% updates); 80% of operations access the last entries (20% of tuples).

Setup: 10k records in the table; each tuple consists of 10 fields (100 bytes each), key length is 24 bytes.

- CPU: Intel Xeon E5-2630
- GPU: Nvidia Tesla K40c
- OpenCL 1.2
- CentOS 7.1 (kernel version 3.10.0)

Goal for R and W: evaluating performance on independent reads or writes to find the impact of batch size.

Goal for M: what is the impact of concurrency control? Do stale reads improve performance?
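For illustration, a simple inverse-CDF Zipfian key generator of the kind workloads R and W imply (the skew constant 0.99 is the usual YCSB default; the authors' actual generator is not shown in the slides, so treat this purely as a sketch):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Illustrative only: precomputes the Zipfian CDF over n keys and samples by
// inverse-CDF lookup; fine for n = 10k records as in the setup above.
class ZipfianGenerator {
public:
    explicit ZipfianGenerator(std::size_t n, double theta = 0.99) : cdf_(n) {
        double norm = 0.0;
        for (std::size_t i = 0; i < n; ++i) norm += 1.0 / std::pow(i + 1.0, theta);
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            acc += (1.0 / std::pow(i + 1.0, theta)) / norm;
            cdf_[i] = acc;                       // cumulative probability of key i
        }
    }

    // Draws a key in [0, n); low-numbered keys are drawn far more often.
    std::size_t next(std::mt19937& rng) const {
        double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
        auto idx = static_cast<std::size_t>(
            std::lower_bound(cdf_.begin(), cdf_.end(), u) - cdf_.begin());
        return std::min(idx, cdf_.size() - 1);   // guard against rounding at the tail
    }

private:
    std::vector<double> cdf_;
};

Workload M's skew (80% of operations on the last 20% of tuples) would need a different generator biased toward recently inserted keys.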

Page 21:

Evaluation (workload R - read only)

- CPU & row store provides the best performance
- Small batches reduce collection time
- Very small batches for GPUs are not efficient
- Execution is faster with bigger batches
- However, this does not compensate for the slow response time

Page 22:

Evaluation (workload W - update only)

- CPU & row store provides the best performance
- Small batches reduce collection time
- Very small batches for GPUs are not efficient
- Execution is faster with bigger batches
- However, this does not compensate for the slow response time

Page 23:

Evaluation (workload M - read/update, CPU)

- Concurrency control is beneficial for the CPU: smaller batches → clients get replies quicker
- Allowing stale reads (0.01 s) improves the performance for the CPU due to shorter waiting time before execution
- Big batches are better because of the reduced waiting time in case of conflicting operations: with big batches, more operations are executed & the server waits less often

[Plots: results with concurrency control (CC) and without CC]

Page 24:

Evaluation (workload M - read/update, GPU)

- Concurrency control is not beneficial for the GPU: smaller batches → the GPU is not utilized efficiently
- Allowing stale reads improves the performance for the GPU & column store due to shorter waiting time before execution
- Big batches are better because of the reduced waiting time in case of conflicting operations: more operations are executed → the server waits less often

[Plots: results with concurrency control (CC) and without CC]

Page 25:

Conclusions and Future Work

Page 26:

Discussion & Conclusion

The GPU batch size conundrum for OLTP:

Case 1: small batches are processed

- clients get replies quicker
- GPUs are not utilized efficiently due to the small number of data elements

(this could be improved by splitting requests into fine-grained operations)

Case 2: big batches are processed

- many data elements are beneficial for GPUs
- but it takes long to collect batches and throughput can be decreased

(this gets faster with higher arrival rates)

+ Other considerations: transfer overhead in case the table is not stored on the GPU

Page 27:

Future Work

+ More complex transactions and support for rollbacks

+ Concepts for recovery and logging

+ Comparison with state-of-the-art systems

Page 28:

Future Work

+ More complex transactions and support for rollbacks

+ Concepts for recovery and logging

+ Comparison with state-of-the-art systems

Thank you!

Questions?

Page 29:

References

1. He, B., Lu, M., Yang, K., Fang, R., Govindaraju, N.K., Luo, Q. and Sander, P.V., 2009. Relational query coprocessing on graphics processors. ACM Transactions on Database Systems (TODS), 34(4), p.21.

2. Breß, S. and Saake, G., 2013. Why it is time for a HyPE: A hybrid query processing engine for efficient GPU coprocessing in DBMS. Proceedings of the VLDB Endowment, 6(12), pp.1398-1403.

3. Breß, S., 2014. The design and implementation of CoGaDB: A column-oriented GPU-accelerated DBMS. Datenbank-Spektrum, 14(3), pp.199-209.

4. Heimel, M., Saecker, M., Pirk, H., Manegold, S. and Markl, V., 2013. Hardware-oblivious parallelism for in-memory column-stores. Proceedings of the VLDB Endowment, 6(9), pp.709-720.

5. Appuswamy, R., Karpathiotakis, M., Porobic, D. and Ailamaki, A., 2017. The Case For Heterogeneous HTAP. In 8th Biennial Conference on Innovative Data Systems Research (No. EPFL-CONF-224447).

6. He, B. and Yu, J.X., 2011. High-throughput transaction executions on graphics processors. Proceedings of the VLDB Endowment, 4(5), pp.314-325.

7. Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R. and Sears, R., 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pp. 143-154.

8. MapD Product Website: https://www.mapd.com/

9. Soyata, T., 2018. GPU Parallel Program Development Using CUDA. Chapman and Hall/CRC.

Page 30:

References

10. TOP500 news: https://www.top500.org/news/new-gpu-accelerated-supercomputers-change-the-balance-of-power-on-the-top500/

11. Appuswamy, R., Anadiotis, A.C., Porobic, D., Iman, M.K. and Ailamaki, A., 2017. Analyzing the impact of system architecture on the scalability of OLTP engines for high-contention workloads. Proceedings of the VLDB Endowment, 11(2), pp.121-134.

Page 31:

Extra Slides: Isolation level

+ By assuming single-operation transactions and no consistency checks, our results avoid divergence and report on larger batches.

+ We work at the single-key CRUD level, not the SQL level.

+ All anomalies are preventable through K-sets and managing the dependency graph, so this more general approach supports complete serializability.

Source: https://blog.acolyer.org/2016/02/24/a-critique-of-ansi-sql-isolation-levels/

Page 32:

Implementation: Introduction

- Two core components:
  - Process Manager (PM): in charge of how processes are executed.
  - Transactional Storage Manager (TxSM): includes the concurrency control functionality.

State-of-the-art PM and TxSM design for GPUs [6]

From the PM approaches we adopt the latter, but while GPUTx is stored-procedure-based, our implementation is operator-based.

Assumptions (a sketch of such an operator follows below):

- Break transactions into smaller record-level operators (basic CRUD)
- More complex transactions have to be managed higher in the hierarchy
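A sketch of what such a record-level operator could look like (type and field names are assumptions for illustration, not the authors' code):

// Illustrative only: one way to represent the record-level operators the
// transactions are broken into.
#include <cstdint>
#include <string>
#include <vector>

enum class CrudOp { Create, Read, Update, Delete };

struct RecordOperator {
    CrudOp        op;
    std::uint64_t key;          // single-key operations only
    std::uint32_t field = 0;    // which field a Read/Update touches (10 per tuple in the benchmark)
    std::string   value;        // payload for Create/Update
};

// A more complex transaction is decomposed higher in the hierarchy into a
// sequence of such operators, which are then fed to the batch collector.
using DecomposedTransaction = std::vector<RecordOperator>;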

Page 33:

GPU computing

- For optimal execution behavior, each thread within a work group should access sequential blocks of memory (coalesced memory access; see the kernel sketch after the diagram below)

- Communication bottleneck: data needs to be transferred from RAM to GPU and back over a PCIe bus

- Bandwidth bottleneck: the bandwidth of a PCIe bus is lower than the bandwidth of a GPU

- composed of many cores → multiple threads at a time
- efficient at data parallelism
- limited memory size

[Diagram: GPU memory hierarchy — per-block shared memory, per-thread registers and local memory, and constant, texture and global memory reachable from the CPU.]
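To make the coalescing point concrete, the kernel below (embedded as a C++ string; illustrative only, not from the paper) sums the fields of each row under two layouts. With a column layout, neighbouring work-items read neighbouring addresses and the loads coalesce; with a row layout (the commented-out variant), neighbouring work-items are n_fields elements apart and coalescing is lost.

// Illustrative only: coalesced vs. non-coalesced access patterns.
static const char* kAccessPatternKernelSrc = R"CLC(
__kernel void sum_fields(__global const long *data,   // n_rows * n_fields values
                         __global long       *out,
                         const uint           n_rows,
                         const uint           n_fields)
{
    size_t i = get_global_id(0);        // one work-item per row
    long acc = 0;

    // Column layout: field f of all rows is contiguous, so neighbouring
    // work-items touch neighbouring addresses -> coalesced loads.
    for (uint f = 0; f < n_fields; ++f)
        acc += data[f * n_rows + i];

    // Row layout (same data, transposed): each work-item walks its own row,
    // neighbouring work-items are n_fields elements apart -> not coalesced.
    // for (uint f = 0; f < n_fields; ++f)
    //     acc += data[i * n_fields + f];

    out[i] = acc;
}
)CLC";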

Page 34:

Motivation: GPUs for OLTP

GPUs for OLTP:

- The goal: to reduce the cost of database ownership by improvements in throughput

- There are challenges both from the device and from the workload

Page 35:

Row vs. column store

Row store:

- allows operations that affect all attributes to be performed quickly

- one pointer to access a tuple

- fields of a tuple are likely to be pre-fetched in the cache

- good fit for OLTP

- if only a fraction of attributes is needed, unnecessary fields are retrieved together with relevant data

[Diagram: a table with attributes A, B, C and tuples (a1, b1, c1), (a2, b2, c2), (a3, b3, c3), stored row-wise vs. column-wise.]

Column store:

- allows reading only the necessary data

- good fit for OLAP

- better compression rate

- requires accessing each field separately
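The same contrast as a minimal C++ sketch (field types simplified to 64-bit integers; illustrative only):

// Illustrative only: the two layouts from the diagram above for a table
// with attributes A, B, C.
#include <cstdint>
#include <vector>

// Row store: all fields of a tuple are adjacent, so reading a whole tuple
// touches one contiguous block (good for OLTP-style point accesses), but
// touching only column B still drags A and C through the cache.
struct RowRecord { std::int64_t a, b, c; };
using RowStore = std::vector<RowRecord>;

// Column store: one array per attribute; scanning or updating a single field
// reads only that array (good for OLAP scans and for coalesced GPU access),
// but reassembling a full tuple needs one access per column.
struct ColumnStore {
    std::vector<std::int64_t> a, b, c;
};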