Low-Latency Transaction Execution on Graphics Processors: Dream or Reality?

Iya Arefyeva, Gabriel Campero Durand, Marcus Pinnecke, David Broneske, Gunter Saake

Workgroup Databases and Software Engineering, University of Magdeburg


Page 1:

Low-Latency Transaction Execution on Graphics Processors: Dream or Reality?

Iya Arefyeva, Gabriel Campero Durand, Marcus Pinnecke, David Broneske, Gunter Saake

Workgroup Databases and Software Engineering, University of Magdeburg

Page 2:

Motivation: Context

GPGPUs are becoming essential for accelerating computation

- 3 of the top 5 systems on the TOP500 list (June 2018) are powered by GPUs
- 56% of the flops on the list come from GPU acceleration [10]

Summit Supercomputer, Oak Ridge

Page 3:

Motivation: Context

GPUs are also important for accelerating database workloads:

Online analytical processing (OLAP):

- few long-running tasks performed on big chunks of data
- easy to exploit data parallelism → good for GPUs

GPU-accelerated systems for OLAP: GDB [1], HyPE [2], CoGaDB [3], Ocelot [4], Caldera [5], MapD [8]

Online transaction processing (OLTP):

- thousands of short-lived transactions within a short period of time
- data should be processed as soon as possible due to user interaction
- comparably less studied on GPUs

GPU-accelerated systems for OLTP: GPUTx [6]

Page 4:

Motivation: Context

Hybrid transactional and analytical processing (HTAP):

- real-time analytics on data that is ingested and modified in the transactional database engine

- challenging due to conflicting requirements in workloads

GPU accelerated systems for HTAP: Caldera [5]*

*However, in Caldera GPUs don't process OLTP workloads → possible underutilization

Caldera Architecture [5]

Page 5:

Motivation: GPUs for OLTP

Intrinsic GPU challenges:

1. SPMD processing
2. Coalesced memory access
3. Branch divergence overheads (illustrated in the sketch below)
4. Communication bottleneck: data needs to be transferred from RAM to the GPU and back over the PCIe bus
5. Bandwidth bottleneck: the bandwidth of the PCIe bus is lower than the bandwidth of a GPU
6. Limited memory

SM structure of Nvidia's Pascal GP100 [9]
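To make challenge 3 concrete, below is an illustrative OpenCL kernel (embedded as a C++ source string, the way an OpenCL host program would carry it; not code from the paper) in which neighbouring work-items branch on their data, so the diverging paths are serialized within a SIMD group (warp/wavefront):

// Illustrative only, not code from the paper. Mixing operation types in one
// kernel launch makes neighbouring work-items take different branches;
// work-items that diverge within the same SIMD group execute both paths
// one after the other.
static const char* kDivergentKernelSrc = R"CLC(
__kernel void process_ops(__global const int   *op_type,   // 0 = read, 1 = write
                          __global const ulong *keys,
                          __global long        *table,     // GPU-resident values
                          __global long        *values)    // in: new value, out: read result
{
    size_t i = get_global_id(0);            // one work-item per operation
    if (op_type[i] == 0)
        values[i] = table[keys[i]];         // read path
    else
        table[keys[i]] = values[i];         // write path
}
)CLC";

Collecting reads and writes into separate batches, as the prototype described later does, keeps the operation type uniform within a batch and sidesteps this particular source of divergence.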

Page 6:

Motivation: GPUs for OLTP

OLTP Challenges:

- Managing isolation and consistency with massive parallelism
- Previous research (GPUTx [6]) proposed a bulk execution model and K-set transaction handling

Experiments with GPUTx [6]

Page 7:

Our contributions

In this early work, we

1. Evaluate a simplified version of the K-set execution model from GPUTx, assuming single-key operations and massive point queries.

2. Test on a CRUD benchmark, reporting the impact of batch sizes and bounded staleness.

3. Suggest 2 possible characteristics that could aid in the adoption of GPUs for OLTP, as we seek to adopt them in the design of a GPU OLTP query processor.

Page 8:

Prototype Design

Page 9:

Implementation

- Storage engine is implemented in C++
- OpenCL for GPU programming (an illustrative kernel sketch follows the figure below)
- The table is stored on the GPU (when the GPU is used); only the necessary data is transferred
- Client requests are handled in a single thread
- In order to support operator-based K-sets, several cases need to be considered; these cases determine our transaction manager (the cases are shown on the following slides, and a consolidated sketch follows Case 4)

[Diagram: clients 1..N send requests to the server; requests go through batch collection, then batch processing, then replying to clients.]
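As a concrete illustration of "only the necessary data is transferred": for an update batch, only the batched keys and new values are copied to the device, while the table itself stays resident in GPU memory. A minimal kernel sketch is shown below, assuming keys have already been mapped to record positions on the host and the table is kept as one contiguous array per field; names are illustrative, not the authors' code.

// Hypothetical sketch of the GPU side of batch processing for single-field
// updates. The host copies only `keys` and `new_values` to the device and
// enqueues the kernel with global size = batch size; `field` is the
// GPU-resident column being updated.
static const char* kBatchUpdateKernelSrc = R"CLC(
__kernel void batch_update(__global uchar       *field,       // one column, GPU-resident
                           __global const ulong *keys,        // record positions of the updates
                           __global const uchar *new_values,  // one value per update, packed
                           const uint            field_size)  // bytes per field (100 here)
{
    size_t i = get_global_id(0);                      // one work-item per batched update
    __global uchar       *dst = field      + keys[i] * field_size;
    __global const uchar *src = new_values + i       * field_size;
    for (uint b = 0; b < field_size; ++b)             // copy the new field value into place
        dst[b] = src[b];
}
)CLC";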

Page 10:

Implementation

[Diagram: a collected batch for writes (writes 1-7 on keys 22, 4, 19, 8, 10, 1, 56); a new request (write 8, key 5) arrives.]

Case 0: If a batch is not completely filled, the server waits for K seconds after receiving the last request and then executes everything (K = 0.1 in our experiments).

Case 1: Reads or independent writes are simply appended to the corresponding batch.

Page 11:

Implementation

[Diagram: the new request (write 8, key 5) is independent, so it is appended to the collected batch for writes.]

Case 0: If a batch is not completely filled, the server waits for K seconds after receiving the last request and then executes everything (K = 0.1 in our experiments).

Case 1: Reads or independent writes are simply appended to the corresponding batch.

Page 12:

Implementation

[Diagram: the collected batch for writes is now full, so batch processing is triggered.]

Case 0: If a batch is not completely filled, the server waits for K seconds after receiving the last request and then executes everything (K = 0.1 in our experiments).

Case 1: Reads or independent writes are simply appended to the corresponding batch.

Page 13:

Implementation

Case 2: Write after Write

[Diagram: a new request (write 8, key 4) arrives while the collected batch for writes (writes 1-6 on keys 22, 4, 19, 8, 10, 1) already contains a write to key 4 → conflict.]

Page 14:

Implementation

Case 2: Write after Write

[Diagram: the conflict triggers a flush — the collected batch for writes is processed (batch processing) before the new request is accepted.]

Page 15:

Implementation

Case 2: Write after Write

[Diagram: after the flush, write 8 (key 4) starts a new collected batch for writes.]

Page 16:

Implementation

Case 3: Read after Write

[Diagram: a new request (read 5, key 4) conflicts with a write to key 4 in the collected batch for writes → the writes are flushed (batch processing), and the read is appended to the collected batch for reads (reads 1-4 on keys 7, 13, 32, 25).]

Page 17:

Implementation

Case 4: Write after Read

[Diagram: a new request (write 5, key 4) conflicts with a read of key 4 in the collected batch for reads → the reads are flushed (batch processing), and the write is appended to the collected batch for writes.]
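Putting Cases 0-4 together, the transaction manager reduces to a small dispatch loop. The sketch below is illustrative only: Request, Batch, receive_request and execute_and_reply are assumed names, not the authors' code; execute_and_reply stands for running the whole collected batch on the GPU (or CPU), replying to the waiting clients, and clearing the batch.

// Illustrative sketch only: Request, Batch, receive_request and execute_and_reply
// are assumed names, not the authors' API.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

enum class OpType { Read, Write };

struct Request {
    OpType        type;
    std::uint64_t key;
    // value payload and client handle for replying omitted
};

struct Batch {
    std::vector<Request> ops;
    bool contains_key(std::uint64_t key) const {
        for (const auto& r : ops)
            if (r.key == key) return true;
        return false;
    }
};

// Assumed to return std::nullopt once no request has arrived for `timeout`
// seconds (K = 0.1 s in the experiments).
std::optional<Request> receive_request(std::chrono::duration<double> timeout);

// Runs the collected batch on the GPU (OpenCL kernel) or CPU, replies to the
// waiting clients, and clears the batch.
void execute_and_reply(Batch& b) {
    // ... launch the kernel / CPU path over b.ops, send replies ...
    b.ops.clear();
}

void server_loop(std::size_t max_batch, std::chrono::duration<double> timeout) {
    Batch writes, reads;
    for (;;) {
        std::optional<Request> req = receive_request(timeout);
        if (!req) {                                   // Case 0: timeout expired, execute everything collected
            execute_and_reply(writes);
            execute_and_reply(reads);
            continue;
        }
        if (req->type == OpType::Write) {
            if (writes.contains_key(req->key)) execute_and_reply(writes);  // Case 2: write after write
            if (reads.contains_key(req->key))  execute_and_reply(reads);   // Case 4: write after read
            writes.ops.push_back(*req);                                    // Case 1: independent write
            if (writes.ops.size() == max_batch) execute_and_reply(writes); // batch full
        } else {
            if (writes.contains_key(req->key)) execute_and_reply(writes);  // Case 3: read after write
            reads.ops.push_back(*req);                                     // reads never conflict with reads
            if (reads.ops.size() == max_batch) execute_and_reply(reads);   // batch full
        }
    }
}

A single conflicting key forces an early flush, which is why the mixed-workload results later discuss how batch size interacts with conflicting operations.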

Page 18:

Evaluation

Page 19:

YCSB (Yahoo! Cloud Serving Benchmark)

YCSB client architecture [7]

Page 20:

Workloads

Read-only workload R: 100k read operations; all fields of a tuple are read; Zipfian distribution of requests.

Write-only workload W: 1 million update operations; only one field is updated; Zipfian distribution of requests.

Mixed workload M: 100k read/update operations (50% reads and 50% updates); 80% of operations access the last entries (20% of tuples).

Setup: 10k records in the table; each tuple consists of 10 fields (100 bytes each), key length is 24 bytes.

- CPU: Intel Xeon E5-2630
- GPU: Nvidia Tesla K40c
- OpenCL 1.2
- CentOS 7.1 (kernel version 3.10.0)

Goal for R and W: evaluating performance on independent reads or writes to find the impact of batch size.

Goal for M: what is the impact of concurrency control? Do stale reads improve performance?
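For illustration, a simple inverse-CDF Zipfian key generator of the kind workloads R and W imply (the skew constant 0.99 is the usual YCSB default; the authors' actual generator is not shown in the slides, so treat this purely as a sketch):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Illustrative only: precomputes the Zipfian CDF over n keys and samples by
// inverse-CDF lookup; fine for n = 10k records as in the setup above.
class ZipfianGenerator {
public:
    explicit ZipfianGenerator(std::size_t n, double theta = 0.99) : cdf_(n) {
        double norm = 0.0;
        for (std::size_t i = 0; i < n; ++i) norm += 1.0 / std::pow(i + 1.0, theta);
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            acc += (1.0 / std::pow(i + 1.0, theta)) / norm;
            cdf_[i] = acc;                       // cumulative probability of key i
        }
    }

    // Draws a key in [0, n); low-numbered keys are drawn far more often.
    std::size_t next(std::mt19937& rng) const {
        double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
        auto idx = static_cast<std::size_t>(
            std::lower_bound(cdf_.begin(), cdf_.end(), u) - cdf_.begin());
        return std::min(idx, cdf_.size() - 1);   // guard against rounding at the tail
    }

private:
    std::vector<double> cdf_;
};

Workload M's skew (80% of operations on the last 20% of tuples) would need a different generator biased toward recently inserted keys.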

Page 21:

Evaluation (workload R - read only)

- CPU & row store provides the best performance
- Small batches reduce collection time
- Very small batches for GPUs are not efficient
- Execution is faster with bigger batches
- However, this does not compensate for the slow response time

Page 22:

Evaluation (workload W - update only)

- CPU & row store provides the best performance
- Small batches reduce collection time
- Very small batches for GPUs are not efficient
- Execution is faster with bigger batches
- However, this does not compensate for the slow response time

Page 23:

Evaluation (workload M - read/update, CPU)

- Concurrency control is beneficial for the CPU: smaller batches → clients get replies quicker
- Allowing stale reads (0.01 s) improves the performance for the CPU due to shorter waiting time before execution
- Big batches are better because of the reduced waiting time in case of conflicting operations: with big batches, more operations are executed & the server waits less often

[Plots: results with concurrency control (CC) and without CC]

Page 24:

Evaluation (workload M - read/update, GPU)

- Concurrency control is not beneficial for the GPU: smaller batches → the GPU is not utilized efficiently
- Allowing stale reads improves the performance for the GPU & column store due to shorter waiting time before execution
- Big batches are better because of the reduced waiting time in case of conflicting operations: more operations are executed → the server waits less often

[Plots: results with concurrency control (CC) and without CC]

Page 25:

Conclusions and Future Work

Page 26:

Discussion & Conclusion

The GPU batch size conundrum for OLTP:

Case 1: small batches are processed

- clients get replies quicker
- GPUs are not utilized efficiently due to the small number of data elements

(this could be improved by splitting requests into fine-grained operations)

Case 2: big batches are processed

- many data elements are beneficial for GPUs
- but it takes long to collect batches and throughput can be decreased

(this gets faster with higher arrival rates)

+ Other considerations: transfer overhead in case the table is not stored on the GPU

Page 27:

Future Work

+ More complex transactions and support for rollbacks

+ Concepts for recovery and logging

+ Comparison with state-of-the-art systems

Page 28:

Future Work

+ More complex transactions and support for rollbacks

+ Concepts for recovery and logging

+ Comparison with state-of-the-art systems

Thank you!

Questions?

Page 29:

References

1. He, B., Lu, M., Yang, K., Fang, R., Govindaraju, N.K., Luo, Q. and Sander, P.V., 2009. Relational query coprocessing on graphics processors. ACM Transactions on Database Systems (TODS), 34(4), p.21.

2. Breß, S. and Saake, G., 2013. Why it is time for a HyPE: A hybrid query processing engine for efficient GPU coprocessing in DBMS. Proceedings of the VLDB Endowment, 6(12), pp.1398-1403.

3. Breß, S., 2014. The design and implementation of CoGaDB: A column-oriented GPU-accelerated DBMS. Datenbank-Spektrum, 14(3), pp.199-209.

4. Heimel, M., Saecker, M., Pirk, H., Manegold, S. and Markl, V., 2013. Hardware-oblivious parallelism for in-memory column-stores. Proceedings of the VLDB Endowment, 6(9), pp.709-720.

5. Appuswamy, R., Karpathiotakis, M., Porobic, D. and Ailamaki, A., 2017. The Case For Heterogeneous HTAP. In 8th Biennial Conference on Innovative Data Systems Research (No. EPFL-CONF-224447).

6. He, B. and Yu, J.X., 2011. High-throughput transaction executions on graphics processors. Proceedings of the VLDB Endowment, 4(5), pp.314-325.

7. Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R. and Sears, R., 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pp. 143-154.

8. MapD Product Website: https://www.mapd.com/

9. Soyata, T., 2018. GPU Parallel Program Development Using CUDA. Chapman and Hall/CRC.

Page 30:

References

10. TOP500 news: https://www.top500.org/news/new-gpu-accelerated-supercomputers-change-the-balance-of-power-on-the-top500/

11. Appuswamy, R., Anadiotis, A.C., Porobic, D., Iman, M.K. and Ailamaki, A., 2017. Analyzing the impact of system architecture on the scalability of OLTP engines for high-contention workloads. Proceedings of the VLDB Endowment, 11(2), pp.121-134.

Page 31:

Extra Slides: Isolation level

+ By assuming single-operation transactions and no consistency checks, our results avoid divergence and report on larger batches.

+ We work at the single-key CRUD level, not the SQL level.

+ All anomalies are preventable through K-sets and managing the dependency graph, so this more general approach supports complete serializability.

Source: https://blog.acolyer.org/2016/02/24/a-critique-of-ansi-sql-isolation-levels/

Page 32:

Implementation: Introduction

- Two core components:
  - Process Manager (PM): in charge of how processes are executed.
  - Transactional Storage Manager (TxSM): includes the concurrency control functionality.

State-of-the-art PM and TxSM design for GPUs [6]

From the PM approaches we adopt the latter, but while GPUTx is stored-procedure-based, our implementation is operator-based.

Assumptions (a sketch of such an operator follows below):

- Break transactions into smaller record-level operators (basic CRUD)
- More complex transactions have to be managed higher in the hierarchy
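A sketch of what such a record-level operator could look like (type and field names are assumptions for illustration, not the authors' code):

// Illustrative only: one way to represent the record-level operators the
// transactions are broken into.
#include <cstdint>
#include <string>
#include <vector>

enum class CrudOp { Create, Read, Update, Delete };

struct RecordOperator {
    CrudOp        op;
    std::uint64_t key;          // single-key operations only
    std::uint32_t field = 0;    // which field a Read/Update touches (10 per tuple in the benchmark)
    std::string   value;        // payload for Create/Update
};

// A more complex transaction is decomposed higher in the hierarchy into a
// sequence of such operators, which are then fed to the batch collector.
using DecomposedTransaction = std::vector<RecordOperator>;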

Page 33:

GPU computing

- For optimal execution behavior, each thread within a work group should access sequential blocks of memory (coalesced memory access; see the kernel sketch after the diagram below)

- Communication bottleneck: data needs to be transferred from RAM to GPU and back over a PCIe bus

- Bandwidth bottleneck: the bandwidth of a PCIe bus is lower than the bandwidth of a GPU

- composed of many cores → multiple threads at a time
- efficient at data parallelism
- limited memory size

[Diagram: GPU memory hierarchy — per-block shared memory, per-thread registers and local memory, and constant, texture and global memory reachable from the CPU.]
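To make the coalescing point concrete, the kernel below (embedded as a C++ string; illustrative only, not from the paper) sums the fields of each row under two layouts. With a column layout, neighbouring work-items read neighbouring addresses and the loads coalesce; with a row layout (the commented-out variant), neighbouring work-items are n_fields elements apart and coalescing is lost.

// Illustrative only: coalesced vs. non-coalesced access patterns.
static const char* kAccessPatternKernelSrc = R"CLC(
__kernel void sum_fields(__global const long *data,   // n_rows * n_fields values
                         __global long       *out,
                         const uint           n_rows,
                         const uint           n_fields)
{
    size_t i = get_global_id(0);        // one work-item per row
    long acc = 0;

    // Column layout: field f of all rows is contiguous, so neighbouring
    // work-items touch neighbouring addresses -> coalesced loads.
    for (uint f = 0; f < n_fields; ++f)
        acc += data[f * n_rows + i];

    // Row layout (same data, transposed): each work-item walks its own row,
    // neighbouring work-items are n_fields elements apart -> not coalesced.
    // for (uint f = 0; f < n_fields; ++f)
    //     acc += data[i * n_fields + f];

    out[i] = acc;
}
)CLC";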

Page 34:

Motivation: GPUs for OLTP

GPUs for OLTP:

- The goal: to reduce the cost of database ownership by improvements in throughput

- There are challenges both from the device and from the workload

Page 35:

Row vs. column store

Row store:

- allows operations that affect all attributes to be performed quickly

- one pointer to access a tuple

- fields of a tuple are likely to be pre-fetched in the cache

- good fit for OLTP

- if only a fraction of attributes is needed, unnecessary fields are retrieved together with relevant data

[Diagram: a table with attributes A, B, C and tuples (a1, b1, c1), (a2, b2, c2), (a3, b3, c3), stored row-wise vs. column-wise.]

Column store:

- allows reading only the necessary data

- good fit for OLAP

- better compression rate

- requires accessing each field separately
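The same contrast as a minimal C++ sketch (field types simplified to 64-bit integers; illustrative only):

// Illustrative only: the two layouts from the diagram above for a table
// with attributes A, B, C.
#include <cstdint>
#include <vector>

// Row store: all fields of a tuple are adjacent, so reading a whole tuple
// touches one contiguous block (good for OLTP-style point accesses), but
// touching only column B still drags A and C through the cache.
struct RowRecord { std::int64_t a, b, c; };
using RowStore = std::vector<RowRecord>;

// Column store: one array per attribute; scanning or updating a single field
// reads only that array (good for OLAP scans and for coalesced GPU access),
// but reassembling a full tuple needs one access per column.
struct ColumnStore {
    std::vector<std::int64_t> a, b, c;
};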