
Page 1: A GPU Accelerated Storage System

A GPU Accelerated Storage System

NetSysLab, The University of British Columbia

Abdullah Gharaibeh

with: Samer Al-Kiswany

Sathish Gopalakrishnan

Matei Ripeanu

Page 2

GPUs radically change the cost landscape

[Chart comparing cost: $600 vs. $1279 (Source: CUDA Guide)]

Page 3

Harnessing GPU Power is Challenging

more complex programming model

limited memory space

accelerator / co-processor model

Page 4

Motivating Question: Does the 10x reduction in computation costs GPUs offer change the way we design/implement distributed systems?

Context: Distributed Storage Systems

Page 5

Distributed Systems: Computationally Intensive Operations

These operations are computationally intensive and limit performance.

Operations: hashing, erasure coding, encryption/decryption, membership testing (Bloom filter), compression

Techniques enabled: similarity detection, content addressability, security, integrity checks, redundancy, load balancing, summary cache, storage efficiency
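As a rough plain-Python sketch of two of the operations above (the helper names and parameters are ours, purely illustrative): content addressability names a block by the hash of its data, and a Bloom filter gives cheap probabilistic membership testing for a summary cache.

```python
import hashlib

def block_id(block: bytes) -> str:
    """Content addressability: a block is named by the hash of its data."""
    return hashlib.sha1(block).hexdigest()

class BloomFilter:
    """Membership testing: may report false positives, never false negatives."""
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector packed into a Python int

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions from salted SHA-1 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

Two identical blocks always receive the same `block_id`, which is the property similarity detection exploits.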

Page 6

Distributed Storage System Architecture

[Diagram: the application on the client accesses storage through an FS API and access module; a metadata manager and a set of storage nodes complete the system. Files are divided into a stream of blocks (b1, b2, b3, …, bn). An offloading layer runs the enabling operations (hashing, compression, encoding/decoding, encryption/decryption) on either the CPU or the GPU, supporting techniques that improve performance/reliability: similarity detection, security, integrity checks, and redundancy.]

Page 7

Contributions:

A GPU accelerated storage system: design and prototype implementation that integrates similarity detection and GPU support

End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload

Page 8

Challenges

Integration Challenges

Minimizing the integration effort

Transparency

Separation of concerns

Extracting Major Performance Gains

Hiding memory allocation overheads

Hiding data transfer overheads

Efficient utilization of the GPU memory units

Use of multi-GPU systems

[Diagram: files divided into a stream of blocks (b1, b2, b3, …, bn); the offloading layer performs hashing on the GPU to support similarity detection.]

Page 9

Past Work: Hashing on GPUs

HashGPU [1]: a library that exploits GPUs to support specialized use of hashing in distributed storage systems

One performance data point: accelerates hashing by up to 5x compared to a single-core CPU

However, significant speedup is achieved only for large blocks (>16MB) => not suitable for efficient similarity detection

[Diagram: HashGPU hashes a stream of blocks (b1, b2, b3, …, bn) on the GPU.]

[1] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, “Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems”, HPDC ’08

Page 10

Profiling HashGPU

Amortizing memory allocation and overlapping data transfers and computation may bring important benefits

At least 75% overhead

Page 11

CrystalGPU

CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations

[Diagram: files divided into a stream of blocks (b1, b2, b3, …, bn); similarity detection uses HashGPU, which now runs on top of CrystalGPU inside the offloading layer.]

One performance data point: CrystalGPU improves the speedup of the HashGPU library by more than one order of magnitude

Page 12

CrystalGPU Opportunities and Enablers

Opportunity: Reusing GPU memory buffers

Enabler: a high-level memory manager

Opportunity: overlapping communication and computation

Enabler: double buffering and asynchronous kernel launch

Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters)

Enabler: a task queue manager
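A CPU-only sketch of how these three enablers fit together, using Python threads and queues as stand-ins for CUDA streams and GPUs (all names below are ours; the real CrystalGPU layer manages device memory and asynchronous kernel launches):

```python
import queue
import threading

class BufferPool:
    """Reuse pre-allocated buffers instead of allocating one per task."""
    def __init__(self, count, size):
        self.free = queue.Queue()
        for _ in range(count):
            self.free.put(bytearray(size))
    def acquire(self):
        return self.free.get()
    def release(self, buf):
        self.free.put(buf)

def run_pipeline(blocks, num_workers=2):
    """Process a stream of blocks: buffer reuse + overlapped transfer/compute
    + a task queue feeding multiple workers (one per 'GPU')."""
    pool = BufferPool(count=2 * num_workers, size=max(len(b) for b in blocks))
    tasks, results = queue.Queue(), []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: no more work
                return
            buf, n = item
            out = sum(buf[:n])        # stand-in for the GPU kernel
            with lock:
                results.append(out)
            pool.release(buf)         # buffer goes back for reuse

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for b in blocks:                  # "transfer in" overlaps with compute
        buf = pool.acquire()
        buf[:len(b)] = b
        tasks.put((buf, len(b)))
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
    return results
```

While one worker computes over a filled buffer, the producer is already staging the next block into another buffer, which is the essence of double buffering.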

[Diagram: CrystalGPU sits beneath HashGPU in the offloading layer, providing a memory manager, a task queue, and double buffering; files are divided into a stream of blocks (b1, b2, b3, …, bn).]

Page 13

Experimental Evaluation:

CrystalGPU evaluation

End-to-end system evaluation

Page 14

CrystalGPU Evaluation

Testbed: a machine with

CPU: Intel quad-core 2.66 GHz with PCI Express 2.0 x16 bus

GPU: NVIDIA GeForce dual-GPU 9800GX2

Experiment space: HashGPU/CrystalGPU vs. original HashGPU, across three optimizations:

Buffer reuse

Overlapping communication and computation

Exploiting the two GPUs

[Diagram: files divided into a stream of blocks (b1, b2, b3, …, bn); HashGPU runs on the GPU on top of CrystalGPU.]

Page 15

HashGPU Performance on top of CrystalGPU

The gains enabled by the three optimizations can be realized!

Baseline: CPU single core

Page 16

End-to-End System Evaluation

Testbed: four storage nodes and one metadata server; one client with a 9800GX2 GPU

Three implementations: no similarity detection (without-SD); similarity detection on CPU (4 cores @ 2.6GHz) (SD-CPU); similarity detection on GPU (9800 GX2) (SD-GPU)

Three workloads: a real checkpointing workload; completely similar files (all possible gains in terms of data saving); completely different files (only overheads, no gains)

Success metrics: system throughput; impact on a competing application (compute- or I/O-intensive)

Page 17

System Throughput (Checkpointing Workload)

The integrated system preserves the throughput gains on a realistic workload!

1.8x improvement

Page 18

System Throughput (Synthetic Workload of Similar Files)

Offloading to the GPU enables close to optimal performance!

Room for 2x improvement

Page 19

Impact on Competing (Compute Intensive) Application

Writing checkpoints back to back

[Chart annotations: 2x improvement; 7% reduction]

Frees resources (CPU) for competing applications while preserving throughput gains!

Page 20

Summary

We present the design and implementation of a distributed storage system that integrates GPU power

We present CrystalGPU: a management layer that transparently enables common GPU optimizations across GPGPU applications

We empirically demonstrate that employing the GPU enables close to optimal system performance

We shed light on the impact of GPU offloading on competing applications running on the same node

Page 21

netsyslab.ece.ubc.ca

Page 22

Similarity Detection

[Diagram: File A is divided into blocks X, Y, Z and hashed; File B into blocks W, Y, Z. Comparing the hashes shows the files share Y and Z.]

Only the first block is different, potentially improving write throughput
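A toy version of this comparison in plain Python (the function name is ours): hash each block and keep only the indices whose hash changed between the two versions.

```python
import hashlib

def detect_changed_blocks(old_blocks, new_blocks):
    """Return indices of blocks whose content hash differs between versions."""
    old_hashes = [hashlib.sha1(b).digest() for b in old_blocks]
    new_hashes = [hashlib.sha1(b).digest() for b in new_blocks]
    return [i for i, (o, n) in enumerate(zip(old_hashes, new_hashes)) if o != n]

file_a = [b"X", b"Y", b"Z"]  # first checkpoint
file_b = [b"W", b"Y", b"Z"]  # second checkpoint: only the first block differs
changed = detect_changed_blocks(file_a, file_b)  # -> [0]
```

Only block W needs to be written to the storage nodes; Y and Z are already stored, which is where the write-throughput gain comes from.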

Page 23

Execution Path on GPU – Data Processing Application

TTotal = TPreprocessing + TDataHtoG + TProcessing + TDataGtoH + TPostProc

1. Preprocessing (memory allocation)

2. Data transfer in

3. GPU Processing

4. Data transfer out

5. Postprocessing
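Illustrative arithmetic for this execution path, with made-up stage timings (the numbers below are ours, not measurements): run serially, the five stages add up; with buffer reuse amortizing preprocessing and double buffering overlapping transfers with computation over a long stream of blocks, the per-block cost approaches the slowest overlapped stage.

```python
def total_serial(t_pre, t_in, t_proc, t_out, t_post):
    """TTotal when the five stages run back to back for one block."""
    return t_pre + t_in + t_proc + t_out + t_post

def steady_state_per_block(t_in, t_proc, t_out):
    """Per-block cost once transfers overlap computation (preprocessing
    amortized away by buffer reuse): the slowest stage dominates."""
    return max(t_in, t_proc, t_out)

# Hypothetical timings in milliseconds:
serial = total_serial(t_pre=4, t_in=2, t_proc=3, t_out=2, t_post=1)  # 12 ms
overlapped = steady_state_per_block(t_in=2, t_proc=3, t_out=2)       # 3 ms per block
```

Under these assumed numbers the pipeline spends most of its serial time outside the kernel, consistent with the earlier profiling observation that overhead dominates HashGPU's execution.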