
Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems

Tayler H. Hetheringtonɣ, Timothy G. Rogersɣ, Lisa Hsu*, Mike O’Connor*, Tor M. Aamodtɣ

ɣ UBC (University of British Columbia)   * AMD

In Proc. 2012 ACM/IEEE Int’l Symp. on Performance Analysis of Systems and Software (ISPASS)

Rich Miler – www.datacenterknowledge.com

Motivation

• Server farms require a lot of power
  – Need for efficient, cost-effective solutions
  – GPU/APUs
• New types of workloads
  – Non-HPC
  – Server applications
• Server applications
  – Memcached
• Programmer’s initial intuition into an application’s behavior

[Figure: bar chart of SIMD efficiency (y-axis, 0% to 50%), Intuition vs. Actual]

Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU

Bruno Giussani – ww.wired.com

Background - Memcached

*Slide from HPCA-18, 2012 Facebook Keynote, Sanjeev Kumar

Memcached - Compatible with GPU?

• Irregular control flow
• Irregular memory access patterns
• Large memory requirements
• Highly input data dependent

Porting Memcached - Simple key-value lookup

[Figure: flow diagram of a GET request at a server, with the stages Hash, Memory, Key Comparison, Hash chaining, and Return Hit/Miss, and with Hit and Miss branches]

• READ (GET) requests on GPU
• WRITE (SET) requests on CPU
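The GET path named above (hash the key, read the bucket from memory, compare keys, follow the hash chain on a mismatch, return hit or miss) can be sketched as follows. This is a minimal illustration and not the Memcached source: the names `Item`, `HashTable`, `get`, and `set`, the bucket count, and the choice of CRC32 as the hash function are all assumptions made for the example.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    key: str
    value: str
    next: Optional["Item"] = None  # next item in the same bucket's chain


class HashTable:
    """Chained hash table standing in for the key-value store (assumed layout)."""

    def __init__(self, num_buckets: int = 8):
        self.buckets = [None] * num_buckets

    def _index(self, key: str) -> int:
        # Hash step: map the key to a bucket.
        return zlib.crc32(key.encode()) % len(self.buckets)

    def set(self, key: str, value: str) -> None:
        # WRITE (SET): update in place, or push a new item at the chain head.
        idx = self._index(key)
        item = self.buckets[idx]
        while item is not None:
            if item.key == key:
                item.value = value
                return
            item = item.next
        self.buckets[idx] = Item(key, value, self.buckets[idx])

    def get(self, key: str) -> Optional[str]:
        # READ (GET): hash, load the bucket head, then compare keys.
        item = self.buckets[self._index(key)]
        while item is not None:
            if item.key == key:  # key comparison succeeded: Hit
                return item.value
            item = item.next     # mismatch: follow the hash chain
        return None              # chain exhausted: Miss
```

The `get` loop runs a different number of iterations for each key, depending on how long the chain is and where the key sits in it. That per-request variation is one way input-dependent control flow and memory access can arise.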

Porting Memcached - Batching

[Figure: the GET request flow (Hash, Memory, Key Comparison, Hash chaining, Return Hit/Miss) repeated for many GET requests arriving at servers 2 through n, which are grouped into a batch]
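Batching can be pictured as collecting independent GET requests into fixed-size groups and serving each group in one pass. The sketch below is an assumption-laden illustration: `make_batches`, `process_batch`, and the plain dictionary used as the store are hypothetical, and the list comprehension only stands in for a GPU kernel launch in which each request would map to one work item.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def make_batches(keys: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group incoming GET keys into batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield batch


def process_batch(store: Dict[str, str], batch: List[str]) -> List[Optional[str]]:
    """Serve one batch of GETs; None marks a miss.

    On a GPU each element would be handled by one work item; this
    comprehension is only a sequential stand-in for that launch.
    """
    return [store.get(key) for key in batch]


store = {"a": "1", "b": "2"}
for batch in make_batches(["a", "x", "b", "a", "b"], batch_size=2):
    print(batch, process_batch(store, batch))
```

Larger batches expose more parallel work per launch, but a request that arrives early has to wait until its batch fills, so batch size trades throughput against latency.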

Porting Memcached

• Main Goals
  – Increase request throughput
  – Keep request latency reasonable
• Main Challenges
  – Irregular memory access patterns
  – Irregular control flow
  – Data transfer overheads

Methodology

• Hardware
  – AMD Radeon HD 5870 (Discrete)
  – AMD Llano A8-3850 (Fusion)
  – AMD Zacate E-350 (Fusion)
• Simulators
  – GPGPU-Sim v3.x
  – In-house GPU control flow simulator
• Testing and Simulation
  – Traces of Wikipedia accesses

Porting Memcached - Memory Access

• One request per work item
• Data accesses for GET requests are input data dependent
• Data can be anywhere in memory
  – Poor performance on GPU?
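The concern that data can be anywhere in memory can be made concrete by counting how many distinct memory lines one group of work items touches. The sketch below is illustrative only: the 128-byte line size, the group of 32 work items, and the function name `lines_touched` are assumptions, not parameters taken from the study.

```python
from typing import Iterable

LINE_BYTES = 128  # assumed cache-line / memory-transaction size


def lines_touched(addresses: Iterable[int], line_bytes: int = LINE_BYTES) -> int:
    """Number of distinct memory lines accessed by one group of work items."""
    return len({addr // line_bytes for addr in addresses})


# 32 work items reading consecutive 4-byte words share a single line.
coalesced = [4 * i for i in range(32)]
# 32 work items each reading from a different, widely spaced location.
scattered = [4096 * i for i in range(32)]

print(lines_touched(coalesced))  # 1
print(lines_touched(scattered))  # 32
```

With one request per work item, the addresses are set by the requested keys, so the scattered case is the one to expect for GET lookups.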

Porting Memcached - Memory Divergence

[Figure: bar chart of percentage of peak IPC (y-axis, 0% to 35%) for the configurations No L1 Cache, 8k 8-way, 32k 8-way, 64k 8-way, 128k 8-way, 256k 8-way, 1M 8-way, 1M FA, No Mem Latency, and No Mem Stalls]

Porting Memcached - Control Flow

• Recall the control flow graph
• Many branch outcomes are input data dependent

[Figure: work item IDs 1 – 2 – 3 – 4 – 5 start together, split into the groups 1 – 2 – 5 and 3 – 4, then split again into 1 – 5, 2, and 3 – 4]
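The way input-dependent branches split a group of work items can be sketched as repeated partitioning. In the example below, the function `diverge` and the two branch predicates are hypothetical; the predicates were chosen only so that the resulting groups match the work-item groups listed above.

```python
from typing import Callable, List


def diverge(groups: List[List[int]], taken: Callable[[int], bool]) -> List[List[int]]:
    """Split each group of work items by the outcome of one branch.

    Work items that agree on the outcome stay together; empty groups are dropped.
    """
    result: List[List[int]] = []
    for group in groups:
        taken_side = [w for w in group if taken(w)]
        other_side = [w for w in group if not taken(w)]
        result.extend(g for g in (taken_side, other_side) if g)
    return result


work_items = [[1, 2, 3, 4, 5]]
# Hypothetical branch outcomes, picked to reproduce the groups in the slide.
after_first = diverge(work_items, lambda w: w in {1, 2, 5})
after_second = diverge(after_first, lambda w: w != 2)

print(after_first)   # [[1, 2, 5], [3, 4]]
print(after_second)  # [[1, 5], [2], [3, 4]]
```

Each resulting group executes separately on the SIMD hardware, so more groups means fewer active lanes per issued instruction.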

Porting Memcached - Control Flow

[Figure: SIMD Efficiency Breakdown. Stacked bars for MC (Pes), MC (Aug), and MC (Act), y-axis 0% to 100%, broken down by # active work-items (1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-28, 29-32). Annotations: 15%, 40%, 62%, 29%; Overall 51%]
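SIMD efficiency is commonly computed as the average fraction of lanes that are active per issued instruction, and a breakdown like the one charted here bins issued instructions by their active work-item count. The sketch below assumes a SIMD width of 32 (matching the 1-32 range of the bins); the function names and the sample counts are illustrative and are not data from the study.

```python
from collections import Counter
from typing import Dict, List

WIDTH = 32  # assumed work items per SIMD group


def simd_efficiency(active_counts: List[int], width: int = WIDTH) -> float:
    """Average fraction of active lanes per issued instruction.

    active_counts must be non-empty, with each entry between 1 and width.
    """
    return sum(active_counts) / (width * len(active_counts))


def breakdown(active_counts: List[int], bin_size: int = 4) -> Dict[str, float]:
    """Fraction of issued instructions falling in each active-lane bin."""
    bins = Counter((n - 1) // bin_size for n in active_counts)
    total = len(active_counts)
    return {
        f"{b * bin_size + 1}-{(b + 1) * bin_size}": count / total
        for b, count in sorted(bins.items())
    }


# Illustrative lane counts only, not measurements.
counts = [32, 32, 16, 4]
print(simd_efficiency(counts))  # 0.65625
print(breakdown(counts))        # {'1-4': 0.25, '13-16': 0.25, '29-32': 0.5}
```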

Porting Memcached - Data Management

• Dynamic memory manager
• Transfer memory regions to device
• Virtual addresses different on host and device
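One common way to cope with differing host and device virtual addresses is to store links as offsets into the transferred region rather than as absolute pointers. The slides do not say which approach was used, so the sketch below is an assumption throughout: `Region`, `build_chain`, `lookup`, and the `NULL` marker are hypothetical names, and copying a Python list only stands in for a transfer to device memory.

```python
from typing import List, Optional, Tuple

NULL = -1  # offset used as the "no next item" marker (assumption)

# A region is a flat list of records: (key, value, next_offset).
Region = List[Tuple[str, str, int]]


def build_chain(pairs: List[Tuple[str, str]]) -> Region:
    """Lay items out contiguously; each links to the next by offset."""
    last = len(pairs) - 1
    return [(k, v, i + 1 if i < last else NULL) for i, (k, v) in enumerate(pairs)]


def lookup(region: Region, head: int, key: str) -> Optional[str]:
    """Walk the chain using offsets only, so no base address is needed."""
    offset = head
    while offset != NULL:
        k, v, nxt = region[offset]
        if k == key:
            return v
        offset = nxt
    return None


host_region = build_chain([("a", "1"), ("b", "2"), ("c", "3")])
device_region = list(host_region)  # stands in for a copy to device memory
print(lookup(device_region, 0, "c"))  # 3
```

Because every link is relative to the start of the region, the copy can be traversed without rewriting any addresses after the transfer.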

Porting Memcached - Data Transfer Reduction

• Fusion Systems
  – Physical shared memory region between host and device
  – Zero-copy data
• Discrete Systems
  – Possible transfer reduction techniques
    • Reduction in unnecessary transfers
    • Acyclic data transfers (Overlap comm. with comp.)
    • Automatic data transfer frameworks
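The benefit of overlapping communication with computation can be estimated with a simple two-stage pipeline model. The sketch below is a back-of-the-envelope model under stated assumptions: transfers are issued back to back, compute for a chunk starts once its transfer and the previous chunk's compute have both finished, and the timings are arbitrary illustrative units rather than measurements.

```python
from typing import List, Tuple

Chunk = Tuple[float, float]  # (transfer_time, compute_time)


def serial_time(chunks: List[Chunk]) -> float:
    """Total time when each transfer and compute step runs one after another."""
    return sum(t + c for t, c in chunks)


def overlapped_time(chunks: List[Chunk]) -> float:
    """Total time when transfers overlap with compute on earlier chunks."""
    transfer_done = 0.0
    compute_done = 0.0
    for t, c in chunks:
        transfer_done += t  # transfers run back to back
        compute_done = max(transfer_done, compute_done) + c
    return compute_done


# Illustrative timings in arbitrary units, not measurements.
chunks = [(2.0, 3.0)] * 4
print(serial_time(chunks))      # 20.0
print(overlapped_time(chunks))  # 14.0
```

In this model only the first transfer is fully exposed when compute is the slower stage; when transfers dominate, overlap helps much less.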

Porting Memcached

[Figure: Performance vs. CPU. Speed up (X) with No Data Transfers and with Data Transfers for the AMD Radeon HD 5870, Llano A8-3850, and Zacate E-350]

[Figure: Execution Breakdown. Share of time (0% to 100%) spent in Data Transfer vs. Execution for the AMD Radeon HD 5870, Llano A8-3850, and Zacate E-350]

Results - Radeon HD 5870

[Figure: Normalized Throughput (Requests/Second) and Normalized Latency vs. Requests / Batch (x-axis, 0 to 90000), with the annotation "Latency - 0.5ms"]

• ~8000 requests per batch yields the highest ratio of throughput to latency
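Picking the batch size with the best throughput-to-latency ratio amounts to a simple argmax over measured points. In the sketch below, the function `best_batch_size` and the `Sample` layout are hypothetical, and the sample values are placeholders chosen for illustration; they are not the measured results from this slide.

```python
from typing import List, Tuple

# (requests per batch, normalized throughput, normalized latency)
Sample = Tuple[int, float, float]


def best_batch_size(samples: List[Sample]) -> int:
    """Return the batch size with the highest throughput-to-latency ratio."""
    if not samples:
        raise ValueError("need at least one sample")
    size, _, _ = max(samples, key=lambda s: s[1] / s[2])
    return size


# Placeholder values for illustration only, not the measured results.
samples = [(100, 2.0, 1.0), (200, 6.0, 2.0), (400, 7.5, 10.0)]
print(best_batch_size(samples))  # 200
```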

Summary

• Programmer intuition doesn’t always paint the whole picture
• We exploited the available parallelism on GPUs by batching requests, showing a 7.5X performance increase on the Llano system
• Data transfer overheads can have a large impact on overall performance
• Thank you – Questions?

Rich Miler – www.datacenterknowledge.com