using rdma efficiently for key-value services · herd 3 1. improved understanding of rdma through...

34
Using RDMA Efficiently for Key-Value Services Anuj Kalia (CMU) Michael Kaminsky (Intel Labs), David Andersen (CMU) 1

Upload: others

Post on 03-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Using RDMA Efficiently for Key-Value Services

Anuj Kalia (CMU) Michael Kaminsky (Intel Labs), David Andersen (CMU)

1

Page 2: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

RDMA

Remote Direct Memory Access: A network feature that allows direct access to the memory of a remote computer.

2

Page 3: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

HERD

3

1. Improved understanding of RDMA through micro-benchmarking

2. High-performance key-value system:

• Throughput: 26 Mops (2X higher than others)

• Latency: 5 µs (2X lower than others)

Page 4: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

BA

RDMA introFeatures:

• Ultra-low latency: 1 µs RTT

• Zero copy + CPU bypass

4

User buffer DMA buffer NIC

Providers: !InfiniBand, RoCE,…

User bufferDMA bufferNIC

Page 5: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

RDMA in the datacenter

5

48 port 10 GbE switches

Switch RDMA Cost

Mellanox SX1012 YES $5,900

Cisco 5548UP NO $8,180

Juniper EX5440 NO $7,480

Page 6: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

In-memory KV stores

6

Webserver

Webserver

Webserver

Webserver

Webserver

memcached

memcached

Database

Interface: GET, PUT !

Requirements: • Low latency • High request rate

Page 7: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

RDMA basics

7

RDMA read:READ(local_buf, size, remote_addr)

RDMA write:WRITE(local_buf, size, remote_addr)

RNIC

Verbs

Page 8: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Life of a WRITE

1

23

4

5

6

Requester Responder

1: Request descriptor, PIO

2: Payload, DMA read

3: RDMA write request

4: Payload, DMA write

5: RDMA ACK

6: Completion, DMA write

8

CPU,RAM RNIC RNIC CPU,RAM

Page 9: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Recent systemsPilaf [ATC 2013]

FaRM-KV [NSDI 2014]: an example usage of FaRM

!

Approach: RDMA reads to access remote data structures

Reason: the allure of CPU bypass

9

Page 10: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

The price of CPU bypassKey-Value stores have an inherent level of indirection.

An index maps a keys to address. Values are stored separately.

Server’s DRAMValuesIndex

At least 2 RDMA reads required: ≧ 1 to fetch address 1 to fetch value

10

Not true if value is in index

Page 11: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

The price of CPU bypass

11

Page 12: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

The price of CPU bypass

READ #1 (fetch pointer)

Client

Server

12

Page 13: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

The price of CPU bypass

Client

Server

13

READ #2 (fetch value)

Page 14: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Our approach

Goal Main ideas

#1: Use a single round trip Request-reply with server CPU involvement + WRITEs faster than READs

#2. Increase throughput Low level verbs optimizations

#3. Improve scalability Use datagram transport

14

Page 15: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Client

Server

#1: Use a single round trip

WRITE #1 (request)

WRITE #2 (reply)

DRAM accesses

15

Page 16: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

#1: Use a single round trip

Operation Round Trips Operations at server’s RNIC

READ-based GET 2+ 2+ RDMA reads

HERD GET 1 2 RDMA writes

Lower latency High throughput

Page 17: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

S

C

C

C

C

C

Setup: Apt Cluster !192 nodes, 56 Gbps IB

17

Thro

ughp

ut (M

ops)

0

10

20

30

40

Payload size (Bytes)4 32 64 92 128 160 192 224 256

READ WRITE

RDMA WRITEs faster than READs

Page 18: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

RNIC CPU,RAM

ServerRDMA WRITE

18

RNIC CPU,RAM

ServerRDMA READ

RDMA write request

RDMA ACK

RDMA read request

RDMA read response

PCIe DMA write PCIe DMA read

Reason: PCIe writes faster than PCIe reads

RDMA WRITEs faster than READs

Page 19: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Request-reply throughput:

High-speed request-reply

S

C1

C8

Setup: one-to-one client-server communication

32 byte payloads

Thro

ughp

ut (M

ops)

0

10

20

30

Request-Reply READ

1

8

2 WRITEs

1 READ

19

2 READs

Page 20: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

#2: Increase throughput

CPU,RAM RNIC RNIC CPU,RAM

Client Server

WRITE #1: Request

WRITE #2: Response

Processing

20

Simple request-reply:

Page 21: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Optimize WRITEs

CPU,RAM RNIC RNIC CPU,RAM

1

23

4

5

6

Requester Responder

+inlining: encapsulate payload in request descriptor (2→1)

+unreliable: use unreliable transport (- 5)

+unsignaled: don’t ask for request completions (- 6)

21

Page 22: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

#2: Increase throughput

CPU,RAM RNIC RNIC CPU,RAM

Client Server

WRITE #1: Request

WRITE #2: Response

Processing

22

Optimized request-reply:

Page 23: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

#2: Increase throughput

Thro

ughp

ut (M

ops)

0

5

10

15

20

25

30

Request-Reply READ

basic +unreliable+unsignaled +inlined

S

C1

C8

Setup: one-to-one client-server communication

1

8

23

Page 24: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

#3: Improve scalability

S

C1

CN

Setup

1

N

Thro

ughp

ut (M

ops)

0

5

10

15

20

25

30

Number of client/server processes1 2 4 6 8 10 12 14 16

Request-Reply

24

Page 25: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

#3: Improve scalability

SRAM

State 1

State 2

State 3

State N

C1

C2

C3

CN ||state|| > SRAM

Clients

Page 26: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

SRAM

#3: Improve scalability

C1

C2

C3

Inbound scalability ≫ outbound because inbound state ( ) outbound ( ) ≪

Use datagram for outbound replies

Datagram only supports SEND/RECV. SEND/RECV is slow.

SEND/RECV is slow only at the receiver

Page 27: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Scalable request-reply

27

Thro

ughp

ut (M

ops)

0

10

20

30

40

Number of client/server processes1 2 4 6 8 10 12 14 16

Request-Reply (Naive)Request Reply (Hybrid)

S

C1

CN

Setup

1

N

RDMA write, connectedSEND, datagram

Page 28: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

EvaluationHERD = Request-Reply + MICA [NSDI 2014]

!

Compare against emulated versions of Pilaf and FaRM-KV

• No datastore

• Focus on maximum performance achievable

28

Page 29: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Late

ncy

(mic

rose

cond

s)

0

4

8

12

Throughput (Mops)0 5 10 15 20 25 30

HERD

Latency vs throughput

95th percentile

5th percentile

29

48 byte items, GET intensive workload

26 Mops, 5 µs

Low load, 3.4 µs

Page 30: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Latency vs throughput48 byte items, GET intensive workload

Late

ncy

(mic

rose

cond

s)

0

3

6

9

12

Throughput (Mops)0 5 10 15 20 25 30

Emulated Pilaf Emulated FaRM-KV HERD95th percentile

5th percentile

26 Mops, 5 µs

12 Mops, 8 µs

Low load, 3.4 µs

30

Page 31: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Thro

ughp

ut (M

ops)

0

10

20

30

Value size (Bytes)4 8 16 32 64 128 256 512 1024

Emulated Pilaf Emulated FaRM-KVHERD

Throughput comparison

31

2X higher

16 byte keys, 95% GET workload

Page 32: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

HERD• Re-designing RDMA-based KV stores to use a

single round trip

• WRITEs outperform READs

• Reduce PCIe and InfiniBand transactions

• Embrace SEND/RECV

• Code is online: https://github.com/efficient/HERD

32

Page 33: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Throughput comparisonTh

roug

hput

(Mop

s)

0

10

20

30

Value size4 8 16 32 64 128 256 512 1024

Emulated Pilaf Emulated FaRM-KVHERD READ

Faster than RDMA reads

33

16 byte keys, 95% GET workload

Page 34: Using RDMA Efficiently for Key-Value Services · HERD 3 1. Improved understanding of RDMA through micro-benchmarking 2. High-performance key-value system: • Throughput: 26 Mops

Throughput comparisonTh

roug

hput

(Mop

s)

0

5

10

15

20

25

30

Emulated Pilaf Emulated FaRM-KV HERD

5% PUT 50% PUT 100% PUT

48 byte items

34