TRANSCRIPT
Using RDMA Efficiently for Key-Value Services
Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU)
RDMA
Remote Direct Memory Access: A network feature that allows direct access to the memory of a remote computer.
HERD
1. Improved understanding of RDMA through micro-benchmarking
2. High-performance key-value system:
• Throughput: 26 Mops (2X higher than others)
• Latency: 5 µs (2X lower than others)
RDMA intro
Features:
• Ultra-low latency: 1 µs RTT
• Zero copy + CPU bypass
[Figure: user buffer → DMA buffer → NIC]
Providers: InfiniBand, RoCE, …
RDMA in the datacenter
48-port 10 GbE switches:
Switch            RDMA   Cost
Mellanox SX1012   YES    $5,900
Cisco 5548UP      NO     $8,180
Juniper EX5440    NO     $7,480
In-memory KV stores
[Figure: webservers issuing requests to memcached servers, backed by a database]
Interface: GET, PUT
Requirements:
• Low latency
• High request rate
RDMA basics
RDMA read: READ(local_buf, size, remote_addr)
RDMA write: WRITE(local_buf, size, remote_addr)
[Figure: verbs issued by the CPU to the RNIC]
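These READ/WRITE calls are one-sided verbs posted to the RNIC. As a concrete illustration, here is a minimal sketch with libibverbs (not HERD's code; the buffer, keys, and queue pair are assumed to be set up during connection establishment):

```c
/* Minimal sketch (not HERD's code): posting a one-sided RDMA WRITE with
 * libibverbs. local_buf must lie in a registered memory region (lkey), and
 * remote_addr/rkey come from the responder during connection setup. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_rdma_write(struct ibv_qp *qp, void *local_buf, uint32_t size,
                           uint32_t lkey, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) local_buf,   /* source buffer */
        .length = size,
        .lkey   = lkey,                    /* key of the registered MR */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* IBV_WR_RDMA_READ for a read */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* address at the responder */
    wr.wr.rdma.rkey        = rkey;               /* responder's MR key */

    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
}
```

Changing the opcode to IBV_WR_RDMA_READ turns the same work request into an RDMA read of remote memory into local_buf.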
Life of a WRITE
[Figure: requester and responder, each with a CPU/RAM and an RNIC]
1: Request descriptor, PIO
2: Payload, DMA read
3: RDMA write request
4: Payload, DMA write
5: RDMA ACK
6: Completion, DMA write
Recent systems
Pilaf [ATC 2013]
FaRM-KV [NSDI 2014]: an example usage of FaRM
Approach: RDMA reads to access remote data structures
Reason: the allure of CPU bypass
The price of CPU bypass
Key-value stores have an inherent level of indirection: an index maps a key to an address, and values are stored separately.
[Figure: server's DRAM holding the index and the values]
At least 2 RDMA reads are required: ≥ 1 to fetch the address, 1 to fetch the value.
(Not true if the value is stored in the index.)
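To make the indirection concrete, here is a hedged sketch of a READ-based GET in the style described above; rdma_read_sync() and bucket_addr() are hypothetical helpers, and the entry layout is illustrative rather than Pilaf's or FaRM-KV's actual format:

```c
/* Sketch of a READ-based GET (not Pilaf's or FaRM-KV's actual code).
 * rdma_read_sync() is a hypothetical helper that posts an RDMA READ and
 * waits for its completion; bucket_addr() is a hypothetical index hash. */
#include <stdint.h>

struct index_entry {
    uint64_t value_addr;   /* remote address of the value */
    uint32_t value_len;    /* length of the value in bytes */
};

extern void rdma_read_sync(void *dst, uint64_t remote_addr, uint32_t len);
extern uint64_t bucket_addr(const char *key);

static void read_based_get(const char *key, void *value_buf)
{
    struct index_entry e;

    /* Round trip #1: fetch the index entry, i.e. the pointer to the value. */
    rdma_read_sync(&e, bucket_addr(key), sizeof(e));

    /* Round trip #2: fetch the value itself. */
    rdma_read_sync(value_buf, e.value_addr, e.value_len);
}
```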
The price of CPU bypass
[Figure: client and server]
READ #1 (fetch pointer)
READ #2 (fetch value)
Our approach
Goal                          Main ideas
#1: Use a single round trip   Request-reply with server CPU involvement + WRITEs faster than READs
#2: Increase throughput       Low-level verbs optimizations
#3: Improve scalability       Use datagram transport
#1: Use a single round trip
[Figure: client and server; WRITE #1 (request), DRAM accesses at the server, WRITE #2 (reply)]
Operation        Round trips   Operations at server's RNIC
READ-based GET   2+            2+ RDMA reads
HERD GET         1             2 RDMA writes
Lower latency, high throughput. (A sketch of the server's polling loop follows.)
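Here is a minimal sketch of what "server CPU involvement" can look like, under assumed names and slot layout: clients deposit requests into per-client slots with RDMA WRITEs, and the server polls those slots in its own DRAM.

```c
/* Sketch under assumed names and layout (not HERD's actual request format):
 * clients deposit requests into per-client slots with RDMA WRITEs; the
 * server's CPU polls the slots in local DRAM and answers each request. */
#include <stdint.h>

#define NUM_CLIENTS 16

struct req_slot {
    volatile uint8_t valid;   /* written last by the client's RDMA WRITE */
    uint8_t  op;              /* GET or PUT */
    uint64_t key_hash;
    uint8_t  payload[48];     /* value for PUTs, scratch for GETs */
};

extern struct req_slot req_region[NUM_CLIENTS];              /* RDMA-writable */
extern void handle_request(int client, struct req_slot *r);  /* lookup + reply */

static void server_poll_loop(void)
{
    for (;;) {
        for (int c = 0; c < NUM_CLIENTS; c++) {
            struct req_slot *r = &req_region[c];
            if (r->valid) {            /* a new request has landed here */
                handle_request(c, r);  /* local DRAM lookup, then reply WRITE */
                r->valid = 0;          /* free the slot for the next request */
            }
        }
    }
}
```

The polling loop touches only local DRAM; the reply then goes back as the second WRITE in the figure above, so each GET costs one network round trip.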
RDMA WRITEs faster than READs
Setup: Apt cluster, 192 nodes, 56 Gbps InfiniBand
[Figure: one server S and several clients C]
[Figure: throughput (Mops) vs. payload size (4-256 bytes) for READ and WRITE]
RDMA WRITEs faster than READs
[Figure: at the server, an RDMA WRITE involves an RDMA write request, a PCIe DMA write, and an RDMA ACK; an RDMA READ involves an RDMA read request, a PCIe DMA read, and an RDMA read response]
Reason: PCIe writes are faster than PCIe reads
High-speed request-reply
Setup: one-to-one client-server communication (server S, clients C1–C8), 32-byte payloads
Request-reply throughput:
[Figure: throughput (Mops) of request-reply (2 WRITEs), READ (1 READ), and 2 READs]
#2: Increase throughput
Simple request-reply:
[Figure: client and server; WRITE #1 (request), processing at the server, WRITE #2 (response)]
Optimize WRITEs
[Figure: the six steps in the life of a WRITE, at requester and responder]
+inlining: encapsulate the payload in the request descriptor (merges step 2 into step 1)
+unreliable: use unreliable transport (removes step 5, the ACK)
+unsignaled: don't ask for request completions (removes step 6)
A sketch of these options in verbs follows.
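The sketch below shows the three options with libibverbs flags (not HERD's exact code): the queue pair is assumed to be a UC (Unreliable Connection) QP created elsewhere, so no ACK is generated; IBV_SEND_INLINE folds the payload into the descriptor; and omitting IBV_SEND_SIGNALED suppresses the completion.

```c
/* Sketch of the optimized request WRITE, assuming a UC queue pair created
 * elsewhere. The payload size must fit in the QP's max_inline_data. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_optimized_write(struct ibv_qp *uc_qp, void *payload,
                                uint32_t size, uint64_t remote_addr,
                                uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) payload,
        .length = size,
        /* No lkey needed: inlined data is copied by the CPU into the
         * descriptor, so the RNIC never DMA-reads the source buffer. */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_INLINE;    /* +inlined, +unsignaled */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(uc_qp, &wr, &bad_wr);
}
```

With unsignaled sends, the sender must still post an occasional signaled work request so the send queue can be reclaimed; HERD's exact bookkeeping is not shown here.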
#2: Increase throughput
Optimized request-reply:
[Figure: client and server; WRITE #1 (request), processing at the server, WRITE #2 (response), with fewer PCIe and InfiniBand transactions]
#2: Increase throughput
Setup: one-to-one client-server communication (server S, clients C1–C8)
[Figure: request-reply throughput (Mops) for basic, +unreliable, +unsignaled, and +inlined, compared with READ]
#3: Improve scalability
Setup: server S, clients C1–CN
[Figure: request-reply throughput (Mops) vs. number of client/server processes (1-16)]
#3: Improve scalability
[Figure: clients C1–CN each require connection state (State 1 … State N) in the RNIC's SRAM; with many clients, ‖state‖ > SRAM]
#3: Improve scalability
[Figure: clients C1, C2, C3 connected to the server]
Inbound scalability ≫ outbound, because inbound state ≪ outbound state.
Use datagram transport for outbound replies.
Datagram only supports SEND/RECV, and SEND/RECV is slow; however, it is slow only at the receiver.
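Since the replies go out over a datagram (UD) queue pair, they use SEND rather than RDMA WRITE. A minimal sketch with libibverbs, assuming the address handle and the client's QP number and queue key were exchanged during setup (names are illustrative, not HERD's code):

```c
/* Sketch of a reply SEND over a UD (datagram) queue pair. The address
 * handle (ah) identifies the client; qpn/qkey come from setup. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_ud_reply(struct ibv_qp *ud_qp, struct ibv_ah *ah,
                         uint32_t remote_qpn, uint32_t remote_qkey,
                         void *reply, uint32_t size)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) reply,
        .length = size,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode            = IBV_WR_SEND;
    wr.sg_list           = &sge;
    wr.num_sge           = 1;
    wr.send_flags        = IBV_SEND_INLINE;   /* small replies can be inlined */
    wr.wr.ud.ah          = ah;                /* identifies this client */
    wr.wr.ud.remote_qpn  = remote_qpn;
    wr.wr.ud.remote_qkey = remote_qkey;

    return ibv_post_send(ud_qp, &wr, &bad_wr);
}
```

The expensive part of SEND/RECV, posting and consuming RECVs, is paid by the clients here, consistent with the point above that SEND/RECV is slow only at the receiver.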
Scalable request-reply
Setup: server S, clients C1–CN; requests use RDMA WRITE over connected transport, replies use SEND over datagram transport
[Figure: throughput (Mops) vs. number of client/server processes (1-16) for naive and hybrid request-reply]
Evaluation
HERD = Request-Reply + MICA [NSDI 2014]
Compare against emulated versions of Pilaf and FaRM-KV:
• No datastore
• Focus on maximum performance achievable
Latency vs throughput
48-byte items, GET-intensive workload
[Figure: HERD latency (µs, 5th and 95th percentiles) vs. throughput (Mops); 3.4 µs at low load, 5 µs at 26 Mops]
Latency vs throughput
48-byte items, GET-intensive workload
[Figure: latency (µs, 5th and 95th percentiles) vs. throughput (Mops) for emulated Pilaf, emulated FaRM-KV, and HERD; HERD reaches 26 Mops at 5 µs, the emulated systems about 12 Mops at 8 µs; low-load latency is 3.4 µs]
Throughput comparison
16-byte keys, 95% GET workload
[Figure: throughput (Mops) vs. value size (4-1024 bytes) for emulated Pilaf, emulated FaRM-KV, and HERD; HERD is 2X higher]
HERD
• Re-designing RDMA-based KV stores to use a single round trip
• WRITEs outperform READs
• Reduce PCIe and InfiniBand transactions
• Embrace SEND/RECV
• Code is online: https://github.com/efficient/HERD
Throughput comparison: faster than RDMA reads
16-byte keys, 95% GET workload
[Figure: throughput (Mops) vs. value size (4-1024 bytes) for emulated Pilaf, emulated FaRM-KV, HERD, and raw READs]
Throughput comparison
48-byte items
[Figure: throughput (Mops) of emulated Pilaf, emulated FaRM-KV, and HERD at 5% PUT, 50% PUT, and 100% PUT]