The Cloud Without the Fluff: Rethinking Resource Disaggregation
HPTS 2019

Ryan Stutsman
University of Utah
Utah Scalable Computer Systems Lab
There are smart people behind this work
Outline
1. Basically, a rant on RDMA
2. Splinter, some real work we’ve done
3. Raving aspirations & fawning over other people’s work
RDMA Rant
The Bad Old Days

Tight coupling → poor utilization
Complicated fault-tolerance

(diagram: compute and storage tied together on each machine)
Disaggregation to the Rescue

Decouple compute & storage using the network
Provision at idle capacity
Scale independently

Key: tenant densities at compute and storage are very different:
10s of VMs vs. 1000s of storage tenants per machine.

(diagram: compute tier connected to storage tier over the network)
The Fly in the Ointment

High-delay network stacks leave compute stalled
Massive data movement destroys efficiency
Can run tenant logic to avoid some movement

(diagram: latency and data movement between compute and storage)
1 μs @ 200 Gbps!
Transformative Performance! … Right?
1-sided RDMAs in approximately one slide

Memory-mapped I/O over PCIe to post an RDMA descriptor:
remote virtual address, length, & local address target

Descriptor: READ, Remote VA, Local VA
The local NIC forwards the RDMA READ operation to the remote NIC.
The remote NIC transfers the requested data back to the client, and the
local NIC DMAs it into the chosen destination — no CPU involvement at the
remote machine.
2-sided RDMAs (for completeness)

The remote machine gets a notification of the received message.
Still uses fast DMA, but “activates” the remote CPU
(which would usually be polling for the message).

Descriptor: SEND, Local VA
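The 2-sided pattern above can be sketched in miniature. This is a hypothetical model, not the verbs API: a dict stands in for the posted receive buffer, `nic_deliver` plays the remote NIC's DMA-plus-flag step, and `cpu_poll` is the remote CPU spinning for arrival.

```python
# Sketch (illustrative, not the RDMA verbs API) of 2-sided messaging:
# the sender's NIC DMAs the payload into a posted receive buffer, then
# the remote CPU discovers it by polling a flag rather than taking an
# interrupt — the "activation" mentioned on the slide.

recv_buffer = {"ready": False, "payload": None}

def nic_deliver(payload):
    # What the remote NIC does on SEND arrival: DMA the data, set the flag.
    recv_buffer["payload"] = payload
    recv_buffer["ready"] = True

def cpu_poll():
    # The remote CPU busy-polls the flag, then consumes the message.
    while not recv_buffer["ready"]:
        pass
    recv_buffer["ready"] = False
    return recv_buffer["payload"]

nic_deliver(b"hello")
print(cpu_poll())  # b'hello'
```

The polling loop is what makes 2-sided ops cost remote CPU cycles even when the DMA itself is free.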
The Tricky Part of 1-sided RDMA

A client-side get(k) against remote memory takes multiple reads:
rdma_read(p + h(k) % n) to fetch the bucket entry, rdma_read(p’) to chase
the value pointer, then strcmp at the client to check the key.

1. “Read amplification” traversing server structures
2. Tight coupling of client logic to server layout
3. Complex synchronization
   a. Access from other threads?
   b. GC/compaction/defrag/hash table resize?
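The multi-round-trip lookup can be simulated to make the read amplification concrete. This is a toy sketch, not any real system's code: the `RemoteMemory` class, its region dicts, and `remote_get` are all invented for illustration, and each `rdma_read` call counts one network round trip.

```python
# Sketch (hypothetical, not Splinter's or FaRM's code): a client-side
# hash-table lookup over 1-sided RDMA READs, counting round trips to
# show read amplification.

class RemoteMemory:
    """Stand-in for server DRAM, reachable only via rdma_read."""
    def __init__(self):
        self.round_trips = 0
        # Invented layout: a bucket array mapping hash slots to value
        # pointers, and a region of records addressed by those pointers.
        self.buckets = {}   # slot -> pointer
        self.values = {}    # pointer -> (key, value)

    def rdma_read(self, region, addr):
        self.round_trips += 1  # every read is a full network round trip
        return region[addr]

def remote_get(mem, key, n_buckets=64):
    # READ 1: fetch the bucket entry at p + h(k) % n.
    ptr = mem.rdma_read(mem.buckets, hash(key) % n_buckets)
    # READ 2: chase the pointer, then compare keys client-side (strcmp).
    stored_key, value = mem.rdma_read(mem.values, ptr)
    if stored_key != key:
        return None  # collision: a real client would keep chasing
    return value

mem = RemoteMemory()
mem.buckets[hash("k") % 64] = 0x1000
mem.values[0x1000] = ("k", "v")
print(remote_get(mem, "k"), mem.round_trips)  # → v 2
```

Even the no-collision fast path costs two round trips, and the client logic is hard-wired to the server's bucket layout — exactly points 1 and 2 on the slide.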
1-sided RDMA

10-100⨉ latency improvement: ms delays → μs delays
Restricted programming interface
Explosion in data movement
Fast and Dumb?
Catch-22

> 10,000 machines blasting data at 200 Gbps doesn’t make sense.
But making the storage tier programmable risks locking the compute
and storage tiers together.

We can’t go back to poor scaling, utilization, and fault-tolerance!
Splinter [OSDI’18]
The Splinter Vision in a Nutshell

What would it take to solve this?

μs-scale storage-level functions
10s of millions of invocations per second per server (2-sided ops!)
Native code performance
Strong isolation for ≥ 10,000 tenants
Dynamic placement of function execution

Avoids data movement without coupling compute and storage.
Splinter in Action: Simple CRUD
Kernel-bypass & Zero-copy Networking
Lookup of Small Data Item → ~10 μs
Splinter in Action: UDF Invocation
Invocation of Small User-defined Function → ~10 μs
Internal Dispatch and Request Processing

(diagram of the storage server: the RNIC feeds tenant-based NIC-level
dispatch into DPDK receive queues; task scheduling runs tenant-provided
functions over tenant data)
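The tenant-based dispatch stage above can be sketched as a toy model. This is hypothetical, not Splinter's actual dispatch code: `nic_dispatch`, `poll_queue`, and the queue count are invented names, and plain deques stand in for DPDK receive queues.

```python
# Sketch (hypothetical, not Splinter's code) of tenant-based NIC-level
# dispatch: the NIC steers each request to a receive queue by hashing
# its tenant id, so each worker core busy-polls only its own queue and
# cores never contend on a shared dispatch lock.

from collections import deque

N_QUEUES = 8  # one receive queue per worker core
recv_queues = [deque() for _ in range(N_QUEUES)]

def nic_dispatch(tenant_id, request):
    # Flow steering: the tenant id picks the receive queue (and core).
    recv_queues[tenant_id % N_QUEUES].append((tenant_id, request))

def poll_queue(core_id):
    # Each worker core polls its queue; returns the next task or None.
    q = recv_queues[core_id]
    return q.popleft() if q else None

nic_dispatch(tenant_id=42, request="get(k)")
print(poll_queue(42 % N_QUEUES))  # → (42, 'get(k)')
```

Keeping dispatch per-tenant at the NIC is what lets one server multiplex thousands of storage tenants without a software dispatcher on the critical path.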
Facebook TAO

“Onloading” assoc_range LinkBench ops improves throughput.
Take with a massive grain of salt: old version of FaRM, different
hardware, older NICs, etc.
Shows promise — combining is actually best.
Problem: only works if the server isn’t CPU-bottlenecked.

(chart: ops/s/core from 0 to 400,000, comparing Splinter and FaRM)
Splinter in Action
Dynamically shift invocations to avoid bottlenecks
Exploit idle compute anywhere
No data movement in the normal case
No tight coupling of compute and storage
Applying Others’ Amazing Ideas
Two Key Challenges
● Fine-grained scheduling
  ○ Head-of-line blocking → preemption → inefficient kthreads?
  ○ New ideas virtualizing the CPU interrupt controller
  ○ Now possible to preempt tasks at 5 μs granularity
  ○ Great for tail latency!
● Strong protection
  ○ Speed → native code → hardware isolation → slow
  ○ ≥10,000 functions → context switches → low throughput, utilization
  ○ Fast ring 3 to ring 3 address space switching with VMFUNC
The Benefit of Fine-grained Preemption
Simulated 32 cores with open-loop load clients
99.9% 1 μs requests with 0.1% 500 μs requests
Head-of-line blocking destroys response times unless quanta are ~5 μs

(chart: 3x throughput improvement at a 10x median-to-tail latency ratio)
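A back-of-envelope model shows why the quantum matters for the workload mix on this slide. This is my own arithmetic sketch, not the paper's simulator: it considers the worst case of one 1 μs request landing just behind one 500 μs request on a single core.

```python
# Back-of-envelope model (not the paper's 32-core simulator) of
# head-of-line blocking: a 500 us request arrives just ahead of a
# 1 us request on the same core.

LONG, SHORT, QUANTUM = 500.0, 1.0, 5.0  # all in microseconds

# Run-to-completion: the short request waits out the entire long one.
wait_no_preempt = LONG

# With a 5 us preemption quantum, the short request waits at most one
# quantum before the scheduler can switch to it.
wait_preempt = QUANTUM

print(wait_no_preempt / wait_preempt)  # 100.0 — 100x better worst-case wait
```

For a workload that is 99.9% 1 μs requests, that worst-case wait is what dominates the tail, which is why the slide's response times collapse without ~5 μs quanta.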
New Hardware Protection Scales Better

Conventional page-table switching cuts throughput roughly in half
New hardware schemes give good performance & VM isolation
Fast enough that a DB/storage system can use it ubiquitously

(chart: up to 1.7x throughput improvement)
Current Status
(Mostly) cooperative + language-based protection
Works but struggles if functions run long
Vulnerable to speculative execution attacks
Currently assessing secondary impacts of fine-grained preemption and protection
I-cache and TLB pressure likely to be significant
Comparison with Morsel Parallelism
Does this matter if a store JITs SQL queries?
SQL can be compiled to be cooperative
Can break work into granular chunks and balance

But:
This doesn’t help with UDFs
Only works if the JIT’ed code is trusted
And it still might be less efficient
(preemption can have zero cost if it isn’t triggered)
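The "granular chunks" idea above can be sketched as a tiny cooperative scheduler. This is an illustrative toy, not HyPer's or any real engine's code: `scan_in_morsels` and `round_robin` are invented names, and a generator's `yield` stands in for the compiled query's cooperative yield points.

```python
# Sketch (illustrative, not a real engine) of morsel-style cooperation:
# a scan is compiled into small chunks ("morsels") that yield control,
# so a runtime can interleave queries without hardware preemption.

def scan_in_morsels(table, morsel_size=4):
    """Yield control after each small chunk of rows."""
    for start in range(0, len(table), morsel_size):
        # Process one morsel, then yield so the scheduler can run others.
        yield sum(table[start:start + morsel_size])

def round_robin(tasks):
    """A trivial cooperative scheduler interleaving per-morsel work."""
    results = [0] * len(tasks)
    pending = list(enumerate(tasks))
    while pending:
        still = []
        for i, task in pending:
            try:
                results[i] += next(task)  # run exactly one morsel
                still.append((i, task))
            except StopIteration:
                pass
        pending = still
    return results

tables = [[1] * 10, [2] * 10]
print(round_robin([scan_in_morsels(t) for t in tables]))  # [10, 20]
```

The catch, as the slide notes, is that this only works when the code between yields is generated by a trusted JIT; an arbitrary tenant UDF can simply refuse to yield, which is where hardware preemption comes back in.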
Summary
Fast networks don’t solve disaggregation inefficiency
Delivering μs-scale kernel-bypass performance at scale is more than a network problem
Current scheduling and protection approaches break down
Storage layer/database will need fine-grained control over scheduling and protection
Image Attributions:
Blue flame: public domain; wikipedia.com
Network card icon: made by srip from www.flaticon.com
CPU icon: made by monkik from www.flaticon.com
Memory icon: made by Smashicons from www.flaticon.com