
Page 1: The Cloud Without the Fluff - HPTS - 2019 Home

The Cloud Without the Fluff: Rethinking Resource Disaggregation

Ryan Stutsman, University of Utah

Utah Scalable Computer Systems Lab

Page 2

There are smart people behind this work

Page 3

Outline

1. Basically, a rant on RDMA
2. Splinter, some real work we’ve done
3. Raving aspirations & fawning over other people’s work

Page 4

RDMA Rant

Page 5

Tight Coupling → Poor Utilization

Complicated Fault-tolerance

The Bad Old Days

Compute

Storage

Page 9

Decouple Compute & Storage using Network

Provision at Idle Capacity

Scale Independently

Disaggregation to the Rescue

Compute

Storage

Key: tenant densities at compute and storage are very different.

10s of VMs vs. 1000s of storage tenants per machine

Page 10

High-delay Network Stacks Leave Compute Stalled

Massive Data Movement Destroys Efficiency

Can Run Tenant Logic to Avoid Some Movement

The Fly in the Ointment

Compute

Storage

Latency

Data Movement

Page 11

1 μs @ 200 Gbps!
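For scale, a quick sanity check of what "1 μs @ 200 Gbps" buys at raw line rate (a sketch that ignores headers, PCIe, and NIC overheads):

```python
# Back-of-the-envelope: how many bytes fit in one microsecond at 200 Gbps?
link_gbps = 200
bytes_per_second = link_gbps * 1e9 / 8        # 25 GB/s at raw line rate
bytes_per_microsecond = bytes_per_second / 1e6

print(bytes_per_microsecond)                  # 25000.0, i.e. ~25 KB per microsecond
```

So one network round trip costs about as long as moving 25 KB; for small requests, latency and CPU overheads dominate, not bandwidth.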

Page 12

Transformative Performance! … Right?

Page 13

1-sided RDMAs in approximately one slide

Memory-mapped I/O over PCIe to post RDMA descriptor

Remote virtual address, length, & local address target

READ, Remote VA, Local VA

Page 14

1-sided RDMAs in approximately one slide

Local NIC forwards the RDMA READ operation to the remote NIC

READ, Remote VA, Local VA

Page 15

1-sided RDMAs in approximately one slide

Remote NIC transfers the requested data back to the client

Local NIC DMAs the arriving data into the chosen destination; no CPU involvement at the remote machine

READ, Remote VA, Local VA

Page 16

2-sided RDMAs (for completeness)

Remote machine gets a notification of the received message

Still uses fast DMA, but “activates” the remote CPU (which would usually be polling for the message)

SEND, Local VA
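The control-flow difference can be shown with a toy model (Python, not the verbs API): the NIC delivers the message via DMA, but the completion is only observed when a server core polls for it.

```python
from collections import deque

# Toy model of 2-sided messaging: the NIC deposits the message into host
# memory, but a server core must notice it, typically by busy-polling.
recv_queue = deque()

def nic_dma_deliver(msg):
    # The NIC's work: write the payload and post a completion. No CPU yet.
    recv_queue.append(msg)

def server_poll_once():
    # The "activated" CPU's work: check for and drain one completion.
    return recv_queue.popleft() if recv_queue else None

nic_dma_deliver(b"get(k)")
print(server_poll_once())   # b'get(k)'
```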

Page 20

The Tricky Part of 1-sided RDMA

Local client: get(k)?

rdma_read(p + h(k) % n) → read the bucket, yielding entry pointer p’
rdma_read(p’) → read the entry (k, v)
strcmp → verify the key matches

Remote server: hash table at base p; a bucket holds a pointer p’ to the entry storing k and v

1. “Read amplification” traversing server structures
2. Tight coupling of client logic to server layout
3. Complex synchronization
  a. Access from other threads?
  b. GC/compaction/defrag/hash table resize?

Page 23

10-100⨉ latency improvement

Restricted Programming Interface

Explosion in Data Movement

1-sided RDMA

Compute

Storage

ms delays → μs delays

Data Movement

Page 24

Fast and Dumb?

Page 25

Catch 22

> 10,000 machines blasting data at 200 Gbps doesn’t make sense

But, making the storage tier programmable risks locking the compute and storage tiers together

We can’t go back to poor scaling, utilization, and fault-tolerance!

Page 26

Splinter [OSDI’18]


Page 29

What would it take to solve this?

μs-scale Storage-level Functions

10s of Millions of Invocations per Second per Server (2-sided ops!)

Native Code Performance

Strong Isolation for ≥ 10,000 Tenants

Dynamic Placement of Function Execution

The Splinter Vision in a Nutshell

Avoids data movement without coupling compute and storage
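A toy way to see the data-movement point: compare bytes crossing the wire when aggregating a million small records by pulling them to compute versus running the function at the storage tier. The record count and sizes are illustrative, not from the paper.

```python
# Toy contrast: pull the data to compute vs. push the function to storage.
# "Bytes moved" is the cost that disaggregation pays on every operation.
records = list(range(1_000_000))     # lives on the storage server
RECORD_BYTES = 8

def pull_and_aggregate():
    moved = len(records) * RECORD_BYTES   # every record crosses the network
    return sum(records), moved

def push_function():
    result = sum(records)                 # runs at the storage server
    return result, RECORD_BYTES           # only the 8-byte sum moves

r1, moved_pull = pull_and_aggregate()
r2, moved_push = push_function()
print(moved_pull // moved_push)           # 1000000: a million-fold difference
```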

Page 30

Splinter in Action: Simple CRUD

Kernel-bypass & Zero-copy Networking

Lookup of Small Data Item → ~10 μs

Page 31

Splinter in Action: UDF Invocation

Invocation of Small User-defined Function → ~10 μs

Page 32

Internal Dispatch and Request Processing

Storage Server components: RNIC, Tenants, Tenant-Based NIC-level Dispatch, Task Scheduling, DPDK Receive Queues, Tenant-provided Functions, Tenant Data
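A minimal sketch of tenant-based dispatch, assuming (for illustration only) that steering is a simple modulo of the tenant id onto a receive queue; the real system uses NIC-level flow steering and is far more involved.

```python
# Toy tenant-based dispatch: the "NIC" steers each tenant's requests to a
# fixed receive queue, so one core owns all of a tenant's state and the
# tenant-provided function runs without cross-core synchronization.
N_QUEUES = 4
recv_queues = [[] for _ in range(N_QUEUES)]

def nic_dispatch(tenant_id, request):
    # Steering stand-in: same tenant always lands on the same queue.
    recv_queues[tenant_id % N_QUEUES].append((tenant_id, request))

def worker(core_id, functions):
    # Each core drains its own queue, running each tenant's function.
    out = []
    for tenant_id, req in recv_queues[core_id]:
        out.append(functions[tenant_id](req))
    return out

# Hypothetical tenant-provided functions keyed by tenant id.
functions = {0: lambda r: r.upper(), 4: lambda r: r[::-1]}
nic_dispatch(0, "get")       # tenant 0 -> queue 0
nic_dispatch(4, "put")       # tenant 4 -> queue 0 as well
print(worker(0, functions))  # ['GET', 'tup']
```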

Page 33

Facebook TAO

“Onloading” assoc_range LinkBench ops improves throughput

Take with a massive grain of salt: old version of FaRM, different hardware, older NICs, etc.

Shows promise; combining the two is actually best

Problem: only works if the server is not CPU-bottlenecked

[chart: ops/s/core from 0 to 400,000, Splinter vs. FaRM]

Page 34

Splinter in Action

Dynamically shift invocations to avoid bottlenecks

Exploit idle compute anywhere

No data movement in the normal case

No tight coupling of compute and storage

Page 35

Applying Others’ Amazing Ideas

Page 36

Two Key Challenges

● Fine-grained Scheduling
○ Head-of-line blocking → preemption → inefficient kthreads?
○ New ideas virtualizing the CPU interrupt controller
○ Now possible to preempt tasks at 5 μs granularity
○ Great for tail latency!

● Strong Protection
○ Speed → native code → hardware isolation → slow
○ ≥10,000 functions → context switches → low throughput, utilization
○ Fast ring 3 to ring 3 address space switching with VMFUNC

Page 37

The Benefit of Fine-grained Preemption

Simulated 32 cores with open-loop load clients

99.9% 1 μs requests with 0.1% 500 μs requests

Head-of-line blocking destroys response times unless quanta are ~5 μs

3x throughput improvement with 10x median-to-tail
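The head-of-line-blocking effect can be illustrated with toy single-core arithmetic in the spirit of the workload above: a 1 μs request that arrives just behind a 500 μs request (numbers from the slide; the model is a deliberate oversimplification of the real simulation).

```python
# Toy head-of-line blocking arithmetic on one core: a short request
# arrives just after a long one starts.
LONG, SHORT, QUANTUM = 500.0, 1.0, 5.0    # all in microseconds

# Run-to-completion: the short request waits out the entire long request.
latency_run_to_completion = LONG + SHORT

# 5 us preemption: the long request is preempted after one quantum, the
# short request runs to completion, and the long request resumes later.
latency_preemptive = QUANTUM + SHORT

print(latency_run_to_completion, latency_preemptive)   # 501.0 6.0
```

With run-to-completion the short request's latency is dominated by the long request; with a 5 μs quantum its wait is bounded by one quantum, which is why fine-grained preemption rescues the tail.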

Page 38

New Hw Protection Scales Better

Conventional page table switch cuts to ~½ throughput

New hw schemes give good performance & VM isolation

Fast enough that DB/storage can use ubiquitously

Up to 1.7x throughput improvement

Page 39

Current Status

(Mostly) cooperative + language-based protection

Works but struggles if functions run long

Vulnerable to speculative execution attacks

Currently assessing secondary impacts of fine-grained preemption and protection

I-cache and TLB pressure likely to be significant

Page 40

Comparison with Morsel Parallelism

Does this matter if a store JITs SQL queries?
SQL can be compiled to be cooperative
Can break work into granular chunks and balance it

But,
This doesn’t help with UDFs
Only works if JIT’ed code is trusted
And still might be less efficient

(preemption can have 0 cost if it isn’t triggered)
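The "compiled to be cooperative" idea can be sketched with generators: each query yields between fixed-size morsels, so a trivial scheduler can interleave tenants without preemption. This is illustrative, not any particular engine's design, and it works only if every task actually yields, which is exactly the trust problem noted above.

```python
def scan_sum(table, morsel_size=4):
    # A query "compiled to be cooperative": process one fixed-size morsel,
    # then yield control back to the scheduler.
    total = 0
    for i in range(0, len(table), morsel_size):
        total += sum(table[i:i + morsel_size])
        yield None           # cooperative yield point between morsels
    yield total              # final answer

def round_robin(tasks):
    # Interleave runnable tasks. Correctness depends on every task
    # yielding promptly; a non-cooperative UDF would stall everyone.
    results = {}
    while tasks:
        name, gen = tasks.pop(0)
        out = next(gen)
        if out is None:
            tasks.append((name, gen))   # morsel done, task not finished
        else:
            results[name] = out         # query finished
    return results

print(round_robin([("A", scan_sum(list(range(10)))),
                   ("B", scan_sum(list(range(5))))]))
# {'B': 10, 'A': 45}
```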

Page 41

Summary

Fast networks don’t solve disaggregation inefficiency

Delivering μs-scale kernel-bypass performance at scale is more than a network problem

Current scheduling and protection approaches break down

Storage layer/database will need fine-grained control over scheduling and protection

Page 42

Page 43

Image Attributions:

Blue Flame: Public Domain; wikipedia.com
Network card icon: Icon made by srip from www.flaticon.com
CPU icon: Icon made by monkik from www.flaticon.com
Memory icon: Icon made by Smashicons from www.flaticon.com