Assise: Performance and Availability via Client-local NVM in a Distributed File System
Thomas Anderson, Marco Canini, Jongyul Kim, Dejan Kostić, Youngjin Kwon, Simon Peter, Waleed Reda, Henry N. Schuh, and Emmett Witchel


Commercial adoption of NVM
• Persistent across failures
[Architecture diagram, built up over several slides: in today's distributed file systems (e.g. Ceph, NFS, …), the application runs in userspace, IO passes through the kernel, and a data daemon serves remote NVM over the network.]
Challenge 3: File system must rebuild caches of failed clients
[Diagram: a primary application with an FS client and buffer cache on Client 1, and a backup application with its own FS client and buffer cache on Client 2; after Client 1 fails, its buffer cache must be rebuilt.]
1. Low-latency IO: perform process-local and client-local IO
2. Linearizability and data crash consistency: CC-NVM, a distributed client-side NVM coherence protocol
3. High availability: fail over to cache-hot client replicas, recover client NVM caches
[Architecture diagram: each application process links LibFS and logs its writes at operation granularity into client-local NVM, alongside a per-node FS client. This gives local userspace data access (= low-latency IO) and no block amplification (= efficient IO).]
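The logging step above can be illustrated with a minimal C sketch. This is not Assise's actual code: the op_record layout, the log_append helper, and the use of PMDK's libpmem for persistence are assumptions chosen to show how an update can be appended at operation granularity to client-local NVM without a block-sized read-modify-write.

```c
/* Minimal sketch (not Assise's code) of logging a write at operation
 * granularity into a client-local NVM log.  Assumes PMDK's libpmem. */
#include <stdint.h>
#include <string.h>
#include <libpmem.h>

struct op_record {          /* hypothetical on-NVM log record */
    uint32_t type;          /* e.g. OP_WRITE, OP_CREATE, OP_RENAME */
    uint32_t data_len;      /* payload length in bytes             */
    uint64_t inode;         /* target file                         */
    uint64_t offset;        /* file offset of the write            */
    uint8_t  data[];        /* payload follows the header          */
};

/* Append one record to the process-local NVM log and persist it.
 * `log_base` is an NVM region mapped into userspace; `log_tail` is the
 * current append offset.  Only the operation itself is written, so
 * there is no block amplification. */
static uint64_t log_append(uint8_t *log_base, uint64_t log_tail,
                           const struct op_record *hdr,
                           const void *payload)
{
    uint8_t *dst = log_base + log_tail;

    /* Copy header + payload directly into NVM from userspace. */
    pmem_memcpy_nodrain(dst, hdr, sizeof(*hdr));
    pmem_memcpy_nodrain(dst + sizeof(*hdr), payload, hdr->data_len);

    /* Flush and fence so the record is durable before returning. */
    pmem_persist(dst, sizeof(*hdr) + hdr->data_len);

    return log_tail + sizeof(*hdr) + hdr->data_len;
}
```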
Hot remote client caches (= fast fail-over), crash consistent via RDMA write ordering
RPC via RDMA for read misses in client-local caches
[Diagram: the process-local update log is replicated over RDMA into the NVM of remote clients (via their SharedFS), keeping their caches hot; a read that misses in all client-local caches is served by an RDMA RPC to a remote client.]
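A read path consistent with the diagram above might look like the following sketch. The helper functions (lookup_local_log, lookup_local_nvm_cache, rpc_read_remote) are hypothetical placeholders, not Assise's API; the point is only the ordering: process-local log first, client-local NVM cache next, RDMA RPC to a remote client last.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical helpers (not Assise's API): each returns the number of
 * bytes read, or -1 if the requested range is not present locally. */
extern ssize_t lookup_local_log(uint64_t inode, uint64_t off,
                                void *buf, size_t len);
extern ssize_t lookup_local_nvm_cache(uint64_t inode, uint64_t off,
                                      void *buf, size_t len);
extern ssize_t rpc_read_remote(uint64_t inode, uint64_t off,
                               void *buf, size_t len);

/* Read path: check the caches closest to the process first and fall
 * back to a remote RPC over RDMA only on a miss. */
ssize_t assise_style_read(uint64_t inode, uint64_t off,
                          void *buf, size_t len)
{
    ssize_t n;

    /* 1. Process-local update log: holds this process's latest writes. */
    n = lookup_local_log(inode, off, buf, len);
    if (n >= 0)
        return n;

    /* 2. Client-local NVM cache shared by processes on this node. */
    n = lookup_local_nvm_cache(inode, off, buf, len);
    if (n >= 0)
        return n;

    /* 3. Miss everywhere locally: RPC over RDMA to a remote client. */
    return rpc_read_remote(inode, off, buf, len);
}
```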
Example: lease management
[Diagram, built up over several slides: an example of lease management involving a LibFS process.]
CC-NVM provides scalable access via hierarchical lease delegation
[Diagram: the cluster manager delegates leases to SharedFS on each client (e.g. Client 1), which sub-delegates them to local LibFS processes (Proc 2, Proc 3) backed by local NVM.]
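The hierarchical delegation above can be sketched as follows. The structures and helpers (sharedfs_holds, sharedfs_request_from_manager) are hypothetical; the sketch shows only the key property that a LibFS lease request is satisfied node-locally whenever its SharedFS already holds a covering lease, and otherwise costs one round trip to the cluster manager.

```c
/* Sketch of hierarchical lease delegation: cluster manager -> SharedFS
 * -> LibFS.  All names here are illustrative, not Assise's API. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

enum lease_type { LEASE_READ, LEASE_WRITE };

struct lease {                 /* a lease on a directory subtree */
    const char     *path;      /* subtree root, e.g. "/mail/user1" */
    enum lease_type type;
    uint64_t        expires;   /* expiry, seconds since the epoch */
};

/* Hypothetical upcalls: a LibFS asks its local SharedFS first; the
 * SharedFS contacts the cluster manager only if it does not already
 * hold a covering lease, so repeated accesses stay node-local. */
extern bool sharedfs_holds(const char *path, enum lease_type t);
extern struct lease sharedfs_request_from_manager(const char *path,
                                                  enum lease_type t);

struct lease libfs_acquire_lease(const char *path, enum lease_type t)
{
    if (!sharedfs_holds(path, t)) {
        /* Miss at the node level: one round trip to the cluster
         * manager fetches a lease covering the whole subtree. */
        return sharedfs_request_from_manager(path, t);
    }
    /* Node already holds a covering lease: delegate locally with no
     * network traffic at all (10 s expiry is an arbitrary choice). */
    return (struct lease){ .path = path, .type = t,
                           .expires = (uint64_t)time(NULL) + 10 };
}
```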
1. Low-latency IO: perform process-local and client-local IO
2. Linearizability and data crash consistency: CC-NVM, a distributed client-side NVM coherence protocol
3. High availability: fail over to cache-hot client replicas, recover client NVM caches
4. Exploit NUMA locality
6. Low-cost cold storage
Evaluation questions
• Latency: How close is Assise to raw, replicated NVM IO latency? How much faster is Assise than the state of the art?
• Scalability: By how much can CC-NVM improve multi-process and multi-node scalability?
• Availability: How quickly can Assise fail over?
• More results in the paper: large-scale distributed storage, cloud application benchmarks, re-establishing the replication factor, …
Experimental Setup
• 5× dual-socket Cascade Lake-SP servers with 48 cores @ 2.2 GHz
• All nodes connect to an InfiniBand switch via 40 Gbps ConnectX-3 NICs
• Each node has 6 TB of Optane DC NVM and 384 GB of DRAM (6× DIMMs of NVM and DRAM per NUMA node)
[Comparison table: Assise vs. Octopus, NFS, and Ceph on properties including local consistency and kernel bypass.]
Microbenchmark: Write Latency
• 1 GB burst of synchronous writes to a single file (a simplified stand-in is sketched below)
• Assise and Ceph replicate to an additional node
[Plot: write latency for Assise, Octopus, NFS, and Ceph; annotations attribute the 53× and 17× slowdowns of existing systems to kernel and IO amplification overheads and to context switch overhead.]
• Assise write latency is only 8% higher than raw NVM replicated write latency
• Assise is 2× faster than Octopus
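For reference, a simplified POSIX stand-in for this microbenchmark is shown below. It issues a 1 GB burst of synchronous writes to a single file and reports mean per-write latency; the 4 KB IO size and the use of O_SYNC are assumptions, and the sketch does not reproduce Assise's replication setup.

```c
/* 1 GB burst of synchronous 4 KB writes to one file; reports the mean
 * per-write latency.  Plain POSIX stand-in, for illustration only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define IO_SIZE 4096
#define TOTAL   (1ULL << 30)            /* 1 GB burst */

int main(void)
{
    char buf[IO_SIZE];
    memset(buf, 'a', sizeof(buf));

    int fd = open("testfile", O_CREAT | O_WRONLY | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (unsigned long long done = 0; done < TOTAL; done += IO_SIZE) {
        if (write(fd, buf, IO_SIZE) != IO_SIZE) {  /* O_SYNC makes   */
            perror("write");                       /* each write     */
            return 1;                              /* synchronous    */
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("mean write latency: %.2f us\n", ns / (TOTAL / IO_SIZE) / 1e3);

    close(fd);
    return 0;
}
```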
• Multi-process benchmark to test scalability
• Data and metadata operations (create file, write 4 KB, rename file); a per-process sketch follows below
• Embarrassingly parallel: processes write to their own private sub-directories
• Replication is disabled to eliminate network bottlenecks of our testbed
• Benchmark runs on 3 testbed nodes, with a balanced deployment of processes over nodes
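The per-process loop referenced above could look like the following POSIX sketch; the iteration count, file-naming scheme, and run_worker helper are illustrative assumptions.

```c
/* Per-process benchmark loop: create a file in the process's private
 * sub-directory, write 4 KB, then rename it.  Illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define ITERS 10000

int run_worker(const char *private_dir)
{
    char buf[4096], path[256], newpath[256];
    memset(buf, 'b', sizeof(buf));

    for (int i = 0; i < ITERS; i++) {
        snprintf(path, sizeof(path), "%s/f%d", private_dir, i);
        snprintf(newpath, sizeof(newpath), "%s/f%d.done", private_dir, i);

        int fd = open(path, O_CREAT | O_WRONLY, 0644);    /* create   */
        if (fd < 0)
            return -1;
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { /* write 4K */
            close(fd);
            return -1;
        }
        close(fd);
        if (rename(path, newpath) != 0)                   /* rename   */
            return -1;
    }
    return 0;
}
```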
[Scalability plot: throughput (Kops/s); the NVM write bandwidth limit is marked.]
Postfix: Scalable Mail Delivery
• We replay 70 GB worth of emails to a pool of Postfix Mail Delivery Agents
• Delivery to Maildir: each inbox is a directory that contains one file per mail message
• Real email trace; average mail size is 200 KB
• We use 3 nodes for mail delivery and 1 node for workload generation; delivery processes are distributed uniformly over nodes
• Mail delivery load balancing: (1) round robin, (2) sharded across clients by recipient (a sketch of the sharding policy follows below)
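The sharding policy referenced above can be sketched as a stable hash of the recipient address that picks a delivery client, so all mail for one inbox is created under the same client's NVM. The FNV-1a hash and the three-client example are illustrative choices, not details from the talk.

```c
/* Shard mail delivery across clients by recipient using a stable hash. */
#include <stdint.h>
#include <stdio.h>

/* FNV-1a: any stable string hash works here. */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 1469598103934665603ULL;
    for (; *s; s++) {
        h ^= (uint8_t)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

static int pick_client(const char *recipient, int num_clients)
{
    return (int)(fnv1a(recipient) % (uint64_t)num_clients);
}

int main(void)
{
    const char *rcpts[] = { "alice@example.com", "bob@example.com",
                            "carol@example.com" };
    for (int i = 0; i < 3; i++)
        printf("%s -> delivery client %d\n",
               rcpts[i], pick_client(rcpts[i], 3));
    return 0;
}
```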
[Plot: delivery throughput (mails/s); the network bandwidth limit is marked.]
LevelDB: Fail-over to backup
[Timeline diagram: LevelDB instances running on LibFS on a primary and a backup, with fail-over coordinated by the cluster manager and shown in numbered steps 1-4; under Assise the backup is a hot backup. Annotated times include 3,700 ms and 15,600 ms, with 23,700 ms and 43,400 ms reported for the corresponding steps under Ceph.]
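The fail-over sequence summarized above can be sketched as a heartbeat check run by the cluster manager that promotes the cache-hot backup once the primary's heartbeat times out. All names and the timeout value are hypothetical; Assise's actual cluster-manager implementation is not shown.

```c
/* Sketch of cluster-manager-driven fail-over to a cache-hot backup. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define HEARTBEAT_TIMEOUT_MS 1000   /* assumption */

struct node {
    const char *name;
    uint64_t    last_heartbeat_ms;  /* last heartbeat seen */
    bool        is_primary;
};

static uint64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Called periodically by the cluster manager. */
void check_failover(struct node *primary, struct node *backup)
{
    if (now_ms() - primary->last_heartbeat_ms > HEARTBEAT_TIMEOUT_MS) {
        /* Primary is presumed dead.  The backup already holds a hot,
         * crash-consistent NVM cache (the replicated update log), so
         * promotion does not rebuild state from cold storage. */
        primary->is_primary = false;
        backup->is_primary  = true;
        printf("fail-over: %s -> %s\n", primary->name, backup->name);
    }
}
```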
• Low IO latency
• High availability
• 22× lower write latency, 103× faster fail-over, 6× better scalability
• Comparisons against Ceph using real applications (Postfix, LevelDB, …)
Source code available at: https://github.com/ut-osa/assise
Questions? Email [email protected]