Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space
KernelTLV Filesystem Supersession
Part I File Systems: why, how and where (another slide deck)
Part II Emerging PM HW & SW stack
Part III ZUFS – PM-based file systems in user space
© 2017 NetApp, Inc. All rights reserved.
KernelTLV Meetup, Nov. 14th 2017
Part II Persistent Memory (PM) – HW & SW implications
Emerging PM/NVDIMM devices, the value they bring to applications and
how they revolutionize the storage stack
About
~12,000 employees
50+ countries
Only Top 5 vendor that is rapidly growing
Celebrating 25 Years
Founded in 1992
NetApp acquired @ June 2017
TLV area based
Recruiting FS Dev.
+1
PM Software Pioneer
Since 2014
Groundbreaking latencies
Storage Media Generations
PM marries the best of both worlds: Memory Speed + Storage Persistency
HDD, Flash, NVDIMM / PM
IOPS (even if random…)
Latency (even under load…)
Definitions
Rounded latency numbers, under typical load
SCM (Storage Class Memory)
Byte-addressable media @ near-memory speed
<1 µs
PM (Persistent Memory)
Byte-addressable device @ near-memory speed, on the memory bus
PM-based Storage - Question Traditional Assumptions
Byte-addressable media
Block-addressable wrapper
SW layers, Network, SW caching, Block abstraction
?
Memory Vs. Storage
Block wrapper
PM-based FS
Application
Block-based FS
Page Cache
bio
PM-based Software Approaches
Application Re-written Application
NPM
DAX-enabled FS
SW reuse Performance
App
SW Infrastructure
HW
Linux Kernel Enablers
DAX Enabled FS: mount option “-o dax”
BTT (Block, Atomic): built-in kernel driver nd_btt.ko. Source: drivers/nvdimm/btt.c
PMEM (Direct Access): built-in kernel driver nd_pmem.ko. Source: drivers/nvdimm/pmem.c
NFIT Core: built-in kernel driver core.ko. Source: drivers/nvdimm/core.c
Linux 4.1+ subsystems added support for NVDIMMs. Mostly stable from 4.4
NVDIMM modules are presented as device links: /dev/pmem0, /dev/pmem1
QEMU support
Can also refer to the kTLV Meetup from 2016 - https://www.youtube.com/watch?v=FVrgt9JtcwQ
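As a rough illustration of the "-o dax" enabler, the sketch below scans text in the /proc/mounts format and lists the file systems mounted with DAX. The helper name dax_mounts and the /mnt/pmem mount point in the usage note are hypothetical, and treating "dax=always" as an equivalent spelling on newer kernels is an assumption.

```python
def dax_mounts(mounts_text):
    """Return (device, mountpoint, fstype) for entries mounted with DAX.

    mounts_text uses the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        opts = options.split(",")
        # "dax" is the option named on the slide; "dax=always" is assumed
        # to be the spelling reported by newer kernels.
        if "dax" in opts or "dax=always" in opts:
            found.append((device, mountpoint, fstype))
    return found


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    for entry in dax_mounts(text):
        print(entry)
```

For example, a line such as "/dev/pmem0 /mnt/pmem ext4 rw,relatime,dax 0 0" would be reported, while a mount without the dax option would not.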
PM-based Software Approaches
Applications
Storage semantics: read/write, fsync
Memory semantics: mmap, ld/st, msync
Block-based FS, Page Cache, bio, Block wrapper
PM-based FS
DAX-enabled FS
NPM
NVDIMM Driver
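The two access styles on this slide (read/write + fsync versus mmap, ld/st + msync) can be sketched from an application's point of view as follows. This is a minimal sketch for a Unix-like system, not code from the talk: the function names are hypothetical, and an ordinary temporary file stands in for a file on a DAX-enabled file system, so it shows only the programming model and not PM performance.

```python
import mmap
import os
import tempfile


def write_storage_semantics(path, offset, data):
    """Storage semantics: write through a system call, then fsync."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, data, offset)  # data is copied via the kernel I/O path
        os.fsync(fd)                 # ask for durability
    finally:
        os.close(fd)


def write_memory_semantics(path, offset, data):
    """Memory semantics: map the file, store into it, then msync."""
    fd = os.open(path, os.O_RDWR)
    try:
        # Length 0 maps the whole file; on Unix the mapping is shared.
        with mmap.mmap(fd, 0) as mapping:
            mapping[offset:offset + len(data)] = data  # plain stores
            mapping.flush()                            # msync under the hood
    finally:
        os.close(fd)


def demo():
    """Write with both styles into a 4 KB scratch file and read back."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * 4096)
        path = f.name
    try:
        write_storage_semantics(path, 0, b"block")
        write_memory_semantics(path, 64, b"bytes")
        with open(path, "rb") as f:
            content = f.read()
        return content[:5], content[64:69]
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

On a file system mounted with DAX, the mapped path is the one that can reach the media directly with loads and stores, which is why the slide separates it from the read/write path.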
PM-based Software Approaches
NVDIMM Driver examples: Windows Server 2016, Linux 4.4 and above, Ubuntu 16.04 LTS, RHEL 7.3, Fedora 23, SLES 12 SP2
DAX-enabled FS examples: NTFS-DAX, REFS-DAX, XFS-DAX, EXT4-DAX
PM-based FS examples: NOVA, LUMFS, SIMFS, HINFS, Plexistor M1FS
NPM examples: NVML 1.3
(*) Huge variance in features and stability
(*) Good portability
Part III ZUFS - Zero-copy User-mode FS
New style user-mode filesystems that require:
- Extremely Low-Latency
- Synchronous & DAX
From VFS to ZUFS
Why Userspace?
• Resiliency
• Ease of development
• Externals (e.g. compress, encrypt)
• Licensing
• Market requirements (avoid kernel modules)
ZUFS and FUSE are complementary tools
Motivation
FUSE is great for HDDs and ok(ish) for SSDs, but not suitable for PMEM
[Figure labels: HDD, Flash, SCM, Memory; TCP, RDMA; FUSE, ?; axes: Latency, $/GB]
Design Assumptions: FUSE vs. ZUFS
Typical media
  FUSE: Built for HDDs & extended to Flash
  ZUFS: Built for PM/NVDIMMs and DRAM
SW perf. goals
  FUSE: Secondary (high-latency media); async I/O throughput
  ZUFS: SW is the bottleneck; latency is everything
SW caching
  FUSE: Slow media -> rely on OS Page Cache
  ZUFS: Near-memory speed media -> bypass OS Page Cache
Access method
  FUSE: I/O only
  ZUFS: I/O and mmap (DAX)
Cost of redundant copy / context switch
  FUSE: Negligible
  ZUFS: The bottleneck -> avoid copies, queues & remain on core
Latency penalty under load
  FUSE: 100s of µs
  ZUFS: 3-4 µs
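To see why the latency penalty matters so much for synchronous I/O, a back-of-the-envelope calculation helps: a thread that issues one operation at a time cannot exceed 1 / latency operations per second. The sketch below applies that bound to the penalties quoted in the comparison. It counts only the software penalty and ignores media time, and taking 100 µs as the low end of "100s of µs" is an assumption.

```python
def max_sync_ops_per_second(latency_us):
    """Upper bound on back-to-back synchronous ops/s for a single thread."""
    return 1_000_000 / latency_us


if __name__ == "__main__":
    cases = [
        ("FUSE, 100 us (assumed low end of '100s of us')", 100.0),
        ("ZUFS, 4 us", 4.0),
        ("ZUFS, 3 us", 3.0),
    ]
    for label, latency_us in cases:
        bound = max_sync_ops_per_second(latency_us)
        print(f"{label}: at most {bound:,.0f} ops/s per thread")
```

Under these assumptions a 100 µs penalty caps a single synchronous thread at 10,000 ops/s, while a 4 µs penalty allows up to 250,000 ops/s.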
ZUFS Overview
[Diagram labels: Core 1, Core 2, Core 3, Core 4]
Kernel to Userspace
ZUFS – Kernel Zoom-in
Preliminary Results FUSE Vs. ZUFS (PM Media)
• Measured on a dual-socket Intel Xeon 2650v4 (48 HW threads), DRAM-backed PMEM type
• Random 4KB DirectIO write(ish) access
Architecture
APP
zt-vma
PPP
App pages mapped into Server VM, unmapped on return
ZUS: Zu Server
ZUF: Zu Feeder
ZT per CPU ...
kernel
ZT: one ZUFS thread per CPU, with affinity to a single CPU (thread_fifo/rr)
Special ZUFS communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
ZT-vma: an mmap'ed 4M vma per ZT, used as the zero-copy communication area
IOCTL_ZU_WAIT_OPT: the thread sleeps in the kernel, waiting for an operation
On app I/O, the current CPU's ZT is selected and the app pages are mapped into the ZT-vma. The server thread is released with an operation.
After execution, the ZT returns to the kernel (IOCTL_ZU_WAIT_OPT), the app is released, and the server waits for a new operation.
On exit (or server crash) the file is closed and the kernel cleans up all resources.
Async operation is also supported: the server can return EAGAIN.
The server will later complete the operation asynchronously, and the app will be woken up.
Application mmap (DAX) works in the opposite direction: ZUS exposes pages (opt_get_data_block) into the app VM.
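The synchronous request flow described above can be modelled with a toy user-space simulation. This is only a sketch of the hand-off pattern, not the ZUFS API: the class names ZtSim and DispatcherSim and the fill_handler callback are hypothetical, a queue stands in for the thread sleeping in IOCTL_ZU_WAIT_OPT, a shared bytearray stands in for app pages mapped into the ZT-vma, and CPU affinity and the async (EAGAIN) path are not modelled.

```python
import queue
import threading


class ZtSim:
    """Toy stand-in for one per-CPU server thread (ZT)."""

    def __init__(self, cpu, handler):
        self.cpu = cpu
        self.handler = handler
        self.ops = queue.Queue(maxsize=1)
        self.thread = threading.Thread(target=self._serve, daemon=True)
        self.thread.start()

    def _serve(self):
        while True:
            op = self.ops.get()  # models sleeping until an operation arrives
            if op is None:       # shutdown marker
                return
            buf, done = op
            # The handler works on the caller's buffer object directly,
            # modelling the zero-copy mapping of app pages.
            done["result"] = self.handler(self.cpu, buf)
            done["event"].set()  # release the waiting application

    def submit(self, buf):
        done = {"event": threading.Event(), "result": None}
        self.ops.put((buf, done))  # hand the operation to the server thread
        done["event"].wait()       # the app blocks until the op completes
        return done["result"]

    def stop(self):
        self.ops.put(None)
        self.thread.join()


class DispatcherSim:
    """Routes each request to the ZT of the CPU the app is running on."""

    def __init__(self, ncpus, handler):
        self.zts = [ZtSim(cpu, handler) for cpu in range(ncpus)]

    def app_io(self, current_cpu, buf):
        return self.zts[current_cpu].submit(buf)

    def shutdown(self):
        for zt in self.zts:
            zt.stop()


def fill_handler(cpu, buf):
    """Hypothetical server operation: fill the app buffer in place."""
    buf[:5] = b"hello"
    return cpu


if __name__ == "__main__":
    sim = DispatcherSim(ncpus=4, handler=fill_handler)
    app_buf = bytearray(8)
    served_by = sim.app_io(current_cpu=1, buf=app_buf)
    print(served_by, bytes(app_buf))
    sim.shutdown()
```

Keeping one server thread per CPU and selecting it by the caller's current CPU is what lets a request stay on the same core, which is the property the design emphasizes.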
Perf. Optimizations - 1 MMAP_LOCAL_CPU
• mm patch to allow single-core TLB invalidate (in the common case)
[Figure: ZUFS w/wo mm patch. Latency [us] (0 to 30) vs. IOPS (0 to 2,000,000); series: ZUFS_unpatched_mm, ZUFS_patched_mm]
Perf. Optimizations - 2
• scheduler patch to allow efficient context switch on same core (Relay Object)
Unimplemented, no perf. results
Questions