Zoned Namespaces - SNIA


Page 1: Zoned Namespaces - SNIA

Zoned Namespaces Use Cases, Standard and Linux Ecosystem

SDC SNIA EMEA 20 | Javier González – Principal Software Engineer / R&D Center Lead

Page 2: Zoned Namespaces - SNIA


Why do we need a new interface?

[Figure: Log-on-log in today's stack. The traditional SSD FTL implements a log-structured mapping on the device (L2P address mapping, data placement, metadata management, I/O scheduling, media management, garbage collection). Above the block layer, log-structured file systems (e.g., F2FS, XFS, ZFS) and log-structured applications (e.g., RocksDB) each re-implement data placement, address mapping (L2L), metadata management and garbage collection, talking to the device only through Read / Write / Trim / Hints.]

https://blocksandfiles.com/2019/08/28/nearline-disk-drives-ssd-attack/

SSDs are already mainstream
• Great performance (bandwidth / latency) – combination of NAND + NVMe
• Easy to deploy – direct replacement for HDDs
• Acceptable price ($/GB)

But we have 3 recurrent problems:

1. Log-on-log problem (WAF + OP) [1]
   - Remove redundancies
   - Leverage log data structures
2. Cost gap with HDDs
   - Higher bit count (QLC)
   - Reduce DRAM
   - Reduce OP & WAF
3. Multi-tenancy everywhere
   - Address noisy neighbors
   - Provide QoS

[1] J. Yang, N. Plasson, G. Gillis, N. Talagala, and S. Sundararaman. Don't stack your log on my log. In 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), 2014.

Page 3: Zoned Namespaces - SNIA


ZNS: Basic Concepts

[Figure: The LBA address space (LBA 0 … LBA m-1) divided into zones (Zone 0 … Zone n-1). Within a zone, a write pointer advances from ZSLBA to ZSLBA + ZCAP - 1; writes are strictly sequential and the zone must be reset before it can be rewritten. Below, different device configurations (Config. A … Config. Z) map zones onto NAND dies (Die 0 … Die k-1) in different ways.]

Divide the LBA address space into fixed-size zones
• Write constraints (a minimal host-side sketch follows the state-machine figure below)
  - Write strictly sequentially within a zone (write pointer)
  - Explicitly reset the write pointer before rewriting a zone
• No zone mapping constraints
  - Zone physical location is transparent to the host
  - Open design space for different configurations

Zone state machine
• Zone transitions managed by the host
  - Direct commands or zone boundaries
• Zone transitions triggered by the device
  - Notification to the host through an exception

[Figure: Zone state machine — ZSE: Empty, ZSIO: Implicitly Open, ZSEO: Explicitly Open, ZSC: Closed, ZSF: Full, ZSRO: Read Only, ZSO: Offline. Open and active resources bound how many zones may be open / active at a time. LBA mapping happens within a zone; zone mapping happens in the device.]
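To make the write-pointer and reset rules concrete, here is a minimal host-side sketch using the Linux zoned block UAPI in <linux/blkzoned.h> (the device path is an assumption and error handling is abbreviated):

```c
/* Minimal sketch: report the first zone of a zoned block device and reset it.
 * Assumes the Linux zoned block UAPI; /dev/nvme0n1 is a hypothetical ZNS namespace. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* BLKREPORTZONE fills a header followed by nr_zones zone descriptors. */
    size_t sz = sizeof(struct blk_zone_report) + sizeof(struct blk_zone);
    struct blk_zone_report *rep = calloc(1, sz);
    rep->sector = 0;          /* start reporting from the beginning of the device */
    rep->nr_zones = 1;        /* only the first zone */
    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    struct blk_zone *z = &rep->zones[0];
    printf("zone: start=%llu len=%llu wp=%llu cond=%u\n",
           (unsigned long long)z->start, (unsigned long long)z->len,
           (unsigned long long)z->wp,    /* next writable sector in the zone */
           z->cond);                     /* zone state (empty, open, full, ...) */

    /* Writes must land exactly at z->wp; to rewrite the zone, reset it first. */
    struct blk_zone_range range = { .sector = z->start, .nr_sectors = z->len };
    if (ioctl(fd, BLKRESETZONE, &range) < 0) perror("BLKRESETZONE");

    free(rep);
    close(fd);
    return 0;
}
```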

Page 4: Zoned Namespaces - SNIA


ZNS: Advanced Concepts — Append Command
• Concept: an implementation of nameless writes [1]
• Benefits
  - Increase the queue depth (QD) in a single zone to improve parallelism
• Host changes
  - Changes needed to the block allocator in the application / file system
  - Full reimplementation of LBA-based features in the submission path (e.g., snapshotting)

[1] Y. Zhang, L. P. Arulraj, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. De-indirection for flash-based SSDs with nameless writes. FAST, 2012.

Write path
• Requires QD = 1
  - Reminder: no ordering guarantees in NVMe

Append path (a toy model follows the figure below)
• QD <= X, where X = number of LBAs in the zone

[Figure: Write path vs. append path. With regular writes the host must keep QD = 1 per zone, because NVMe gives no ordering guarantees and every write must land exactly at the write pointer. With Zone Append, many commands can be queued (SQ) against the same zone; the device places each at the current write pointer and returns the assigned LBA (e.g., ZSLBA+3, ZSLBA+4, …, ZSLBA+ZCAP-1) in the completion (CQ).]
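A toy model of the append semantics above (hypothetical helper names, not the NVMe command set): the device, not the host, resolves the final LBA, which is why several appends to the same zone can be in flight at once.

```c
/* Toy model of Zone Append semantics (hypothetical names, not the NVMe API). */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct zone {
    uint64_t zslba;            /* zone start LBA */
    uint64_t zcap;             /* zone capacity in LBAs */
    _Atomic uint64_t wp;       /* write pointer, owned by the "device" */
};

/* Device side: atomically claim the next LBA(s); the assigned LBA is what a
 * real device would return in the append completion entry. */
static int64_t zone_append(struct zone *z, uint64_t nlb)
{
    uint64_t lba = atomic_fetch_add(&z->wp, nlb);
    if (lba + nlb > z->zslba + z->zcap)
        return -1;             /* zone full: command fails (toy model, no rollback) */
    return (int64_t)lba;
}

int main(void)
{
    struct zone z = { .zslba = 0, .zcap = 8, .wp = 0 };
    /* Several appends can be in flight at once (QD > 1); each completion
     * tells the host where its data actually landed. */
    for (int i = 0; i < 3; i++)
        printf("append %d -> LBA %lld\n", i, (long long)zone_append(&z, 1));
    return 0;
}
```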

Page 5: Zoned Namespaces - SNIA


ZNS: Host Responsibilities

ZNS requires changes to the host
• First step: changes to file systems
  - Advantages in log-structured file systems (e.g., F2FS, XFS, ZFS)
  - Still best-effort from the application's perspective
  - First important step for adoption
• Second step: changes to applications
  - Only makes sense for log-structured applications (e.g., LSM databases such as RocksDB)
  - Real benefits in terms of WAF, OP and data movement (GC)
  - Need a way to minimize application changes! (Spoiler alert: xNVMe)

[Figure: Host / device split. Traditional SSD FTL: the device owns sector mapping (L2P), I/O scheduling, metadata management, media and exception management, garbage collection and over-provisioning; the host issues Read / Write / Trim / Hints. ZNS SSD: the device keeps a thin zone FTL (zone mapping (L2P), I/O scheduling, zone metadata, media and exception management), while the host takes over zone address mapping (L2L), zone data placement, zone garbage collection and zone metadata management, using Read / Write / Append / Reset / Commit I/O plus zone-management admin commands and exception handling.]

Page 6: Zoned Namespaces - SNIA


ZNS Extensions: Zone Random Write Area (ZRWA)

Concept
• Expose the write buffer in front of a zone to the host

Benefits
• Enables in-place updates within the ZRWA
  - Reduces WAF due to metadata updates
  - Aligns with the metadata layout of many file systems
  - Simplifies recovery (no changes to the metadata layout)
• Allows writes at QD > 1
  - No changes needed to the host stack (as opposed to append)
  - ZRWA size balanced against expected performance

Operation (a small model follows the figure below)
• Writes are placed into the ZRWA
  - No write pointer constraint
• In-place updates allowed within the ZRWA window
• The ZRWA can be committed explicitly
  - Through a dedicated command
• The ZRWA can be committed implicitly
  - When writing beyond sizeof(ZRWA)

[Figure: Each zone (Zone 0 … Zone n-1 across LBA 0 … LBA m-1) exposes a ZRWA window ahead of its write pointer. Reads, writes and in-place updates are allowed inside the window; the window advances on an explicit commit command or implicitly when writes go past the ZRWA size.]
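A minimal sketch of the ZRWA rules described above (a host-side model with hypothetical names, not the NVMe interface): writes and in-place updates are accepted anywhere inside the window, and the write pointer only advances on commit.

```c
/* Toy model of a Zone Random Write Area (hypothetical names, not the NVMe API). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct zrwa_zone {
    uint64_t zslba;      /* zone start LBA */
    uint64_t zcap;       /* zone capacity */
    uint64_t wp;         /* committed write pointer */
    uint64_t zrwa_size;  /* size of the random-write window, in LBAs */
};

/* A write (or in-place update) is valid anywhere in [wp, wp + zrwa_size). */
static bool zrwa_write(struct zrwa_zone *z, uint64_t lba, uint64_t nlb)
{
    if (lba < z->wp || lba + nlb > z->wp + z->zrwa_size)
        return false;                       /* outside the ZRWA window */
    if (lba + nlb > z->zslba + z->zcap)
        return false;                       /* past the end of the zone */
    return true;                            /* buffered in the ZRWA */
}

/* Explicit commit: advance the write pointer up to 'lba', flushing the window. */
static void zrwa_commit(struct zrwa_zone *z, uint64_t lba) { z->wp = lba; }

int main(void)
{
    struct zrwa_zone z = { .zslba = 0, .zcap = 64, .wp = 0, .zrwa_size = 8 };
    printf("write @2: %d\n", zrwa_write(&z, 2, 1));   /* ok, inside window */
    printf("rewrite @2: %d\n", zrwa_write(&z, 2, 1)); /* in-place update, ok */
    printf("write @9: %d\n", zrwa_write(&z, 9, 1));   /* outside window: rejected */
    zrwa_commit(&z, 4);                               /* explicit commit to LBA 4 */
    printf("write @9 after commit: %d\n", zrwa_write(&z, 9, 1)); /* window moved: ok */
    return 0;
}
```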

Page 7: Zoned Namespaces - SNIA


ZNS Extensions: Simple Copy

Concept
• Offload the copy operation to the SSD
• "Simple" copy, because SCSI XCOPY became too complex

Benefits
• Data is moved by the SSD controller
  - No data movement across the interconnect (e.g., PCIe)
  - No CPU cycles spent on data movement
• Especially relevant for ZNS, where the host drives GC

Operation (see the sketch after the figure below)
• The host forms and submits a Simple Copy command
  - Selects a list of source LBA ranges
  - Selects a destination as a single LBA + length
• The SSD moves data from the sources to the destination
  - Error-model complexity depends on scope

Scope
• Under discussion in NVMe

[Figure: Simple Copy gathering data between zones (e.g., from Zone 1 and Zone 2 into Zone 3) entirely inside the device.]
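A sketch of what a host-side Simple Copy request could look like while the scope is still under discussion (hypothetical structures, not the NVMe wire format): a list of source LBA ranges plus a single destination.

```c
/* Hypothetical host-side representation of a Simple Copy request (not the
 * NVMe wire format): scattered valid data is gathered into one destination. */
#include <stdint.h>
#include <stdio.h>

struct lba_range {
    uint64_t slba;   /* starting LBA of this source range */
    uint32_t nlb;    /* number of logical blocks */
};

struct simple_copy_req {
    struct lba_range src[8];  /* source ranges (e.g., valid data left in two zones) */
    uint32_t nr_src;
    uint64_t dst_slba;        /* single destination LBA (e.g., write pointer of a fresh zone) */
};

int main(void)
{
    /* Host GC: gather three valid extents into the destination zone. */
    struct simple_copy_req req = {
        .src = { { .slba = 0x1000, .nlb = 64 },
                 { .slba = 0x1400, .nlb = 32 },
                 { .slba = 0x2200, .nlb = 128 } },
        .nr_src = 3,
        .dst_slba = 0x8000,
    };

    uint64_t total = 0;
    for (uint32_t i = 0; i < req.nr_src; i++)
        total += req.src[i].nlb;
    printf("copy %llu blocks into LBA 0x%llx without crossing PCIe\n",
           (unsigned long long)total, (unsigned long long)req.dst_slba);
    return 0;
}
```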

Page 8: Zoned Namespaces - SNIA


ZNS: Archival Use Case

Goal: reduce TCO
• Reduce WAF
• Reduce OP
• Reduce DRAM usage
• Facilitate consumption of QLC

Adoption
• Tiering of cold storage
  - Denmark cold: ZNS SSDs
  - Finland cold: high-capacity SMR HDDs
  - Mars cold: tape
• Leverage the SMR ecosystem
  - Build on top of the same abstractions
  - Expand the use cases

[Figure: Archival configuration. Large zones striped across all dies (Die 0 … Die k-1), with a metadata zone holding the zone mapping. Each application owns whole zones (e.g., Application 1: Zones 0, 7, 14, 245; Application 2: Zones 5, 10, 12, 351). Large zones mean large erase blocks, large I/O sizes and few "streams"; data placement and scheduling stay internal to the device. One zone per application targets low GC.]

Configuration optimized for cold data
• Large zones
• Semi-immutable per-zone data (all or nothing)
• Minimize metadata and mapping tables
• Need for large QDs (Append or ZRWA)

Architecture properties (see the arithmetic sketch below)
• Host GC → less WAF and OP
• Zone mapping table → less DRAM (break the 1 GB DRAM per 1 TB rule)
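To see where the "1 GB of DRAM per 1 TB" rule of thumb comes from, here is a back-of-the-envelope comparison (the 4 KiB mapping granularity, 4-byte entries and 1 GiB zone size are illustrative assumptions):

```c
/* Back-of-the-envelope DRAM mapping-table sizes (illustrative assumptions). */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t capacity   = 1ULL << 40;  /* 1 TiB namespace */
    const uint64_t page_size  = 4096;        /* conventional FTL mapping granularity */
    const uint64_t entry_size = 4;           /* bytes per L2P entry */

    /* Traditional page-mapped FTL: one entry per 4 KiB page. */
    uint64_t l2p_bytes = capacity / page_size * entry_size;

    /* ZNS: one entry per zone (assume 1 GiB zones, 8-byte entries). */
    const uint64_t zone_size = 1ULL << 30;
    uint64_t zone_map_bytes  = capacity / zone_size * 8;

    printf("page-mapped FTL : %llu MiB of DRAM\n",
           (unsigned long long)(l2p_bytes >> 20));      /* ~1024 MiB -> the 1 GB / 1 TB rule */
    printf("zone-mapped ZNS : %llu KiB of DRAM\n",
           (unsigned long long)(zone_map_bytes >> 10)); /* a few KiB */
    return 0;
}
```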

Page 9: Zoned Namespaces - SNIA


ZNS: I/O Predictability Use Case

Goal: support multi-tenant workloads
• Inherit the archival benefits:
  - ↓WAF, ↓OP, and ↓DRAM
• Support QoS policies
  - Provide isolation properties across zones
  - Enable host I/O scheduling
  - Allow more control over I/O submission / completion

[Figure: Multi-tenant configuration. Small zones mapped onto individual dies (Die 0 … Die k-1), with a metadata zone holding the zone mapping. Each application owns an isolation domain of zones striped by the host (e.g., Application 1: Zones 0, k, 2k, …, n-k; Application 2: Zones 1, k+1, 2k+1, …, n-1). Small zones mean small erase blocks, variable I/O sizes and tens to hundreds of "streams"; data placement, striping and I/O scheduling move to the host. One isolation domain per application targets low GC.]

Configuration optimized for multi-tenancy (a striping sketch follows)
• Zones grouped into "isolation domains"
• Host data placement / striping
• Host I/O scheduling

Inherit the archival architecture properties
• Host GC → less WAF and OP
• Zone mapping table → less DRAM (break the 1 GB DRAM per 1 TB rule)
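A minimal sketch of host-managed striping across the zones of one isolation domain (the domain layout and stripe size are illustrative assumptions): the host, not the device, decides which zone and offset each logical block lands on.

```c
/* Host-side striping across the zones of one isolation domain
 * (illustrative layout, not a structure defined by ZNS). */
#include <stdint.h>
#include <stdio.h>

struct isolation_domain {
    uint64_t zone_ids[4];  /* zones owned by this tenant, e.g., 0, k, 2k, 3k */
    uint32_t nr_zones;
    uint64_t stripe_lbas;  /* stripe unit, in logical blocks */
};

/* Map a tenant-logical block to (zone, offset within zone). */
static void map_block(const struct isolation_domain *d, uint64_t lblk,
                      uint64_t *zone_id, uint64_t *zone_off)
{
    uint64_t stripe = lblk / d->stripe_lbas;
    *zone_id  = d->zone_ids[stripe % d->nr_zones];
    *zone_off = (stripe / d->nr_zones) * d->stripe_lbas + lblk % d->stripe_lbas;
}

int main(void)
{
    struct isolation_domain app1 = {
        .zone_ids = { 0, 512, 1024, 1536 }, .nr_zones = 4, .stripe_lbas = 256 };

    for (uint64_t lblk = 0; lblk < 2048; lblk += 512) {
        uint64_t z, off;
        map_block(&app1, lblk, &z, &off);
        printf("logical block %4llu -> zone %llu, offset %llu\n",
               (unsigned long long)lblk, (unsigned long long)z,
               (unsigned long long)off);
    }
    return 0;
}
```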

Page 10: Zoned Namespaces - SNIA


Linux Stack: Zoned Block Framework

Purpose
• Built for SMR drives
• Adds support for write constraints
• Exposed to applications through the block layer (a sysfs discovery sketch follows the figure below)

Kernel status
• Merged in 4.9
• Block / drivers
  - Support in the block layer
  - Support in SCSI / ATA
  - Support in null_blk
• File systems
  - F2FS
  - Btrfs (patches posted)
  - ZFS / XFS (ongoing)
• Libraries
  - libzbc
• Tools
  - util-linux, fio

[Figure: Zoned block stack for SMR. A SAS ZBC / SATA ZAC HDD sits below the SCSI / ATA drivers and the block layer, which exposes it as a zoned block device. In kernel space, zoned-aware file systems (e.g., F2FS, Btrfs, ZFS) and dm-zoned consume it; in user space, applications reach it directly or through libzbc.]
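ZNS SSDs show up through this same framework as zoned block devices, so an application can discover the zoned model and geometry from the standard sysfs attributes; a minimal sketch (the device name is an assumption):

```c
/* Check whether a block device is zoned and read its zone geometry from sysfs
 * (standard zoned block framework attributes; the device name is hypothetical). */
#include <stdio.h>
#include <string.h>

static int read_attr(const char *dev, const char *attr, char *buf, size_t len)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) { fclose(f); return -1; }
    buf[strcspn(buf, "\n")] = '\0';
    fclose(f);
    return 0;
}

int main(void)
{
    const char *dev = "nvme0n1";        /* hypothetical ZNS namespace */
    char zoned[32], nr_zones[32], chunk[32];

    if (read_attr(dev, "zoned", zoned, sizeof(zoned)))          return 1;
    if (read_attr(dev, "nr_zones", nr_zones, sizeof(nr_zones))) return 1;
    if (read_attr(dev, "chunk_sectors", chunk, sizeof(chunk)))  return 1;

    /* "host-managed" for ZNS / host-managed SMR, "none" for a regular SSD. */
    printf("%s: zoned=%s, nr_zones=%s, zone size=%s sectors\n",
           dev, zoned, nr_zones, chunk);
    return 0;
}
```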

Page 11: Zoned Namespaces - SNIA


Linux Stack: Adding ZNS support

[Figure: Adding ZNS to the Linux stack. The NVMe driver and block layer extend the zoned block framework with ZNS support; mq-deadline handles write ordering. Kernel-space ZNS extensions cover blkzoned, F2FS and zoned XFS, plus blktests; user-space extensions cover xNVMe, SPDK, nvme-cli, fio, QEMU ZNS emulation and a RocksDB ZNS environment on xNVMe. I/O paths: legacy applications use FS I/O over a zoned file system (e.g., F2FS) via io_uring / libaio / sync I/O and IOCTL admin; zoned applications (e.g., RocksDB) use raw block read / write / append plus zone management (e.g., reset) through the zoned block layer, or go through POSIX / xNVMe with SQ / CQ control.]

ZNS support in Zoned Block• Add support ZNS-specific features (kernel & tools)

- Append, ZRWA and Simple Copy- Async() path for new admin commands

• Relax assumptions from SMR- Conventional zones- Zone sizes, state transitions, etc.

I/O paths• Using zoned FS (e.g., F2FS)

- No changes to application• Using zoned application

- Support in application backend- Explicit zone management• Data placement, sched., zone mgmt.

Page 12: Zoned Namespaces - SNIA


xNVMe: Supporting Non-Block Applications

xNVMe: a library to enable non-block applications across I/O backends
• Common API and helpers
• Linux kernel path: psync, libaio, io_uring
• SPDK: user space on Linux and FreeBSD
• Windows: future work – a new backend required

Benefits (a sketch of the backend-abstraction idea follows)
• Portable API across I/O backends and OSs (minimal benefit for block devices)
• Common API for new I/O interfaces (rewrite the application backend only once)
• File-system semantics on raw devices (block & non-block)
• No performance impact
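A sketch of the idea behind xNVMe, not its actual API: the application is written once against a common command interface, and each I/O backend (psync, libaio, io_uring, SPDK, ...) only fills in an ops table.

```c
/* Illustration of the concept behind xNVMe (hypothetical names, not its API):
 * one command interface, pluggable backends. */
#include <stdint.h>
#include <stdio.h>

struct io_backend {
    const char *name;
    int (*read)(uint64_t slba, uint32_t nlb, void *buf);
    int (*write)(uint64_t slba, uint32_t nlb, const void *buf);
    int (*zone_append)(uint64_t zslba, uint32_t nlb, const void *buf, uint64_t *lba_out);
};

/* The application is written once against this interface ... */
static int app_flush_record(struct io_backend *be, uint64_t zslba, const void *rec)
{
    uint64_t lba;
    int ret = be->zone_append(zslba, 1, rec, &lba);
    if (ret == 0)
        printf("[%s] record appended at LBA %llu\n", be->name, (unsigned long long)lba);
    return ret;
}

/* ... and each backend only implements the ops table (stubbed here). */
static int stub_read(uint64_t s, uint32_t n, void *b) { (void)s; (void)n; (void)b; return 0; }
static int stub_write(uint64_t s, uint32_t n, const void *b) { (void)s; (void)n; (void)b; return 0; }
static int stub_append(uint64_t z, uint32_t n, const void *b, uint64_t *o) { (void)n; (void)b; *o = z; return 0; }

int main(void)
{
    struct io_backend io_uring_be = { "io_uring", stub_read, stub_write, stub_append };
    struct io_backend spdk_be     = { "spdk",     stub_read, stub_write, stub_append };
    char rec[512] = { 0 };
    app_flush_record(&io_uring_be, 4096, rec); /* same application code ... */
    app_flush_record(&spdk_be,     4096, rec); /* ... different backend */
    return 0;
}
```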

Page 13: Zoned Namespaces - SNIA


Takeaways

ZNS is a new interface in NVMe
• Builds on top of Namespace Types (e.g., KV)
• Driven by vendors and early customers

ZNS targets 2 main use cases
• Archival (the original use case)
• I/O predictability (ZNS extensions)

The host needs changes
• Basic functionality builds on top of the existing zoned block framework
  - Respect the write constraints
  - Follow the zone state machine: Reset -> Open -> Closed -> Full / -> Offline / -> Read Only
• Append needs major changes to block allocators in the I/O path
  - Might work for some file systems / applications, but not for all
• ZRWA gives an alternative to append
  - Same functionality, no changes to the block allocator in the I/O path

We are building an open ecosystem
• Linux stack to be released on TP ratification
  - xNVMe, Linux kernel, tools (e.g., nvme-cli), fio
  - RocksDB backend on xNVMe
• Sponsoring TPs for ZNS extensions in NVMe

Page 14: Zoned Namespaces - SNIA