TRANSCRIPT
2017 Storage Developer Conference. © NetApp, Inc. All Rights Reserved.
SMORE: A Cold Data Object Store for SMR Drives
David Slik, NetApp, Inc.
Session Overview
A brief overview of SMR
What is SMORE
Architectural elements
Implementation results
Lessons learned
Questions & Answers
Shingled Magnetic Recording (SMR)
A technique to increase hard drive density: 2.3x was originally cited [1], but real-world gains have been around 1.5x
Trades reduced write flexibility for increased capacity
The concept has been around for over a decade
Widely productized (many millions of drives shipped)
Products often implement SMR "under the covers", with firmware hiding limitations from higher-level software (drive-managed SMR)
[1] “High density data storage using shingle-write”, Proceedings of the IEEE International Magnetics Conference, 2009
Not a new technology
Shingled Magnetic Recording (SMR)
Tracks on conventional hard drives are separated by "guard bands"
The guard band prevents writes from affecting adjacent tracks
Guard bands take up physical space on the hard drive
How does it work?
[Diagram: tracks N and N+1 separated by a guard band; the write head is wider than the read head]
Shingled Magnetic Recording (SMR)
SMR hard drives eliminate guard bands and "shingle" data writes
This allows tracks to be packed closer together, increasing density
Takes advantage of read heads being narrower than write heads
The write region of track N+1 overlaps the write region of track N
How does it work?
[Diagram: shingled tracks N, N+1, N+2; each write region overlaps the previous track, leaving only the narrower read region exposed]
Shingled Magnetic Recording (SMR)
The primary tradeoff is that re-writing data overwrites the adjacent track(s)
The drive is divided into "zones", each of which can be independently appended to and independently erased ("trimmed")
This requires append-only data writes; the current write position is known as the "write pointer" (modeled in the sketch after the diagram below)
How does it work?
[Diagram: shingled tracks N, N+1, N+2 within a zone; writes append at the write pointer]
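The zone model maps naturally onto a small abstraction. Here is a minimal sketch (hypothetical names, not the SMORE code) of a host-managed zone that supports only sequential appends at the write pointer, unrestricted reads below it, and whole-zone resets:

```python
class Zone:
    """Minimal model of a host-managed SMR zone: sequential writes
    advance a write pointer; the only way to reclaim space is a reset."""

    def __init__(self, capacity_blocks, block_size=4096):
        self.capacity = capacity_blocks
        self.block_size = block_size
        self.write_pointer = 0          # next writable block index
        self.blocks = {}                # block index -> data

    def append(self, data: bytes) -> int:
        """Write data at the write pointer; random writes are not allowed."""
        nblocks = -(-len(data) // self.block_size)   # ceiling division
        if self.write_pointer + nblocks > self.capacity:
            raise IOError("zone full")
        start = self.write_pointer
        for i in range(nblocks):
            self.blocks[start + i] = data[i * self.block_size:(i + 1) * self.block_size]
        self.write_pointer += nblocks
        return start                    # block address of the write

    def read(self, block: int, nblocks: int) -> bytes:
        """Reads below the write pointer are unrestricted."""
        if block + nblocks > self.write_pointer:
            raise IOError("read past write pointer")
        return b"".join(self.blocks[b] for b in range(block, block + nblocks))

    def reset(self):
        """'Trim' the zone: discard all data and rewind the write pointer."""
        self.blocks.clear()
        self.write_pointer = 0
```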
Shingled Magnetic Recording (SMR)
Three general classes of SMR drives are defined:
Drive managed (the drive hides SMR complexity)
Host managed (the host is responsible for SMR complexity)
Host aware (a hybrid of the two, where the drive handles violations)
SMORE (SMR Object Repository) assumes host-managed SMR
In the context of this presentation
SMR Object Repository (SMORE)
A project from the Advanced Technology Group at NetApp
Primary research objectives included:
Identify whether SMR drives can achieve hardware limits for cold storage workloads
Provide an experimental platform for investigating novel fault and error recovery techniques
Results published at: https://arxiv.org/pdf/1705.09701.pdf
Project Overview
SMR Object Repository (SMORE)
Project assumptions:
Flash is for hot data, disk is for cool data, tape is for cold data
Hot/warm data will be served from the flash tier, so cool reads are random
Cool data workloads are primarily object-based (a sketch of this interface follows the list):
Write (PUT) once, with complete object replacement (versioning)
Read (GET) infrequently, with range requests
Delete (DELETE) infrequently; objects are long-lived
Streaming reads and writes at aggregate drive throughput
Flash enables new architectures, so it can be used judiciously
User-space software
Project Overview
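These assumptions reduce to a small object interface. A hypothetical sketch of the surface such a store would expose (the names are illustrative, not SMORE's actual API):

```python
from abc import ABC, abstractmethod

class ColdObjectStore(ABC):
    """Object interface matching the assumed cold-data workload:
    whole-object PUTs, infrequent ranged GETs, infrequent DELETEs."""

    @abstractmethod
    def put(self, object_id: str, data: bytes) -> None:
        """Write once; a re-PUT replaces the whole object (versioning)."""

    @abstractmethod
    def get(self, object_id: str, offset: int = 0, length: int = -1) -> bytes:
        """Infrequent reads, with byte-range support."""

    @abstractmethod
    def delete(self, object_id: str) -> None:
        """Infrequent deletes of long-lived objects."""
```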
SMR Object Repository (SMORE)
Project Overview
[Diagram: user requests enter SMORE (user-space software), which writes through an NVRAM FIFO buffer and a user-space SMR driver / SCSI generic driver to an array of SMR drives, and through STDLIB / VFS to flash]
SMR Object Repository (SMORE)
Index
Each object has an identifier
We need a place to translate an ID to its location(s) on disk
This mapping information is stored in a B+ tree on flash
Index information is also stored as part of each object, to enable reconstruction (see the sketch below)
Architectural Elements
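As a sketch of the idea, the index is just a map from object ID to the on-disk locations of each segment's fragments; a plain dict stands in here for the B+ tree SMORE keeps on flash (all names are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FragmentLocation:
    disk: int        # which drive holds the fragment
    zone: int        # zone on that drive
    block: int       # block offset within the zone
    nblocks: int     # fragment length in blocks

class ObjectIndex:
    """Maps object ID -> per-segment fragment locations. A stand-in for
    SMORE's B+ tree on flash; because the same mapping is also written
    next to the data on disk, this index can be discarded and rebuilt
    after a crash rather than replicated."""

    def __init__(self):
        self._map = {}   # object_id -> list of segments, each a list of FragmentLocation

    def insert(self, object_id, segments):
        self._map[object_id] = segments

    def lookup(self, object_id):
        return self._map[object_id]

    def remove(self, object_id):
        self._map.pop(object_id, None)
```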
SMR Object Repository (SMORE)
Segmenting
Objects are split into segments, allowing arbitrarily large objects
Fragmenting
Each segment is erasure-coded into N fragments
Each fragment is stored on a different physical hard drive (sketched in code below)
Architectural Elements
[Diagram: segments 1 and 2 each erasure-coded into fragments A through N, placed on Disk 1 Zone α, Disk 2 Zone β, Disk 3 Zone γ, ... Disk 14 Zone ν, with a digest at the end of the zone]
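A minimal sketch of the segment/fragment flow. SMORE erasure-codes with Reed-Solomon (e.g. 16/18); the sketch below substitutes single XOR parity, which tolerates one lost fragment, purely to keep the example short:

```python
def segment(data: bytes, segment_size: int):
    """Split an object into fixed-size segments (the last may be short)."""
    return [data[i:i + segment_size] for i in range(0, len(data), segment_size)]

def fragment(segment_data: bytes, k: int):
    """Code one segment into k data fragments plus one parity fragment,
    each destined for a different drive. XOR parity stands in for the
    Reed-Solomon coding the real system uses."""
    stripe = -(-len(segment_data) // k)            # bytes per data fragment
    padded = segment_data.ljust(stripe * k, b"\0")
    frags = [padded[i * stripe:(i + 1) * stripe] for i in range(k)]
    parity = bytearray(stripe)
    for frag in frags:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return frags + [bytes(parity)]

def recover(fragments, missing: int) -> bytes:
    """Rebuild a single missing fragment as the XOR of all the others."""
    out = bytearray(len(fragments[(missing + 1) % len(fragments)]))
    for idx, frag in enumerate(fragments):
        if idx != missing:
            for i, b in enumerate(frag):
                out[i] ^= b
    return bytes(out)

# Losing any one fragment (data or parity) is recoverable:
frags = fragment(b"some cold object data", k=4)
assert recover(frags, missing=2) == frags[2]
```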
SMR Object Repository (SMORE)
Each SMR drive has M zones
A system with X drives has a total of X * M zones
Zones are grouped into sets such that no two zones are on the same drive
E.g. {α, β, γ, ... ν}
These are "zone sets"
Zone set width ≤ X
Architectural Elements
SMR Object Repository (SMORE)
Zone sets are allocated and stored in a zone set table
An array of {disk, zone} tuples
E.g. Zone Set 1 includes zones on disks {1, 3, 6, 7, ... 47}
Zone Set 2 includes zones on disks {2, 3, 5, 7, ... 46, 48}
Etc. (an allocation sketch follows)
Architectural Elements
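A sketch of zone set allocation under the invariant the slide states: every zone in a set comes from a different disk, so one disk failure costs a zone set at most one fragment. The disk-selection policy here (most free zones first) is an assumption, not SMORE's documented policy:

```python
import itertools

class ZoneSetTable:
    """Allocates zone sets: arrays of (disk, zone) tuples in which no
    two zones share a drive."""

    def __init__(self, num_disks, zones_per_disk):
        # free_zones[d] = list of unallocated zone numbers on disk d
        self.free_zones = {d: list(range(zones_per_disk)) for d in range(num_disks)}
        self.table = {}                  # zone_set_id -> list of (disk, zone)
        self._next_id = itertools.count(1)

    def allocate(self, width):
        """Pick `width` distinct disks (the ones with the most free
        zones, to spread load) and take one free zone from each."""
        candidates = sorted(self.free_zones, key=lambda d: -len(self.free_zones[d]))
        chosen = [d for d in candidates if self.free_zones[d]][:width]
        if len(chosen) < width:
            raise RuntimeError("not enough disks with free zones")
        zone_set = [(d, self.free_zones[d].pop()) for d in chosen]
        zs_id = next(self._next_id)
        self.table[zs_id] = zone_set
        return zs_id, zone_set

# Example: table = ZoneSetTable(num_disks=48, zones_per_disk=100)
#          zs_id, zones = table.allocate(width=18)
```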
SMR Object Repository (SMORE)
Assume disk 5 fails, affecting Zone Sets 2 and 3
EC prevents data loss
All zone sets with a zone on disk 5 can be easily identified
New zones can be allocated and the data rebuilt
Only the zone set table is updated (see the sketch below)
Architectural Elements
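Continuing the sketch above, failure handling touches only the zone set table: find the zone sets with a zone on the failed disk, swap in a replacement zone on a surviving disk outside the set, and queue a rebuild of the lost fragment:

```python
def handle_disk_failure(table: ZoneSetTable, failed_disk: int):
    """Repair the zone set table after a disk failure. Object index
    entries are untouched; only (disk, zone) tuples change."""
    table.free_zones.pop(failed_disk, None)        # the disk is gone
    repairs = []
    for zs_id, zones in table.table.items():
        for i, (disk, zone) in enumerate(zones):
            if disk == failed_disk:
                used = {d for d, _ in zones}
                # a surviving disk with free zones, not already in this set
                spare = next(d for d in table.free_zones
                             if d not in used and table.free_zones[d])
                new_zone = table.free_zones[spare].pop()
                zones[i] = (spare, new_zone)
                repairs.append((zs_id, (spare, new_zone)))
    return repairs   # each entry: rebuild this zone set's lost fragment here
```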
SMR Object Repository (SMORE)
De-clustering reduces data loss in multi-fault scenarios
E.g. for a triple disk fault with 16/18 RS EC:
23.5% probability a given zone set has 0 lost zones
45.3% probability a given zone set has 1 lost zone
26.5% probability a given zone set has 2 lost zones
4.7% probability a given zone set has 3 lost zones
Prioritize rebuilding of zone sets with 2 lost zones
Architectural Elements
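These percentages follow a hypergeometric distribution. They are consistent with 48 drives and a zone set width of 18 (parameters inferred from the 16/18 EC and the disk numbering in the earlier examples), which a few lines verify:

```python
from math import comb

drives, width, failed = 48, 18, 3   # inferred parameters, see text

for lost in range(failed + 1):
    # Of the `failed` bad drives, `lost` fall among this zone set's
    # `width` drives and the rest among the other drives.
    p = comb(width, lost) * comb(drives - width, failed - lost) / comb(drives, failed)
    print(f"P({lost} zones lost in a given zone set) = {p:.1%}")

# Prints 23.5%, 45.3%, 26.5%, 4.7% -- matching the slide.
```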
SMR Object Repository (SMORE)
Layout marker blocks at the beginning of each fragment contain recovery information about the segment, used for recovering partially written zone sets
Digests at the end of each zone contain a summary of all segments in the zone, used for recovering filled zone sets
Architectural Elements
[Diagram: the fragment layout shown earlier, annotated with layout marker blocks at the start of each fragment and the digest at the end of each zone]
SMR Object Repository (SMORE)
Index Recovery
The index is discarded and rebuilt on a crash
This allows the index to not be replicated, lowering cost
The design approach is to make rebuilding the index fast & simple
Indexes are checkpointed and stored in dedicated zone sets
Any zone sets closed since the checkpoint are replayed using the zone set digest
Any zone sets opened since the checkpoint are replayed by reading through the zone to replay layout marker blocks (see the sketch below)
Architectural Elements
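A self-contained sketch of the replay logic described above; the ZoneSetState type and its fields are hypothetical stand-ins for what recovery would actually read off disk:

```python
from dataclasses import dataclass, field

@dataclass
class ZoneSetState:
    """What recovery needs to know about one zone set (hypothetical)."""
    zs_id: int
    closed_since_checkpoint: bool
    opened_since_checkpoint: bool
    digest: list = field(default_factory=list)          # (object_id, segments) pairs
    layout_markers: list = field(default_factory=list)  # (object_id, segments) pairs

def rebuild_index(checkpoint: dict, zone_sets: list) -> dict:
    """Start from the checkpointed index, then replay every zone set
    dirtied since the checkpoint: filled zone sets from their digest
    (one read at the end of the zone), partially written ones by
    scanning forward through their layout marker blocks."""
    index = dict(checkpoint)
    for zs in zone_sets:
        if zs.closed_since_checkpoint:
            entries = zs.digest            # cheap: one summary read per zone
        elif zs.opened_since_checkpoint:
            entries = zs.layout_markers    # slower: scan through the zone
        else:
            continue                       # clean: the checkpoint covers it
        for object_id, segments in entries:
            index[object_id] = segments
    return index
```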
SMR Object Repository (SMORE)
The implementation shows that index recovery time depends on the number of dirty zone sets
Trade extra I/O for faster recovery
The implementation's numbers are for serialized recovery, but it can easily be parallelized, with performance increasing with the number of drives (see the sketch below)
Implementation Results
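Because each zone set's digest or marker scan is an independent read on its own drives, the replay loop parallelizes naturally; a sketch, assuming a replay_one callable like the per-zone-set logic above:

```python
from concurrent.futures import ThreadPoolExecutor

def replay_all(zone_sets, replay_one, num_drives):
    """Replay dirty zone sets concurrently. Each replay is an
    independent disk read, so with roughly one worker per drive the
    replay rate scales with the number of drives."""
    with ThreadPoolExecutor(max_workers=num_drives) as pool:
        return list(pool.map(replay_one, zone_sets))
```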
SMR Object Repository (SMORE)
Even in the worst-case scenario, where every zone set has been dirtied, rebuild time is bounded
In real-world deployments, more frequent checkpoints would bound the number of dirty zone sets in a rebuild
If the index is lost entirely, the rebuild must read the digest from every zone set
Implementation Results
SMR Object Repository (SMORE)
For write I/Os, we achieved 100% of theoretical bandwidth, regardless of object size
For random read I/Os, seek latency constrains throughput for small objects
For sufficiently large objects, we achieved close to 100% of theoretical bandwidth
Implementation Results
[Chart: aggregate read performance]
SMR Object Repository (SMORE)
Staying within SMR drive workload limits requires low write amplification
The implementation achieves low write amplification, even with worst-case workloads
For cold data storage, deletes are infrequent, resulting in far lower write amplification levels
Implementation Results
SMR Object Repository (SMORE)
General lessons
Parallelism is critical for performance
Data spreading is critical for fault tolerance
The optimal zone set width changes based on the number of drives
Simplifying fault recovery dramatically reduces implementation and testing complexity
Using flash storage for indexes, metadata, etc. dramatically improves performance while reducing complexity
Lessons Learned
SMR Object Repository (SMORE)
SMR-specific lessons
The append-only model simplifies implementation and increases write performance
NVRAM is needed to reliably buffer data due to SMR disk block sizes (it can be omitted if a minimum object size is enforced); a buffering sketch follows
NVRAM improves mixed I/O performance by reducing seeks
SMR disk write performance is often significantly lower than read performance; it is easy to exceed SMR disk workload limits
Lessons Learned
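A sketch of the NVRAM FIFO's role, reusing the Zone model from the SMR overview: incoming data is staged until whole physical blocks can be appended, since zones accept only block-aligned sequential writes. The WriteBuffer name and block size are assumptions; in real hardware the staging memory would be battery- or flash-backed so acknowledged data survives power loss:

```python
class WriteBuffer:
    """Stages incoming writes until a full block can be appended."""

    def __init__(self, zone, block_size=4096):
        self.zone = zone
        self.block_size = block_size
        self.pending = bytearray()      # stands in for the NVRAM FIFO

    def write(self, data: bytes):
        """Accumulate data; flush only whole blocks to the zone."""
        self.pending += data
        full = len(self.pending) // self.block_size * self.block_size
        if full:
            self.zone.append(bytes(self.pending[:full]))
            del self.pending[:full]

    def flush(self):
        """Pad the tail (always shorter than one block here) out to a
        block boundary and force it to disk."""
        if self.pending:
            self.zone.append(bytes(self.pending).ljust(self.block_size, b"\0"))
            self.pending.clear()
```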
Questions & Answers
Questions from the audience
My contact information: [email protected]