architecting flash for application acceleration and memory efficiency by nisha talagala
DESCRIPTION
SanDisk Fellow, Nisha Talagala, discusses the evolution of flash memory and why its perfect for application acceleration.TRANSCRIPT
2
Forward-Looking Statements
During our meeting today we may make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to industry trends, future memory technology and product capabilities and performance. Information in this presentation may also include or be based upon information from third parties, which reflects their expectations and projections as of the date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due to factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof or as of the date of issuance by a third party, as the case may be.
3
Non Volatile Memory (NVM) Flash (NAND)
– 100s GB to 10 TB per device
– Trends - Higher capacity, lower cost/GB, lower write cycles, SLC->MLC->3BPC
– 100K to millions of IOPS, GB/s of bandwidth
PCM/MRAM/STT/Other NVMs
– Still in research
4
Why Flash? Perfect match for
Databases
Low latency – good for low queue I/O
Handles mixed sequential and random I/O at various block sizes much better than disk
▸ Capacity
▸ IOPS
▸ Cost/IOP
4TB 3TB
15 200,000
$$$$ ¢¢¢¢
5
Evolution of Flash UsageFLASH + DISK FLASH AS DISK FLASH BEYOND DISK FLASH AS MEMORY
Increased flash awareness
Better Transactions/$/Watt
6
Flash Beyond Disk: Not Just Fast DiskArea Hard Disk Drives Flash Devices
Read/Write Performance Largely symmetrical Heavily asymmetrical. Additional operation (erase)
Sequential vs Random Performance 100x difference. Elevator scheduling for disk arm
<10x difference. No disk arm – NAND die
Mapping and Background ops Rare Regular occurrence – like a log structured file system
Wear out Largely unlimited Limited writes
IOPS 100s to 1,000s 100Ks to Millions
Latency 10s ms 10s-100s us
7
Flash-Awareness: Opportunities in the I/O Stack
Flash devices –I/O and Primitives (Atomics, PTRIM, etc)- Memory, Persistent Memory
File System (XFS, Ext4, Btrfs, NVMFS)
Apps
APIs, Libraries
TransparentOptimization
Flash-AwareOptimization
8
Examples covered in this talk- MySQL- MongoDB- LevelDB
Gigaspaces with ZetaScale™ Software – See OPEN Session 301-A: Flash Changes the Game in Application Performance and TCO (Enterprise Applications Track)
9
Example: MySQL
10
Flash as Disk: Tune for a Really Fast Disk Has been going on now for
several years with great results
Targeted data placement, NOOP scheduler, seek-less optimizations, concurrency improvements, etc.
Make the block I/O stack faster
Find the fastest file-system
11
Multi-Instance: Using all the IOPS
Instances1 2 4
Thro
ughp
ut, N
OT/
10se
c
0
4000
6000
8000
10000
12000
Fusion-io, 48 threads, 2400W – 64GB BP
4810
8788
11952
2000
12
Atomic WritesTraditional MySQL Writes MySQL with Atomic Writes
Page CPage B
Page A
Buffer
DRAMBuffer
SSD (or HDD) Database
Database Server
Page CPage BPage A
Page CPage BPage A
Page CPage BPage A
Application initiates updates to pages A, B, and C.
1
MySQL copies updated pages to memory buffer.
2
MySQL writes to double-write buffer on the media.
3
Once step 3 is acknowledged, MySQL writes the updates to the actual tablespace.
4
ioMemory Database
Page CPage BPage A
DRAMBuffer
Page CPage BPage A
Application initiates updates to pages A, B, and C.
1
MySQL copies updated pages to memory buffer.
2
MySQL writes to actual tablespace, bypassing the double-write buffer step due to inherent atomicity guaranteed by the (intelligent) device.
3
Database Server
Page CPage B
Page A
13
Performance with Atomic Writes
Double-write disabled – Non-ACID
Double-write – ACID
Without Atomic Writes With Atomic I/O
Twice the performance AND with ACID properties
Atomic writes – ACID
• Atomic writes at 99% of the performance of raw writes• 2x flash device endurance improvement
14
4 204 404 604 804 10041204140416041804200422042404260428043004320434040
20
40
60
80
100
120
140
160
180
200
Sysbench 99% Latency OLTP workload
XFS DoubleWriteDirectFS Atomic
Seconds
Mill
iseco
nds
Atomic Writes: Latency Improvement2-4x Latency Improvement on Percona Server
Atomic Writes
15
NVM-Compression Uses natural thin-provisioning of the
underlying flash No bit packing at MySQL, no
rebalancing Holes in data files are
TRIMed/unmapped Multi-threaded flush and atomics to
reduce latency Pluggable compression algorithms
ioMemory-VSL
NVMFS
MySQL
16
Performance Improvement (LinkBench)
Uncompressed Row Compression NVM-Compression0%
10%20%30%40%50%60%70%80%90%
100%
100%
20%
90%
Transaction Rate Compression Performance Penalty
17
Performance Improvement: (TPCC-like)
Time 230 460 690 920 1150 1380 1610 1840 2070 2300 2530 2760 2990 3220 34500
5000
10000
15000
20000
25000
30000
TPC-C like workloadMariaDB 10
1,000 warehouses - 75GB DRAM
MySQL uncompressedMySQL compressionFusion-io Compression
New Order TX
Time
86%
18
Capacity Efficiency and Endurance
Row-comp Page-comp
Series1 0.489986187845304 0.584584947449068
45.0%47.0%49.0%51.0%53.0%55.0%57.0%59.0%
Vs. Uncompressed *
% im
prov
emen
t
*For LinkBench with lz77. Comparable results with lz4.
• Architecturally more storage efficient than row compression
• When combined with Atomic Writes, can improve flash endurance ~4x
19
Flash Aware API Integration for Database
20
Example: MongoDB
21
Reducing DRAM with FlashMongoDB Throughput (operations/second)
24 GB Node 12 GB Node 8 GB Node0
2000
4000
6000
8000
-26% -33%
2x DRAM Reduction 3x DRAM Reduction
22
Improving Performance Predictability while reducing DRAM
12GB 2x DRAM Reduction
8GB 3x DRAM Reduction
0
1000
2000
3000
4000
5000
6000
Improved Predictability with Less DRAM
24G12G8G
24GB Baseline
23
Improving Performance Predictability while reducing DRAM
Throughput Deviation (operations/second)
24 GB Node 12 GB Node 8 GB Node0
100
200
300
400
2x DRAM Reduction 3x DRAM Reduction
3x Improvement
24
Improving Performance Predictability with Less DRAMLatency Deviation (microsecond)
S e r i e s 10
3
6
9
12
Update Latency Deviation Read Latency Deviation
2x DRAM Reduction 3x DRAM Reduction
Tighter consistency even with 3x less DRAM
25
Example: LevelDB
26
LevelDB, NVMKV, and Raw Block Device
Fusion-io flash aware stack can bring full performance of flash to KV stores
27Fusion-io Confidential
Improving endurance and lifetime with Flash Awareness
RW: Random Async WriteRW-S: Random Sync WriteSW: Sequential Async WriteSW-S: Sequential Sync Write
Improving endurance by 2x-40x with flash aware key value storage
28
Performance Predictability
29
Transaction throughput (YCSB)
Throughput improvements of 50%-300% while using less DRAM
30
Availability Atomic Writes
– MariaDB mainline >= 5.5.31– Percona Server >= 5.5.31– Oracle MySQL >= 5.7.4
NVM Compression– MariaDB – 10.0.9– Percona Server – 5.6 – Oracle MySQL – labs release (http://labs.mysql.com/)
NVMFS available early access MongoDB – works with existing MongoDB releases NVMKV – available at opennvm.github.io
31
https://opennvm.github.io
http://www.opencompute.org/projects/storage
32
Thank You
33
Appendix
34
File System Support
POSIX Interface Functionality
fallocate(offset, len) Preallocation and extending files/table space
fallocate(PUNCH_HOLE) Unmap/Punch hole operation.(issue Persistent TRIMs)
io_submit() AIO Transparent Atomic writes
34
NVM-Compression uses POSIX compliant Interfaces
NVM-Compression speed from NVMFS file system
35
NVMFS Non Volatile Memory FileSystem A POSIX compliant filesystem, designed by Fusion-io
Strengths
– Efficient pre-allocation of large files– Pre-allocation of large files is efficient– No fragmentation/degradation with aging of filesystem– Near raw flash speeds for I/O, atomic writes and TRIM
36
©2014 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. ZetaScale is a trademark of SanDisk Enterprise IP LLC. Fusion-IO, ioDrive and ioDrive Duo are trademarks of Fusion-IO, Inc. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their holder(s).