Yonsei University

Data Processing Systems for Solid State Drives

Mincheol Shin, Yonsei University
2015.11.23
Overview
• Main Target: Data Processing Systems with SSDs
• Purpose: Improving I/O Performance
• Data Processing Systems
  – Relational Database Management Systems
    • e.g. Oracle, MySQL, PostgreSQL, SQLite
  – Distributed Data Processing Systems
    • e.g. Hadoop Distributed File System, MapReduce, Hive, HBase, Tajo, Spark
  – Key-value Stores
    • e.g. Redis
Outline
• Solid State Drive (SSD)
• RDBMS on Solid State Drives
• Big Data Processing for Solid State Drives
Solid State Drive: Flash Memory [VLDB2011Tut2]
• Great Performance!
  – High I/O Performance: 41 MB/s Read, 7.5 MB/s Program [Micron 2014]
  – Fast Random Access: under 0.1 ms (HDD: 2.9 to 12 ms)
  – Low Energy Consumption
• Four Constraints of NAND Flash Memory
  – C1: Program granularity is a whole page (2 KB ~ 16 KB)
  – C2: Must erase a block (256 KB ~ 1 MB) before updating a page
  – C3: Pages must be programmed sequentially within a block
  – C4: Limited lifetime (10^4 ~ 10^5 program/erase cycles)
[Figure: a 1 MB erase block composed of 4 KB pages]
[VLDB2011Tut2] P. Bonnet, L. Bouganim, I. Koltsidas, S. D. Viglas. System Co-Design and Data Management for Flash Devices. VLDB 2011 Tutorial.
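The four constraints can be captured in a toy model (a hypothetical sketch; the page size, block size, and erase limit below are illustrative values, not from any datasheet):

```python
# Toy model of a NAND flash block illustrating constraints C1-C4.
# Sizes and the erase limit are illustrative, not from a datasheet.

PAGE_SIZE = 4096          # C1: program granularity is a whole page
PAGES_PER_BLOCK = 256     # 256 * 4 KB = a 1 MB erase block (C2)
ERASE_LIMIT = 10_000      # C4: limited lifetime in program/erase cycles

class FlashBlock:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.next_page = 0    # C3: pages must be programmed in order
        self.erase_count = 0

    def program(self, page_no, data):
        assert len(data) == PAGE_SIZE, "C1: must program a full page"
        assert page_no == self.next_page, "C3: sequential programming only"
        assert self.pages[page_no] is None, "C2: erase before re-program"
        self.pages[page_no] = data
        self.next_page += 1

    def erase(self):
        # C2: the only way to make pages writable again is a whole-block erase
        assert self.erase_count < ERASE_LIMIT, "C4: block worn out"
        self.pages = [None] * PAGES_PER_BLOCK
        self.next_page = 0
        self.erase_count += 1
```

Programming a page out of order or re-programming without an erase trips the corresponding assertion, mirroring what real flash forbids.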
Solid State Drive
• Solid State Drive (SSD)
  – Definition: persistent data storage with no spinning disks or drive motor
  – Supports traditional block I/O
• Characteristics of SSDs
  – Fast Random Access (inherited from flash memory)
  – Read/Write Imbalance (inherited from flash memory)
  – Internal Parallelism (from the SSD's internal structure)
  – In-Storage Processing
[Figure: SSD internals — the host interface (SATA, SAS, PCIe) accepts Read(addr) and Write(addr, data); the internal algorithm (FTL) performs mapping, wear leveling, and garbage collection; the physical storage is an array of flash chips driven by read/program/erase operations]
Solid State Drive: Flash Translation Layer (FTL)
• Flash Translation Layer
  – Converts block I/O operations into internal flash operations
  – Three Major Components
    • Mapping
      – Maps a Logical Block Address (LBA) to a physical page
    • Garbage Collection
      – Reclaims blocks that hold invalidated pages
    • Wear Leveling
      – Extends the lifetime of the SSD
[Figure: garbage collection example — after updates invalidate pages, a block containing only invalid (I) pages is erased and reclaimed, while blocks holding valid (v) pages remain]
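The mapping and garbage-collection roles above can be sketched in a few lines of Python (a toy page-mapping FTL; the block sizes, the greedy victim policy, and all names are simplifications for illustration):

```python
# Minimal page-mapping FTL sketch: out-of-place updates plus greedy
# garbage collection. Assumes the GC victim has free room for its
# survivors, which real FTLs guarantee via over-provisioning.

class SimpleFTL:
    def __init__(self, num_blocks=4, pages_per_block=4):
        self.ppb = pages_per_block
        # each physical page slot holds the LBA stored there, or None (invalid)
        self.blocks = [[] for _ in range(num_blocks)]  # append-only per block
        self.map = {}                 # LBA -> (block_idx, page_idx)
        self.erases = [0] * num_blocks

    def _free_block(self):
        for i, blk in enumerate(self.blocks):
            if len(blk) < self.ppb:
                return i
        return self._gc()             # no free page anywhere: reclaim one

    def write(self, lba):
        # out-of-place update: invalidate the old page, program a new one
        if lba in self.map:
            b, p = self.map[lba]
            self.blocks[b][p] = None  # mark the old physical page invalid
        b = self._free_block()
        self.blocks[b].append(lba)    # C3: program sequentially within a block
        self.map[lba] = (b, len(self.blocks[b]) - 1)

    def _gc(self):
        # greedy policy: erase the block with the fewest valid pages
        victim = min(range(len(self.blocks)),
                     key=lambda i: sum(p is not None for p in self.blocks[i]))
        survivors = [p for p in self.blocks[victim] if p is not None]
        self.blocks[victim] = []      # erase the whole block
        self.erases[victim] += 1
        for lba in survivors:         # relocate the still-valid pages
            del self.map[lba]
            self.write(lba)
        return victim
```

With 2 blocks of 2 pages, three writes plus two overwrites are enough to fill the device, trigger GC on the all-invalid block, and leave every LBA still mapped.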
Solid State Drive: Internal Parallelism
• An SSD can read/write data in parallel
[Figure: internal parallelism — flash packages behind the host interface (SATA, SAS, PCIe) are spread across N parallel channels (channel-level parallelism) and interleaved within each channel (package-level parallelism); the timeline shows reads 1–8 overlapping across packages 1–4 on channels 1–2 while data transfers 1–8 are pipelined on the channels]
Solid State Drive: Internal Parallelism
• Using internal parallelism, an SSD achieves
  – High performance for sequential I/O
    • Similar to striping (RAID 0)
    • Sequential bandwidth of a SATA SSD
      – Write: 450 MB/s
      – Read: 500 MB/s
  – High performance for concurrent I/O
[VLDB2012Roh] H. Roh, S. Park, S. Kim, M. Shin, S.-W. Lee. B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives. VLDB 2012.
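A host-side way to exploit this is simply to keep many read requests in flight at once. The sketch below (hypothetical helper names, demonstrated on a scratch file rather than a raw SSD) issues independent `pread` calls from a thread pool so the drive can overlap them across channels and packages:

```python
# Sketch: issue many outstanding reads at once so the SSD can serve them
# from different channels/packages in parallel. Real gains require an SSD
# and direct I/O; this simplified demo uses a scratch file.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE = 4096

def parallel_read(path, offsets, workers=8):
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # each pread is independent, so many requests stay in flight,
            # exposing the SSD's internal parallelism to the kernel/device
            return list(pool.map(lambda off: os.pread(fd, PAGE, off), offsets))
    finally:
        os.close(fd)

# demo on a scratch file: page i is filled with byte value i
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(8):
        f.write(bytes([i]) * PAGE)
    path = f.name

pages = parallel_read(path, [i * PAGE for i in range(8)])
os.unlink(path)
```

Because `pool.map` preserves argument order, the results line up with the requested offsets even though the reads complete concurrently.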
Solid State Drive: In-Storage Processing
• An SSD has a CPU and memory to run the FTL
• The host interface is the bottleneck!
  – The host interface has lower bandwidth than the SSD's internal bandwidth
• Two approaches
  – Light-weight filter in the SSD
    • Transfers less data through the host interface
    • Filters tuples using predicates
  – Sub-modules in the SSD
    • e.g. transaction management with copy-on-write (COW)
• Implementing ISP requires a special SSD
  – OpenSSD, SmartSSD, and so on
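The light-weight filter idea can be illustrated with a toy model (all class and function names here are invented; a real implementation would run inside the SSD firmware):

```python
# Toy illustration of the in-SSD filter: the device scans its pages
# locally and ships only matching tuples across the host interface,
# instead of transferring every page to the host.

class DeviceFilter:
    """Pretend firmware-side scan: sees raw pages, returns matches only."""
    def __init__(self, pages):
        self.pages = pages               # list of pages of (key, value) tuples

    def scan(self, predicate):
        sent = []
        for page in self.pages:          # full scan happens inside the device
            sent.extend(t for t in page if predicate(t))
        return sent                      # only this crosses the host interface

# 3 pages x 4 tuples; value = key * 10
pages = [[(i, i * 10) for i in range(p * 4, p * 4 + 4)] for p in range(3)]
device = DeviceFilter(pages)

hits = device.scan(lambda t: t[1] >= 80)   # predicate pushed into the device
```

Here 12 tuples are scanned but only 4 are transferred, which is the whole point when the host interface, not the flash, is the bottleneck.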
DBMS on Solid State Drive
• Main research areas:
  – Buffer Management
  – Index Management
  – Query Processing
  – Transaction Management
• Most research on DBMSs with SSDs has focused on storage I/O
DBMS on Solid State Drive: Index Management
• FD-tree
  – Exploits the sequential bandwidth of SSDs
  – B-tree + sorted runs
• PIO B-tree
  – Exploits the internal parallelism of SSDs
  – Accesses multiple B-tree nodes along multiple paths in parallel
DBMS on Solid State Drive: Query Processing
• FlashJoin: PAX-based query processing
  – NSM layout
    • The most typical page layout
    • Tuples are stored in a contiguous region
  – PAX layout
    • Values of each column are stored in a contiguous region (a minipage)
    • PAX was originally designed to reduce CPU cache misses
  – FlashScan reads only the needed minipages
  – FlashJoin joins the minipages read by FlashScan
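The difference between the two layouts can be sketched with plain Python lists and dicts (a schematic illustration only; real pages also carry headers and offset arrays):

```python
# NSM vs. PAX page layouts: NSM stores whole tuples contiguously,
# PAX groups each column's values into a "minipage" within the page,
# so a scan that needs one column touches only that minipage.

tuples = [(1, "ann", 30), (2, "bob", 25), (3, "eve", 41)]

# NSM: one contiguous region per tuple (row-at-a-time)
nsm_page = list(tuples)

# PAX: one minipage per column within the same page
pax_page = {
    "id":   [t[0] for t in tuples],
    "name": [t[1] for t in tuples],
    "age":  [t[2] for t in tuples],
}

# FlashScan-style access: read only the minipage(s) a query needs
ages = pax_page["age"]          # touches 1 of 3 minipages

# NSM forces reading every full tuple to extract the same column
ages_nsm = [t[2] for t in nsm_page]
```

Both layouts yield the same answer; PAX just lets the scan skip the `id` and `name` minipages entirely, which is what reduces I/O on an SSD.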
DBMS on Solid State Drive: Query Processing
• FMSort
  – Exploits the internal parallelism of SSDs
  – During the merge phase, reads multiple sorted runs in parallel
DBMS on Solid State Drive: Transaction Mgmt.
• X-FTL: Shadow Paging in the SSD
  – SSD write operations already resemble copy-on-write
    • When a page is updated, the modified page is written to an empty page
    • The old page is then invalidated
  – X-FTL keeps the old pages until the transaction commits
  – No copying of the original pages is required
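A rough sketch of the shadow-paging idea (the interfaces are invented for illustration; the real X-FTL operates on physical flash pages inside the FTL, not Python dicts):

```python
# Shadow-paging sketch in the spirit of X-FTL: since SSD writes are
# already out-of-place, the FTL can keep both old and new versions of
# a page and flip the mapping only at commit; abort just discards the
# new versions, with no page copying either way.

class ShadowFTL:
    def __init__(self):
        self.mapping = {}    # LBA -> committed page contents
        self.shadow = {}     # txn_id -> {LBA: uncommitted page contents}

    def write(self, txn, lba, data):
        # program a new copy; the committed (old) page is left untouched
        self.shadow.setdefault(txn, {})[lba] = data

    def read(self, txn, lba):
        # a transaction sees its own uncommitted writes first
        return self.shadow.get(txn, {}).get(lba, self.mapping.get(lba))

    def commit(self, txn):
        # atomically switch the mapping to the new pages
        self.mapping.update(self.shadow.pop(txn, {}))

    def abort(self, txn):
        # old pages were never overwritten, so abort is just a discard
        self.shadow.pop(txn, None)
```

Commit is a mapping update and abort is a dictionary drop, which mirrors the slide's point that no original pages ever need to be copied.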
Big Data on Solid State Drive
• Three approaches to improving performance with SSDs
  – Complete replacement
    • Higher cost per capacity
  – Selective replacement
    • e.g. intermediate results on SSDs, HDFS data on HDDs
  – SSD as a cache
    • Commercial and noncommercial caching software exists
    • Open source: bcache, flashcache, EnhanceIO, dm-cache
    • Project with SK Telecom
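The caching approach can be illustrated with a minimal LRU cache in front of a slow tier (a sketch in the spirit of bcache/flashcache; real cache software also handles writeback, persistence, and block-level mapping):

```python
# Minimal "SSD as a cache" sketch: a small fast tier (stand-in for the
# SSD) absorbs reads in front of a large slow tier (stand-in for the HDD),
# evicting the least recently used block when the fast tier fills up.
from collections import OrderedDict

class SsdCache:
    def __init__(self, backing, capacity):
        self.backing = backing          # slow tier: block -> data
        self.cache = OrderedDict()      # fast tier with LRU ordering
        self.capacity = capacity
        self.hits = self.misses = 0

    def read(self, block):
        if block in self.cache:
            self.hits += 1
            self.cache.move_to_end(block)      # refresh LRU position
            return self.cache[block]
        self.misses += 1
        data = self.backing[block]             # slow-tier read
        self.cache[block] = data               # promote into the fast tier
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return data
```

Re-reading a recently accessed block hits the fast tier, while cold blocks still fall through to the slow tier, which is exactly the trade the cache deployments above make.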
• Archival Storage in HDFS
  – Stores replicas across four tiers of storage
    • ARCHIVE: the slowest storage with the largest capacity (petabytes)
    • DISK, SSD, RAM_DISK
    • https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Storage_Types:_ARCHIVE_DISK_SSD_and_RAM_DISK
• Issues
  – Industry leads the Big Data processing platform area
  – There is no standard model
  – CPU overheads are too high