ceph performance optimization.pptx
TRANSCRIPT
1
Ceph on All-Flash Storage – Breaking Performance Barriers
Axel Rosenberg, Director ISV & Partner EMEA
June 2016
Old Model
• Monolithic, large upfront investments, and fork-lift upgrades
• Proprietary storage OS
• Costly: $$$$$
New SD-AFS Model
• Disaggregate storage, compute, and software for better scaling and costs
• Best-in-class solution components
• Open source software – no vendor lock-in
• Cost-efficient: $
Software-Defined All-Flash Storage – The Disaggregated Model for Scale
Data Center Solutions 3
Software-Defined All-Flash Storage – The Disaggregated Model for Scale
Services
Software
Compute
Networking
Storage – SanDisk Flash
4
SanDisk FlashStart™ and FlashAssure™
• Installation and training services
• 24/7 global onsite TSANET collaborative solutions support
• 2-hour parts delivery – 750+ global locations
Software-Defined All-Flash Storage – The Disaggregated Model for Scale
Services / Software / Compute / Networking / Storage
SanDisk Flash:
• Shared flash storage – INFINIFLASH
• Flash in server
Compute choice
InfiniFlash IF550 All-Flash Storage System – Block and Object Storage Powered by Ceph
Ultra-dense, high-capacity flash storage
• 512TB in 3U; scale-out software for PB-scale capacity
Highly scalable performance
• Industry-leading IOPS/TB
Cinder, Glance and Swift storage
• Add/remove servers & capacity on demand
Enterprise-class storage features
• Automatic rebalancing
• Hot software upgrade
• Snapshots, replication, thin provisioning
• Fully hot-swappable, redundant
Ceph optimized for SanDisk flash
• Tuned & hardened for InfiniFlash
InfiniFlash SW + HW Advantage
Software Storage System
Software tuned for hardware
• Ceph modifications for flash
• Both Ceph and the host OS tuned for InfiniFlash
• SW defects that impact flash identified & mitigated
Hardware configured for software
• Right balance of CPU, RAM, storage
• Rack-level designs for optimal performance & cost
Software designed for all systems does not work well with any system: Ceph has over 50 tuning parameters that together yield a 5x–6x performance improvement
Fixed-CPU/RAM hyperconverged nodes do not work well for all workloads
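As an illustration of the kind of flash-oriented tuning the slide refers to, a hypothetical ceph.conf fragment is sketched below. The option names are real Hammer-era Ceph parameters, but the values are placeholders for illustration, not SanDisk's actual settings:

```ini
[osd]
# Shard the op queue across more worker threads so one hot lock
# doesn't serialize the OSD at flash rates
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2
# Deeper FileStore queues keep the flash device busy
filestore_queue_max_ops = 5000
filestore_op_threads = 8
# Larger journal batches amortize per-IO overhead
journal_max_write_entries = 1000
journal_queue_max_ops = 3000
```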
7
Began in summer of ’13 with the Ceph Dumpling release
Ceph was optimized for HDD
• Tuning AND algorithm changes needed for flash optimization
• Leave defaults for HDD
Quickly determined that the OSD was the major bottleneck
• OSD maxed out at about 1,000 IOPS on the fastest CPUs (using ~4.5 cores)
Examined and rejected multiple OSDs per SSD
• Failure domain / CRUSH rules would be a nightmare
Optimizing Ceph for the All-Flash Future
8
Context switches matter at flash rates
• Too much “put it in a queue for another thread”
• Too much lock contention
Socket handling matters too!
• Too many “get 1 byte” calls to the kernel for sockets
• Disable Nagle’s algorithm to shorten operation latency
Lots of other simple things
• Eliminate repeated look-ups in maps, caches, etc.
• Eliminate redundant string copies (especially returning strings by value)
• Large variables were passed by value instead of by const reference
Contributed improvements to the Emperor, Firefly and Giant releases
Now obtain >80K IOPS per OSD using around 9 CPU cores (Hammer)*
SanDisk: OSD Read path Optimization
* Internal testing normalized from 3 OSDs / 132GB DRAM / 8 Clients / 2.2 GHz XEON 2x8 Cores / Optimus Max SSDs
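Ceph disables Nagle's algorithm in its C++ messenger (the `ms tcp nodelay` option, on by default in modern releases). The socket-level mechanism can be shown with a minimal Python sketch; this is an illustration of the TCP option, not Ceph's actual networking code:

```python
import socket

def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled.

    With Nagle enabled, small writes (like message headers) can be held
    back waiting for an ACK on previously sent data, adding latency to
    request/response traffic.  TCP_NODELAY sends small segments at once.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s

s = make_low_latency_socket()
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```

The same idea applies to the "get 1 byte" complaint: reading messages in large buffered chunks instead of many tiny `recv` calls cuts kernel-crossing overhead.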
9
The write path strategy was classic HDD
• Journal writes for minimum foreground latency
• Process the journal in batches in the background
The batch-oriented processing was very inefficient on flash
Modified the buffering/writing strategy for flash
• Recently committed to the Jewel release
• Yields a 2.5x write throughput improvement over Hammer
• Average latency is half that of Hammer
SanDisk: OSD Write path Optimization
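The double-write cost of the classic journaled path can be captured in a toy model. This is a simplification for illustration (data bytes only, ignoring metadata and batching effects), not a model of the actual Jewel change:

```python
def device_writes(client_bytes: int, journaled: bool) -> int:
    """Bytes that reach the flash device for a given amount of client data.

    The classic FileStore path writes everything twice: once to the
    journal (for low foreground latency) and once to the backing store
    when the background batch is flushed.  A path that commits data once
    halves the device traffic, which also halves write amplification.
    """
    return 2 * client_bytes if journaled else client_bytes

# Journaled path costs twice the device writes of a single-commit path.
assert device_writes(4096, journaled=True) == 2 * device_writes(4096, journaled=False)
```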
10
RDMA intra-cluster communication
• Significant reduction in CPU per IOP
BlueStore
• Significant reduction in write amplification -> even higher write performance
Memory allocation
• tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs. default*
Erasure coding for blocks (native)
* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
SanDisk: Potential Future Improvements
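The allocator tuning mentioned above is commonly applied through the environment rather than ceph.conf. A hypothetical fragment is sketched below; the tcmalloc variable is a real tcmalloc environment knob, but the file path and value are illustrative and vary by distro:

```ini
# /etc/sysconfig/ceph (path and value are illustrative)
# Raise tcmalloc's aggregate thread-cache size so OSD worker threads
# stop contending on the central free list under high IOPS:
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
```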
InfiniFlash Ceph Performance – Measured on a 256TB Cluster
Test Cluster Topology – 8-node cluster with InfiniFlash, 256TB
Sandisk Confidential
Test configuration (8-node cluster, 8 drives shared to each OSD node):
• OSD nodes: 8 servers (SuperMicro), 2x E5-2697 v3 (14C, 2.6GHz), 8x 16GB DDR4 ECC 2133MHz, 1x Mellanox ConnectX-3 dual 40GbE
• Clients: 8–10 servers (Dell R620), 2x E5-2680 (10C, 2.8GHz, 25M cache), 4x 16GB RDIMM dual-rank x4 (64GB), 1x Mellanox ConnectX-3 dual 40GbE
• Storage: InfiniFlash, connected fully populated in A8 topology; total storage = 4TB x 64 BSSDs = 256TB; FW 2.1.1.0.RC
• Network: 40G switch, Brocade VDX 8770
• OS: Ubuntu 14.04.02 LTS 64-bit, kernel 3.16.0-70-generic; LSI SAS3008 card/driver; Mellanox ConnectX-3 40Gbps NIC
• Cluster: Ceph version 0.94.5, sndk-ifos-1.2.1.21
• Replication: 2, host-level
• Pools/PGs/RBDs: 10 pools, 1,536 PGs per pool, 1 RBD from each pool (note: can change for various hardware configurations)
• RBD size: 2TB
• Monitors: 1 (3 or more recommended for deployment)
• OSD nodes: 8; OSDs per node: 8; total OSDs = 8 x 8 = 64
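One way to sanity-check this layout is the usual PGs-per-OSD arithmetic (the ~100–200 PGs/OSD community guideline mentioned in the comment below is general Ceph context, not from the slide):

```python
# PGs per OSD = pools * PGs-per-pool * replicas / OSDs, using the
# cluster configuration from the table above.
pools = 10
pgs_per_pool = 1536
replicas = 2
osds = 64  # 8 nodes * 8 OSDs each

pgs_per_osd = pools * pgs_per_pool * replicas / osds
assert pgs_per_osd == 480  # well above the usual ~100-200 guideline,
                           # consistent with the "can change" caveat
```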
IFOS Performance: librbd
Environment:
• librbd IO performance is collected on an 8-node cluster zoned to a single InfiniFlash chassis with replication of 2
• 10 x 2TB RBD images from 10 pools, with 10 client machines
• Cluster capacity: 40TB (10 RBDs x 2TB each x 2-way replication)
• Raw precondition: drives preconditioned for 24 hours
• Cluster precondition: 256K sequential write IO to fill the images
• Performance measured using fio with the rbd engine; each fio instance with 10 jobs and 40 shards
• OSD nodes with cluster public & private networks
Tools:
• fio 2.2
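A fio job file roughly matching the profile described above might look like the following sketch. The pool name, image name, client name and queue depth are placeholders, not the values used in the actual test:

```ini
; Illustrative fio job: librbd random read, 10 jobs per instance
[global]
ioengine=rbd
clientname=admin
pool=rbd_pool_0
rbdname=rbd_image_0
direct=1
rw=randread
bs=4k
iodepth=32
numjobs=10
group_reporting=1

[rbd-randread]
```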
IF500 LIBRBD IO PERFORMANCE
Analysis
• Ceph has been tuned to achieve 75% of bare-metal performance
• All 8 hosts' CPUs saturated for 4K random read; more performance potential with higher CPU cycles
• With 64K and larger IO sizes we are able to utilize the full InfiniFlash bandwidth of over 12GB/s
• librbd and krbd performance are comparable
[Chart: Random Read – Total IOPS and bandwidth (GB/s) by block size. 4K: 1,557,264 IOPS (6.2 GB/s); 64K: 207,739 IOPS (13.2 GB/s); 256K: 48,298 IOPS (12.3 GB/s)]
[Chart: Random Write – Total IOPS and bandwidth (GB/s) by block size. 4K: 174,021 IOPS; 64K: 44,571 IOPS (2.8 GB/s); 256K: 12,203 IOPS (~3 GB/s)]
IFOS Performance: IOPS/Latency/QD
4K random read: avg latency (µsec); 4K random write: avg latency (µsec)
Environment
• librbd IO read latency measured on a single InfiniFlash chassis (64 BSSDs) with 2-way replication at host level, 8 OSD nodes
• fio read IO profile: 4K block, numjobs=4, 10 clients
• fio write IO profile: 4K block, numjobs=4, 6 clients
[Chart: 4K random read – Total IOPS and avg latency (µs) vs. IO depth per client (1–128), numjobs=4, 10 clients]
[Chart: 4K random write – Total IOPS and avg latency (µs) vs. IO depth per client (1–128), numjobs=4, 6 clients]
Average latency: 2.1ms (read), 0.67ms (write)
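Little's law ties these latency figures to throughput: sustained IOPS equals outstanding IOs divided by average latency. A quick sketch (the queue depth of 64 here is picked for illustration, not read off the chart):

```python
def implied_iops(clients: int, numjobs: int, iodepth: int,
                 avg_latency_s: float) -> float:
    """Little's law: throughput = concurrency / latency.

    Total concurrency is the number of IOs kept in flight across all
    clients and jobs; dividing by average completion latency gives the
    steady-state IOPS the cluster must be sustaining.
    """
    return clients * numjobs * iodepth / avg_latency_s

# Hypothetical point: 10 clients * 4 jobs * QD 64 at 2.1 ms avg latency
assert round(implied_iops(10, 4, 64, 0.0021)) == 1219048
```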
InfiniFlash for OpenStack with Dis-Aggregation
Compute & Storage Disaggregation enables Optimal Resource utilization
Allows for more CPU usage required for OSDs with small Block workloads
Allows for higher bandwidth provisioning as required for large Object workload
Independent scaling of compute and storage – higher storage capacity needs don't force you to add more compute, and vice-versa
Leads to optimal ROI for PB-scale OpenStack deployments
[Diagram: Compute farm – iSCSI targets (KRBD) serving iSCSI LUNs, web servers (RGW) serving a Swift object store, and QEMU/KVM hosts (librbd) running Nova with Cinder & Glance – connected over SAS to a storage farm of InfiniFlash HSEB A/HSEB B pairs running the OSDs]
Customer example: Running Ceph in an OpenStack environment. It's general-purpose. One of the apps runs TV and movie listings for pay-per-view. Each set-top box talks to a VM when it needs updating (next page of listings, etc.) – their STBs are interactive with the OpenStack cloud.
19
Before
• Customer deployed an OpenStack-based private cloud on RH Ceph
Pain points
• Cinder storage based on Ceph on HDD could not get over 10K IOPS (8K blocks)
• Limiting private cloud expansion to include higher-performance applications
• Solutions considered before the IF550 did not work: alternate solutions with separate storage architectures for high-performance applications add significant cost and management overhead, defeating the purpose of OpenStack
With IF550
• Able to meet their higher performance goals with the same IF550 Ceph-based architecture without disturbing the existing infrastructure
• The Ceph cluster on HDD will co-exist with the IF550 Ceph cluster; applications get deployed on either based on performance needs
• Initial workloads migrated: Splunk log analytics, Apache Kafka messaging; Cassandra is the next target
• Adding performance in a lower-TCO footprint: 10x 128TB IF500, expanding to 256TB in the next phase; expected to reduce real estate to less than 1/3; >50% power reduction expected
US Cable Services Provider – OpenStack Private Cloud on InfiniFlash IF500
20
High-Performance OpenStack Private Cloud Infrastructure on InfiniFlash
• Cinder storage driven by Ceph on IF550; latency-sensitive workloads migrated first
• Splunk log analytics and Kafka messaging run on InfiniFlash
• 10x performance over HDD-based Ceph; co-exists with lower-performance HDD-based Ceph
• Highly scalable with lowest TCO
• 2-copy model – reduced from the 3-copy HDD model
• Ease of deployment with reduced footprint, power, and thermal management
[Diagram: Compute farm of QEMU/KVM hosts (librbd) running Splunk and Kafka against LUNs, backed over SAS by a storage farm of InfiniFlash HSEB A/HSEB B pairs]
OpenStack Private Cloud on IF550 – cont'd
InfiniFlash Hybrid Storage Solutions
InfiniFlash and HDD array chassis (e.g., WD 4U60) co-exist in a single cluster, serving separate storage tiers
Tiered storage pools can be accessed by the same application or by independent applications
Enables application migration between storage tiers
• InfiniFlash holds the primary copy with higher affinity and serves most of the reads
• HDD arrays with SSD journals serve the replicas, which lowers write performance
• The performance degradation is mitigated by the SSD journals and by the sequential write workloads the HDD arrays see
[Diagram: Applications with Ceph clients issue read+write I/O to a Tier 1 storage pool (high performance, InfiniFlash primary copy) and a Tier 2 storage pool (low performance, HDD array replicas), each HDD array fronted by a journal SSD]
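The "higher affinity" mechanism described here is standard Ceph primary affinity. Illustrative commands might look like the following sketch; the OSD IDs are placeholders, the commands need a live cluster, and exact syntax varies by release:

```
# Allow primary-affinity tuning (Hammer-era monitor option)
ceph tell mon.* injectargs '--mon-osd-allow-primary-affinity=true'

# Bias primaries onto the flash OSDs and away from the HDD OSDs,
# so InfiniFlash serves most reads while HDDs hold replicas
ceph osd primary-affinity osd.0 1.0   # flash OSD: preferred primary
ceph osd primary-affinity osd.8 0.0   # HDD OSD: replica only
```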
IF550 – Enhancing Ceph for Enterprise Consumption
IF550 provides usability and performance utilities without sacrificing open-source principles
• SanDisk Ceph distro ensures packaging with stable, production-ready code of consistent quality
• All Ceph performance improvements developed by SanDisk are contributed back to the community
22
SanDisk distribution or community distribution
• Out-of-the-box configurations tuned for performance with flash
• Sizing & planning tool
• InfiniFlash drive management integrated into Ceph management (coming soon)
• Ceph installer built specifically for InfiniFlash
• High-performance iSCSI storage
• Better diagnostics with a log-collection tool
• Enterprise-hardened SW + HW QA
23
There is a SMART attribute, “Lifetime LBAs written”, which can be obtained through IFCLI commands and monitored periodically to see whether it is approaching the MAX LBA limit (also obtained through ifcli). Here is a sample output from a drive:
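A monitoring check built on those two values could look like the sketch below. Since the IFCLI output format is not shown in the source, the function takes the two numbers directly rather than parsing real output, and the thresholds are placeholders:

```python
def life_used_fraction(lifetime_lbas_written: int, max_lba_limit: int) -> float:
    """Fraction of rated write endurance consumed.

    Uses the two values the text says IFCLI exposes: 'Lifetime LBAs
    written' and the drive's MAX LBA limit.  Monitoring this ratio
    periodically shows how close a flash card is to its endurance limit.
    """
    return lifetime_lbas_written / max_lba_limit

# Hypothetical drive: 1.2e12 LBAs written of an 8e12 limit -> 15% used
used = life_used_fraction(1_200_000_000_000, 8_000_000_000_000)
assert abs(used - 0.15) < 1e-9
if used > 0.90:  # placeholder alert threshold
    print("drive approaching endurance limit")
```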
24
25
The InfiniFlash™ System
26
64-512 TB JBOD of flash in 3U
Up to 2M IOPS, <1ms latency, up to 15 GB/s throughput
Energy efficient: ~400W power draw
Connect up to 8 servers
Simple yet Scalable
27
InfiniFlash IF100
8TB flash-card innovation
• Enterprise-grade power-fail safety
• Latching integrated & monitored
• Directly samples air temperature
• Form factor enables lowest-cost SSD
Non-disruptive scale-up & scale-out
• Capacity on demand – serve high-growth big data; 3U chassis starting at 64TB, up to 512TB; 8 to 64 8TB flash cards (SAS)
• Compute on demand – serve dynamic apps without IOPS/TB bottlenecks; add up to 8 servers
Flash card performance**
• Read throughput > 500MB/s; write throughput > 300MB/s
• Read IOPS > 20K; random write IOPS @4K > 15K
Flash card integration
• Alerts and monitoring; latching integrated and monitored; integrated air-temperature sampling
InfiniFlash System
• Capacity: 512TB* raw all-flash 3U storage system; 64 x 8TB flash cards with power-fail protection; 8 SAS ports total
• Operational efficiency and resilience: hot-swappable components, easy FRU; low power, 450W (avg) / 750W (active); MTBF 1.5+ million hours
• Scalable performance: 2M IOPS; 6–15GB/s throughput
InfiniFlash TCO Advantage
[Chart: 3-year TCO comparison* (TCA + 3-year opex, $0–$80M) and total rack count for four configurations – Traditional ObjStore on HDD; IF500 ObjStore w/ 3 full replicas on flash; IF500 w/ EC, all flash; IF500 with flash primary & HDD copies]
Reduce the replica count with the higher reliability of flash
• 2 copies on InfiniFlash vs. 3 copies on HDD
InfiniFlash's disaggregated architecture reduces compute usage, thereby reducing HW & SW costs
• Flash allows the use of an erasure-coded storage pool without performance limitations
• Protection equivalent to 2x storage with only 1.2x storage
Power, real estate, and maintenance cost savings over a 5-year TCO
* TCO analysis based on a US customer’s OPEX & Cost data for a 100PB deployment
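The capacity arithmetic behind these claims is easy to verify. The sketch below assumes a k=10, m=2 erasure-coding profile, which is one common way to get the quoted 1.2x overhead (the slide does not name the exact profile):

```python
def raw_capacity_needed(usable_tb: float, scheme: str) -> float:
    """Raw storage needed for a given usable capacity under the data
    protection schemes compared on the slide (capacity only, a toy model
    that ignores performance and rebuild costs)."""
    overhead = {
        "3-replica": 3.0,   # classic HDD-era default
        "2-replica": 2.0,   # feasible with flash reliability
        "ec-10-2":   1.2,   # erasure coding: (10 data + 2 parity) / 10
    }
    return usable_tb * overhead[scheme]

# 100TB usable: EC needs 120TB raw vs. 300TB for 3-replica HDD
assert abs(raw_capacity_needed(100, "ec-10-2") - 120.0) < 1e-9
# Dropping from 3 to 2 replicas alone saves a third of the raw capacity
assert raw_capacity_needed(100, "2-replica") < raw_capacity_needed(100, "3-replica")
```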
29
Open Source with SanDisk Advantage – Enterprise Level Hardened Ceph
Innovation and speed of open source, with the trustworthiness of enterprise-grade, web-scale testing and hardware optimization
Performance optimization for flash and hardware tuning
Hardened and tested for Hyperscale deployments and workloads
Risk mitigation through long term support and a reliable long term roadmap
Continual contribution back to the community
Enterprise Level Hardening
Testing at Hyperscale
Failure Testing
9,000 hours of cumulative IO tests
1,100+ unique test cases
1,000 hours of cluster rebalancing tests
1,000 hours of IO on iSCSI
Clusters of over 100 server nodes
Over 10PB of flash storage
2,000-cycle node reboot
1,000 abrupt node power cycles
1,000 storage failures
1,000 network failures
IO for 2,500 hours at a stretch
31
Thank You! @BigDataFlash #bigdataflash
©2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash is a trademark of SanDisk Enterprise IP LLC. All other product and company names are used for identification purposes and may be trademarks of their respective holder(s).
32
NEWSTORE GOALS
• More natural transaction atomicity
• Avoid double writes
• Efficient object enumeration
• Efficient clone operation
• Efficient splice (“move these bytes from object X to object Y”)
• Efficient IO pattern for HDDs, SSDs, NVMe
• Minimal locking, maximum parallelism (between PGs)
• Advanced features: full data and metadata checksums, compression
33
We can't overwrite a POSIX file as part of an atomic transaction
Writing overwrite data to a new file means many files for each object
Write-ahead logging?
• Put overwrite data in “WAL” records in RocksDB
• Commit atomically with the transaction
• Then overwrite the original file data
• But we're back to a double write of overwrites... performance suffers again
• Overwrites dominate RBD block workloads
NEWSTORE FAIL: ATOMICITY NEEDS WAL
34
ROCKSDB: JOURNAL RECYCLING (2)
• Put old log files on a recycle list (instead of deleting them)
• LogWriter: overwrite old log data with new log data; include the log number in each record
• LogReader: stop replaying when we get garbage (bad CRC), or when we get a valid CRC but the record is from a previous log incarnation
• Now one log append → one IO!
• Upstream in RocksDB! (but missing a bug fix, PR #881)
• Works with normal file-based storage, or BlueFS
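The LogWriter/LogReader rules above can be sketched as a small simulation. This is a toy model of the replay-termination logic, not RocksDB's actual record format:

```python
import zlib
from typing import List, NamedTuple

class Record(NamedTuple):
    log_number: int   # incarnation of the log this record belongs to
    crc: int          # checksum over the payload
    payload: bytes

def append(log: List[Record], log_number: int, payload: bytes) -> None:
    """Overwrite-in-place append: each record carries the current log
    number and a CRC, so one append costs exactly one IO (no file
    creation or deletion in the write path)."""
    log.append(Record(log_number, zlib.crc32(payload), payload))

def replay(log: List[Record], current_log_number: int) -> List[bytes]:
    """Replay until the first record that is garbage (bad CRC) or that
    is valid but stamped with a previous incarnation, i.e. stale data
    left over from the recycled file."""
    out = []
    for rec in log:
        if zlib.crc32(rec.payload) != rec.crc:
            break                              # torn/garbage tail
        if rec.log_number != current_log_number:
            break                              # leftover from old log
        out.append(rec.payload)
    return out

# Recycled file: two fresh records, then a stale (valid-CRC) record
# from the previous incarnation, which must terminate replay.
log: List[Record] = []
append(log, 7, b"put k1")
append(log, 7, b"put k2")
log.append(Record(6, zlib.crc32(b"old"), b"old"))
assert replay(log, current_log_number=7) == [b"put k1", b"put k2"]
```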