ceph performance optimization.pptx
TRANSCRIPT
1
Ceph on All-Flash Storage – Breaking Performance Barriers
Axel Rosenberg, Director ISV & Partner EMEA
June 2016
Old Model
• Monolithic, large upfront investments, and fork-lift upgrades
• Proprietary storage OS
• Costly: $$$$$
New SD-AFS Model
• Disaggregate storage, compute, and software for better scaling and costs
• Best-in-class solution components
• Open source software – no vendor lock-in
• Cost-efficient: $
Software-Defined All-Flash Storage – The Disaggregated Model for Scale
Data Center Solutions 3
Software-Defined All-Flash Storage – The Disaggregated Model for Scale
Services
Software
Compute
Networking
Storage – SanDisk Flash
4
SanDisk FlashStart™ and FlashAssure™
• Installation and training services
• 24/7 global onsite TSANET collaborative solutions support
• 2-hour parts delivery – 750+ global locations
Software-Defined All-Flash Storage – The Disaggregated Model for Scale
Services / Software / Compute / Networking / Storage
SanDisk Flash:
• Shared flash storage – INFINIFLASH
• Flash in server
Compute choice
InfiniFlash IF550 All-Flash Storage System – Block and Object Storage Powered by Ceph
Ultra-dense, high-capacity flash storage
• 512TB in 3U; scale-out software for PB-scale capacity
Highly scalable performance
• Industry-leading IOPS/TB
Cinder, Glance and Swift storage
• Add/remove servers & capacity on demand
Enterprise-class storage features
• Automatic rebalancing
• Hot software upgrade
• Snapshots, replication, thin provisioning
• Fully hot-swappable, redundant
Ceph optimized for SanDisk flash
• Tuned & hardened for InfiniFlash
InfiniFlash SW + HW Advantage
Software Storage System
Software tuned for hardware
• Ceph modifications for flash
• Both Ceph and the host OS tuned for InfiniFlash
• SW defects that impact flash identified & mitigated
Hardware configured for software
• Right balance of CPU, RAM, storage
• Rack-level designs for optimal performance & cost
Software designed for all systems does not work well with any system: Ceph has over 50 tuning parameters that together yield a 5x–6x performance improvement
Fixed-CPU/RAM hyperconverged nodes do not work well for all workloads
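As an illustration of the kind of flash-oriented tuning the slide refers to, a hypothetical ceph.conf fragment is sketched below. The option names are real Hammer-era Ceph parameters, but the values are placeholders for illustration, not SanDisk's actual settings:

```ini
[osd]
# Shard the op queue across more worker threads so one hot lock
# doesn't serialize the OSD at flash rates
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2
# Deeper FileStore queues keep the flash device busy
filestore_queue_max_ops = 5000
filestore_op_threads = 8
# Larger journal batches amortize per-IO overhead
journal_max_write_entries = 1000
journal_queue_max_ops = 3000
```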
7
Began in summer of ’13 with the Ceph Dumpling release
Ceph was optimized for HDD
• Tuning AND algorithm changes needed for flash optimization
• Leave defaults for HDD
Quickly determined that the OSD was the major bottleneck
• OSD maxed out at about 1,000 IOPS on the fastest CPUs (using ~4.5 cores)
Examined and rejected multiple OSDs per SSD
• Failure domain / CRUSH rules would be a nightmare
Optimizing Ceph for the All-Flash Future
8
Context switches matter at flash rates
• Too much “put it in a queue for another thread”
• Too much lock contention
Socket handling matters too!
• Too many “get 1 byte” calls to the kernel for sockets
• Disable Nagle’s algorithm to shorten operation latency
Lots of other simple things
• Eliminate repeated look-ups in maps, caches, etc.
• Eliminate redundant string copies (especially returning strings by value)
• Large variables were passed by value instead of by const reference
Contributed improvements to the Emperor, Firefly and Giant releases
Now obtain >80K IOPS per OSD using around 9 CPU cores (Hammer)*
SanDisk: OSD Read path Optimization
* Internal testing normalized from 3 OSDs / 132GB DRAM / 8 Clients / 2.2 GHz XEON 2x8 Cores / Optimus Max SSDs
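Ceph disables Nagle's algorithm in its C++ messenger (the `ms tcp nodelay` option, on by default in modern releases). The socket-level mechanism can be shown with a minimal Python sketch; this is an illustration of the TCP option, not Ceph's actual networking code:

```python
import socket

def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled.

    With Nagle enabled, small writes (like message headers) can be held
    back waiting for an ACK on previously sent data, adding latency to
    request/response traffic.  TCP_NODELAY sends small segments at once.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s

s = make_low_latency_socket()
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```

The same idea applies to the "get 1 byte" complaint: reading messages in large buffered chunks instead of many tiny `recv` calls cuts kernel-crossing overhead.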
9
The write path strategy was classic HDD
• Journal writes for minimum foreground latency
• Process the journal in batches in the background
The batch-oriented processing was very inefficient on flash
Modified the buffering/writing strategy for flash
• Recently committed to the Jewel release
• Yields a 2.5x write throughput improvement over Hammer
• Average latency is half that of Hammer
SanDisk: OSD Write path Optimization
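The double-write cost of the classic journaled path can be captured in a toy model. This is a simplification for illustration (data bytes only, ignoring metadata and batching effects), not a model of the actual Jewel change:

```python
def device_writes(client_bytes: int, journaled: bool) -> int:
    """Bytes that reach the flash device for a given amount of client data.

    The classic FileStore path writes everything twice: once to the
    journal (for low foreground latency) and once to the backing store
    when the background batch is flushed.  A path that commits data once
    halves the device traffic, which also halves write amplification.
    """
    return 2 * client_bytes if journaled else client_bytes

# Journaled path costs twice the device writes of a single-commit path.
assert device_writes(4096, journaled=True) == 2 * device_writes(4096, journaled=False)
```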
10
RDMA intra-cluster communication
• Significant reduction in CPU per IOP
BlueStore
• Significant reduction in write amplification -> even higher write performance
Memory allocation
• tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs. default*
Erasure coding for blocks (native)
* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
SanDisk: Potential Future Improvements
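The allocator tuning mentioned above is commonly applied through the environment rather than ceph.conf. A hypothetical fragment is sketched below; the tcmalloc variable is a real tcmalloc environment knob, but the file path and value are illustrative and vary by distro:

```ini
# /etc/sysconfig/ceph (path and value are illustrative)
# Raise tcmalloc's aggregate thread-cache size so OSD worker threads
# stop contending on the central free list under high IOPS:
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
```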
InfiniFlash Ceph Performance – Measured on a 256TB Cluster
Test Cluster Topology – 8-node cluster with InfiniFlash, 256TB
Sandisk Confidential
Test configuration (8-node cluster, 8 drives shared to each OSD node):
• OSD nodes: 8 servers (SuperMicro), 2x E5-2697 v3 (14C, 2.6GHz), 8x 16GB DDR4 ECC 2133MHz, 1x Mellanox ConnectX-3 dual 40GbE
• Clients: 8–10 servers (Dell R620), 2x E5-2680 (10C, 2.8GHz, 25M cache), 4x 16GB RDIMM dual-rank x4 (64GB), 1x Mellanox ConnectX-3 dual 40GbE
• Storage: InfiniFlash, connected fully populated in A8 topology; total storage = 4TB x 64 BSSDs = 256TB; FW 2.1.1.0.RC
• Network: 40G switch, Brocade VDX 8770
• OS: Ubuntu 14.04.02 LTS 64-bit, kernel 3.16.0-70-generic; LSI SAS3008 card/driver; Mellanox ConnectX-3 40Gbps NIC
• Cluster: Ceph version 0.94.5, sndk-ifos-1.2.1.21
• Replication: 2, host-level
• Pools/PGs/RBDs: 10 pools, 1,536 PGs per pool, 1 RBD from each pool (note: can change for various hardware configurations)
• RBD size: 2TB
• Monitors: 1 (3 or more recommended for deployment)
• OSD nodes: 8; OSDs per node: 8; total OSDs = 8 x 8 = 64
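One way to sanity-check this layout is the usual PGs-per-OSD arithmetic (the ~100–200 PGs/OSD community guideline mentioned in the comment below is general Ceph context, not from the slide):

```python
# PGs per OSD = pools * PGs-per-pool * replicas / OSDs, using the
# cluster configuration from the table above.
pools = 10
pgs_per_pool = 1536
replicas = 2
osds = 64  # 8 nodes * 8 OSDs each

pgs_per_osd = pools * pgs_per_pool * replicas / osds
assert pgs_per_osd == 480  # well above the usual ~100-200 guideline,
                           # consistent with the "can change" caveat
```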
IFOS Performance: librbd
Environment:
• librbd IO performance is collected on an 8-node cluster zoned to a single InfiniFlash chassis with replication of 2
• 10 x 2TB RBD images from 10 pools, with 10 client machines
• Cluster capacity: 40TB (10 RBDs x 2TB each x 2-way replication)
• Raw precondition: drives preconditioned for 24 hours
• Cluster precondition: 256K sequential write IO to fill the images
• Performance measured using fio with the rbd engine; each fio instance with 10 jobs and 40 shards
• OSD nodes with cluster public & private networks
Tools:
• fio 2.2
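A fio job file roughly matching the profile described above might look like the following sketch. The pool name, image name, client name and queue depth are placeholders, not the values used in the actual test:

```ini
; Illustrative fio job: librbd random read, 10 jobs per instance
[global]
ioengine=rbd
clientname=admin
pool=rbd_pool_0
rbdname=rbd_image_0
direct=1
rw=randread
bs=4k
iodepth=32
numjobs=10
group_reporting=1

[rbd-randread]
```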
IF500 LIBRBD IO PERFORMANCE
Analysis
• Ceph has been tuned to achieve 75% of bare-metal performance
• All 8 hosts' CPUs saturated for 4K random read; more performance potential with higher CPU cycles
• With 64K and larger IO sizes we are able to utilize the full InfiniFlash bandwidth of over 12GB/s
• librbd and krbd performance are comparable
[Chart: Random Read – Total IOPS and bandwidth (GB/s) by block size. 4K: 1,557,264 IOPS (6.2 GB/s); 64K: 207,739 IOPS (13.2 GB/s); 256K: 48,298 IOPS (12.3 GB/s)]
[Chart: Random Write – Total IOPS and bandwidth (GB/s) by block size. 4K: 174,021 IOPS; 64K: 44,571 IOPS (2.8 GB/s); 256K: 12,203 IOPS (~3 GB/s)]
IFOS Performance: IOPS/Latency/QD
4K random read: avg latency (µsec); 4K random write: avg latency (µsec)
Environment
• librbd IO read latency measured on a single InfiniFlash chassis (64 BSSDs) with 2-way replication at host level, 8 OSD nodes
• fio read IO profile: 4K block, numjobs=4, 10 clients
• fio write IO profile: 4K block, numjobs=4, 6 clients
[Chart: 4K random read – Total IOPS and avg latency (µs) vs. IO depth per client (1–128), numjobs=4, 10 clients]
[Chart: 4K random write – Total IOPS and avg latency (µs) vs. IO depth per client (1–128), numjobs=4, 6 clients]
Average latency: 2.1ms (read), 0.67ms (write)
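Little's law ties these latency figures to throughput: sustained IOPS equals outstanding IOs divided by average latency. A quick sketch (the queue depth of 64 here is picked for illustration, not read off the chart):

```python
def implied_iops(clients: int, numjobs: int, iodepth: int,
                 avg_latency_s: float) -> float:
    """Little's law: throughput = concurrency / latency.

    Total concurrency is the number of IOs kept in flight across all
    clients and jobs; dividing by average completion latency gives the
    steady-state IOPS the cluster must be sustaining.
    """
    return clients * numjobs * iodepth / avg_latency_s

# Hypothetical point: 10 clients * 4 jobs * QD 64 at 2.1 ms avg latency
assert round(implied_iops(10, 4, 64, 0.0021)) == 1219048
```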
InfiniFlash for OpenStack with Dis-Aggregation
Compute & Storage Disaggregation enables Optimal Resource utilization
Allows for more CPU usage required for OSDs with small Block workloads
Allows for higher bandwidth provisioning as required for large Object workload
Independent scaling of compute and storage – higher storage capacity needs don't force you to add more compute, and vice-versa
Leads to optimal ROI for PB-scale OpenStack deployments
[Diagram: Compute farm – iSCSI targets (KRBD) serving iSCSI LUNs, web servers (RGW) serving a Swift object store, and QEMU/KVM hosts (librbd) running Nova with Cinder & Glance – connected over SAS to a storage farm of InfiniFlash HSEB A/HSEB B pairs running the OSDs]
Customer example: Running Ceph in an OpenStack environment. It's general-purpose. One of the apps runs TV and movie listings for pay-per-view. Each set-top box talks to a VM when it needs updating (next page of listings, etc.) – their STBs are interactive with the OpenStack cloud.
19
Before
• Customer deployed an OpenStack-based private cloud on RH Ceph
Pain points
• Cinder storage based on Ceph on HDD could not get over 10K IOPS (8K blocks)
• Limiting private cloud expansion to include higher-performance applications
• Solutions considered before the IF550 did not work: alternate solutions with separate storage architectures for high-performance applications add significant cost and management overhead, defeating the purpose of OpenStack
With IF550
• Able to meet their higher performance goals with the same IF550 Ceph-based architecture without disturbing the existing infrastructure
• The Ceph cluster on HDD will co-exist with the IF550 Ceph cluster; applications get deployed on either based on performance needs
• Initial workloads migrated: Splunk log analytics, Apache Kafka messaging; Cassandra is the next target
• Adding performance in a lower-TCO footprint: 10x 128TB IF500, expanding to 256TB in the next phase; expected to reduce real estate to less than 1/3; >50% power reduction expected
US Cable Services Provider – OpenStack Private Cloud on InfiniFlash IF500
20
High-Performance OpenStack Private Cloud Infrastructure on InfiniFlash
• Cinder storage driven by Ceph on IF550; latency-sensitive workloads migrated first
• Splunk log analytics and Kafka messaging run on InfiniFlash
• 10x performance over HDD-based Ceph; co-exists with lower-performance HDD-based Ceph
• Highly scalable with lowest TCO
• 2-copy model – reduced from the 3-copy HDD model
• Ease of deployment with reduced footprint, power, and thermal management
[Diagram: Compute farm of QEMU/KVM hosts (librbd) running Splunk and Kafka against LUNs, backed over SAS by a storage farm of InfiniFlash HSEB A/HSEB B pairs]
OpenStack Private Cloud on IF550 – cont'd
InfiniFlash Hybrid Storage Solutions
InfiniFlash and HDD array chassis (e.g., WD 4U60) co-exist in a single cluster, serving separate storage tiers
Tiered storage pools can be accessed by the same application or by independent applications
Enables application migration between storage tiers
• InfiniFlash holds the primary copy with higher affinity and serves most of the reads
• HDD arrays with SSD journals serve the replicas, which lowers write performance
• The performance degradation is mitigated by the SSD journals and by the sequential write workloads the HDD arrays see
[Diagram: Applications with Ceph clients issue read+write I/O to a Tier 1 storage pool (high performance, InfiniFlash primary copy) and a Tier 2 storage pool (low performance, HDD array replicas), each HDD array fronted by a journal SSD]
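The "higher affinity" mechanism described here is standard Ceph primary affinity. Illustrative commands might look like the following sketch; the OSD IDs are placeholders, the commands need a live cluster, and exact syntax varies by release:

```
# Allow primary-affinity tuning (Hammer-era monitor option)
ceph tell mon.* injectargs '--mon-osd-allow-primary-affinity=true'

# Bias primaries onto the flash OSDs and away from the HDD OSDs,
# so InfiniFlash serves most reads while HDDs hold replicas
ceph osd primary-affinity osd.0 1.0   # flash OSD: preferred primary
ceph osd primary-affinity osd.8 0.0   # HDD OSD: replica only
```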
IF550 – Enhancing Ceph for Enterprise Consumption
IF550 provides usability and performance utilities without sacrificing open-source principles
• SanDisk Ceph distro ensures packaging with stable, production-ready code of consistent quality
• All Ceph performance improvements developed by SanDisk are contributed back to the community
22
SanDisk distribution or community distribution
• Out-of-the-box configurations tuned for performance with flash
• Sizing & planning tool
• InfiniFlash drive management integrated into Ceph management (coming soon)
• Ceph installer built specifically for InfiniFlash
• High-performance iSCSI storage
• Better diagnostics with a log-collection tool
• Enterprise-hardened SW + HW QA
23
There is a SMART attribute, “Lifetime LBAs written”, which can be obtained through IFCLI commands and monitored periodically to see whether it is approaching the MAX LBA limit (also obtained through ifcli). Here is a sample output from a drive:
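A monitoring check built on those two values could look like the sketch below. Since the IFCLI output format is not shown in the source, the function takes the two numbers directly rather than parsing real output, and the thresholds are placeholders:

```python
def life_used_fraction(lifetime_lbas_written: int, max_lba_limit: int) -> float:
    """Fraction of rated write endurance consumed.

    Uses the two values the text says IFCLI exposes: 'Lifetime LBAs
    written' and the drive's MAX LBA limit.  Monitoring this ratio
    periodically shows how close a flash card is to its endurance limit.
    """
    return lifetime_lbas_written / max_lba_limit

# Hypothetical drive: 1.2e12 LBAs written of an 8e12 limit -> 15% used
used = life_used_fraction(1_200_000_000_000, 8_000_000_000_000)
assert abs(used - 0.15) < 1e-9
if used > 0.90:  # placeholder alert threshold
    print("drive approaching endurance limit")
```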
24
25
The InfiniFlash™ System
26
64-512 TB JBOD of flash in 3U
Up to 2M IOPS, <1ms latency, up to 15 GB/s throughput
Energy efficient: ~400W power draw
Connect up to 8 servers
Simple yet Scalable
27
InfiniFlash IF100
8TB flash-card innovation
• Enterprise-grade power-fail safety
• Latching integrated & monitored
• Directly samples air temperature
• Form factor enables lowest-cost SSD
Non-disruptive scale-up & scale-out
• Capacity on demand – serve high-growth big data; 3U chassis starting at 64TB, up to 512TB; 8 to 64 8TB flash cards (SAS)
• Compute on demand – serve dynamic apps without IOPS/TB bottlenecks; add up to 8 servers
Flash card performance**
• Read throughput > 500MB/s; write throughput > 300MB/s
• Read IOPS > 20K; random write IOPS @4K > 15K
Flash card integration
• Alerts and monitoring; latching integrated and monitored; integrated air-temperature sampling
InfiniFlash System
• Capacity: 512TB* raw all-flash 3U storage system; 64 x 8TB flash cards with power-fail protection; 8 SAS ports total
• Operational efficiency and resilience: hot-swappable components, easy FRU; low power, 450W (avg) / 750W (active); MTBF 1.5+ million hours
• Scalable performance: 2M IOPS; 6–15GB/s throughput
InfiniFlash TCO Advantage
[Chart: 3-year TCO comparison* (TCA + 3-year opex, $0–$80M) and total rack count for four configurations – Traditional ObjStore on HDD; IF500 ObjStore w/ 3 full replicas on flash; IF500 w/ EC, all flash; IF500 with flash primary & HDD copies]
Reduce the replica count with the higher reliability of flash
• 2 copies on InfiniFlash vs. 3 copies on HDD
InfiniFlash's disaggregated architecture reduces compute usage, thereby reducing HW & SW costs
• Flash allows the use of an erasure-coded storage pool without performance limitations
• Protection equivalent to 2x storage with only 1.2x storage
Power, real estate, and maintenance cost savings over a 5-year TCO
* TCO analysis based on a US customer’s OPEX & Cost data for a 100PB deployment
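The capacity arithmetic behind these claims is easy to verify. The sketch below assumes a k=10, m=2 erasure-coding profile, which is one common way to get the quoted 1.2x overhead (the slide does not name the exact profile):

```python
def raw_capacity_needed(usable_tb: float, scheme: str) -> float:
    """Raw storage needed for a given usable capacity under the data
    protection schemes compared on the slide (capacity only, a toy model
    that ignores performance and rebuild costs)."""
    overhead = {
        "3-replica": 3.0,   # classic HDD-era default
        "2-replica": 2.0,   # feasible with flash reliability
        "ec-10-2":   1.2,   # erasure coding: (10 data + 2 parity) / 10
    }
    return usable_tb * overhead[scheme]

# 100TB usable: EC needs 120TB raw vs. 300TB for 3-replica HDD
assert abs(raw_capacity_needed(100, "ec-10-2") - 120.0) < 1e-9
# Dropping from 3 to 2 replicas alone saves a third of the raw capacity
assert raw_capacity_needed(100, "2-replica") < raw_capacity_needed(100, "3-replica")
```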
29
Open Source with SanDisk Advantage – Enterprise Level Hardened Ceph
Innovation and speed of open source, with the trustworthiness of enterprise-grade, web-scale testing and hardware optimization
Performance optimization for flash and hardware tuning
Hardened and tested for Hyperscale deployments and workloads
Risk mitigation through long term support and a reliable long term roadmap
Continual contribution back to the community
Enterprise Level Hardening
Testing at Hyperscale
Failure Testing
9,000 hours of cumulative IO tests
1,100+ unique test cases
1,000 hours of cluster rebalancing tests
1,000 hours of IO on iSCSI
Clusters of over 100 server nodes
Over 10PB of flash storage
2,000-cycle node reboot
1,000 abrupt node power cycles
1,000 storage failures
1,000 network failures
IO for 2,500 hours at a stretch
31
Thank You! @BigDataFlash #bigdataflash
©2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash is a trademark of SanDisk Enterprise IP LLC. All other product and company names are used for identification purposes and may be trademarks of their respective holder(s).
32
NEWSTORE GOALS
• More natural transaction atomicity
• Avoid double writes
• Efficient object enumeration
• Efficient clone operation
• Efficient splice (“move these bytes from object X to object Y”)
• Efficient IO pattern for HDDs, SSDs, NVMe
• Minimal locking, maximum parallelism (between PGs)
• Advanced features: full data and metadata checksums, compression
33
We can't overwrite a POSIX file as part of an atomic transaction
Writing overwrite data to a new file means many files for each object
Write-ahead logging?
• Put overwrite data in “WAL” records in RocksDB
• Commit atomically with the transaction
• Then overwrite the original file data
• But we're back to a double write of overwrites... performance suffers again
• Overwrites dominate RBD block workloads
NEWSTORE FAIL: ATOMICITY NEEDS WAL
34
ROCKSDB: JOURNAL RECYCLING (2)
• Put old log files on a recycle list (instead of deleting them)
• LogWriter: overwrite old log data with new log data; include the log number in each record
• LogReader: stop replaying when we get garbage (bad CRC), or when we get a valid CRC but the record is from a previous log incarnation
• Now one log append → one IO!
• Upstream in RocksDB! (but missing a bug fix, PR #881)
• Works with normal file-based storage, or BlueFS
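The LogWriter/LogReader rules above can be sketched as a small simulation. This is a toy model of the replay-termination logic, not RocksDB's actual record format:

```python
import zlib
from typing import List, NamedTuple

class Record(NamedTuple):
    log_number: int   # incarnation of the log this record belongs to
    crc: int          # checksum over the payload
    payload: bytes

def append(log: List[Record], log_number: int, payload: bytes) -> None:
    """Overwrite-in-place append: each record carries the current log
    number and a CRC, so one append costs exactly one IO (no file
    creation or deletion in the write path)."""
    log.append(Record(log_number, zlib.crc32(payload), payload))

def replay(log: List[Record], current_log_number: int) -> List[bytes]:
    """Replay until the first record that is garbage (bad CRC) or that
    is valid but stamped with a previous incarnation, i.e. stale data
    left over from the recycled file."""
    out = []
    for rec in log:
        if zlib.crc32(rec.payload) != rec.crc:
            break                              # torn/garbage tail
        if rec.log_number != current_log_number:
            break                              # leftover from old log
        out.append(rec.payload)
    return out

# Recycled file: two fresh records, then a stale (valid-CRC) record
# from the previous incarnation, which must terminate replay.
log: List[Record] = []
append(log, 7, b"put k1")
append(log, 7, b"put k2")
log.append(Record(6, zlib.crc32(b"old"), b"old"))
assert replay(log, current_log_number=7) == [b"put k1", b"put k2"]
```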