Ceph Deployment at Target: Customer Spotlight
TRANSCRIPT
Agenda
Welcome to Surly!
• Introduction
• Ceph @ Target
• Initial FAIL
• Problems Faced
• Solutions
• Current Implementation
• Future Direction
• Questions
Introduction
Will Boege
Lead Technical Architect, Enterprise Private Cloud Engineering
Target's first Ceph environment went live in October of 2014
• "Firefly" release
• Ceph was backing Target's first 'official' Openstack release
  • Icehouse based
• Ceph is used for:
  • RBD for Openstack instances and volumes
  • RADOSGW for object storage (instead of Swift)
  • RBD backing Ceilometer MongoDB volumes

Replaced the traditional array-based approach that was implemented in our prototype Havana environment.
• Traditional storage model was problematic to integrate
• General desire at Target to move towards open solutions
• Ceph's tight integration with Openstack was a huge selling point
Initial Ceph Deployment:
• Dev + 2 Prod regions
• 3 x monitor nodes – Cisco B200
• 12 x OSD nodes – Cisco C240 LFF
  • 12 x 4TB SATA disks
  • 10 OSD per server
  • Journal partition co-located on each OSD disk
  • 120 OSD total = ~400 TB
• 2 x 10GbE per host
  • 1 public_network
  • 1 cluster_network
• Basic LSI 'MegaRaid' controller – SAS 2008M-8i
  • No supercap or cache capability onboard
  • 10 x RAID0
Initial Rollout
Post rollout it became evident that there were performance issues within the environment.
• KitchenCI users would complain of slow Chef converge times
• Yum transactions / app deployments would take abnormal amounts of time to complete
• Instance boot times, especially for cloud-init images, would take an excessively long time, sometimes timing out
• General user griping about 'slowness'
• Unacceptable levels of latency even while the cluster was relatively unworked
• High levels of CPU IOWait%
• Poor IOPS / latency – FIO benchmarks running INSIDE Openstack instances:

$ fio --rw=write --ioengine=libaio --runtime=100 --direct=1 --bs=4k --size=10G --iodepth=32 --name=/tmp/testfile.bin

test: (groupid=0, jobs=1): err= 0: pid=1914
read : io=1542.5MB, bw=452383 B/s, iops=110, runt=3575104msec
write: io=527036KB, bw=150956 B/s, iops=36, runt=3575104msec
Initial Rollout
Compounding the performance issues we began to see mysterious reliability issues.
• OSDs would randomly fall offline
• Cluster would enter a HEALTH_ERR state about once a week with 'unfound objects' and/or inconsistent placement groups that required manual intervention to fix
• These problems were usually coupled with a large drop in our already suspect performance levels
• Cluster would enter a recovery state, often bringing client performance to a standstill
Initial Rollout
Customer opinion of our Openstack deployment due to Ceph…..
…which leads to perception of the team....
Maybe Ceph isn’t the right solution?
What could we have done differently??
• Hardware
• Monitoring
• Tuning
• User Feedback
Ceph is not magic. It does the best with the hardware you give it!
There is much ill-advised advice floating around that if you just throw enough crappy disks at Ceph you will achieve enterprise-grade performance. Garbage in – garbage out. Don't be greedy and build for capacity if your objective is to create a more performant block storage solution.

Fewer better disks > more 'cheap' disks
….depending on your use case.
Hardware
• Root cause of HEALTH_ERRs was "unnamed vendor's" SATA drives in our solution 'soft-failing' – slowly gaining media errors without reporting themselves as failed. Don't rely on SMART. Interrogate your disks with an array-level tool, like MegaCli, to identify drives for proactive replacement.

$ /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep Media
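That one-liner can be wrapped into a small proactive check. A sketch, assuming the usual MegaCli install path and its "Slot Number" / "Media Error Count" output fields:

```shell
# Sketch: flag any physical drive whose MegaRAID media error count is
# non-zero so it can be replaced proactively. Path is platform-dependent.
MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64

flag_media_errors() {
  # Reads MegaCli -PDList output on stdin; prints slot and count for
  # every drive reporting media errors.
  awk -F': ' '
    /Slot Number/       { slot = $2 }
    /Media Error Count/ { if ($2 + 0 > 0) printf "slot %s: %s media errors\n", slot, $2 }
  '
}

# Only query real hardware when the tool is actually present.
if [ -x "$MEGACLI" ]; then
  "$MEGACLI" -PDList -aALL | flag_media_errors
fi
```

Run from cron and shipped to your monitoring system, this gives you the drive-replacement signal SMART never raised.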
• In installations with co-located journal partitions, a RAID solution with cache+BBU for writeback operation would have been a huge performance gain.
Paying more attention to the suitability of hardware our vendor of choice provided would have saved a lot of headaches
Hardware
Monitor Your Implementation From The Outset!
Ceph provides a wealth of data about the cluster state! Unfortunately, only the most rudimentary data is exposed by the 'regular' documented commands.
Quick Example - Demo Time!
Calamari?? Maybe a decent start … but we quickly outgrew it.
Challenge your monitoring team. If your solution isn't working for you – go SaaS. Chatops FTW. Develop those 'Special Sets Of Skills'!

In monitoring –
ease of collection > depth of feature set
Monitors
require 'rubygems'
require 'dogapi'

api_key = "XXXXXXXXXXXXXXXXXXXXXXXX"

# Count downed OSDs and tag the metric with this cluster's environment
dosd = `ceph osd tree | grep down | wc -l`.to_i
host = `hostname`

if host.include?("ttb")
  envname = "dev"
elsif host.include?("ttc")
  envname = "prod-ttc"
else
  envname = "prod-tte"
end

dog = Dogapi::Client.new(api_key)
dog.emit_point("ceph.osd_down", dosd, :tags => ["env:#{envname}", "app:ceph"])
Monitors
#!/bin/bash

# Generate write results
write_raw=$(fio --randrepeat=1 --ioengine=libaio --direct=1 --name=./test.write --filename=test \
  --bs=4k --iodepth=4 --size=1G --readwrite=randwrite --minimal)

# Generate read results
read_raw=$(fio --randrepeat=1 --ioengine=libaio --direct=1 --name=./test.read --filename=test \
  --bs=4k --iodepth=4 --size=1G --readwrite=randread --minimal)

# Field numbers reference fio's semicolon-delimited --minimal output
writeresult_lat=$(echo $write_raw | awk -F\; '{print $81}')
writeresult_iops=$(echo $write_raw | awk -F\; '{print $49}')
readresult_lat=$(echo $read_raw | awk -F\; '{print $40}')
readresult_iops=$(echo $read_raw | awk -F\; '{print $8}')

ruby ./submit_lat_metrics.rb $writeresult_iops $readresult_iops $writeresult_lat $readresult_lat
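The submit_lat_metrics.rb helper isn't shown in the deck. A hypothetical sketch consistent with the Datadog script above, using only the Ruby standard library and Datadog's HTTP series endpoint (metric names and env vars here are invented for illustration):

```ruby
#!/usr/bin/env ruby
# Hypothetical submit_lat_metrics.rb: posts the four fio numbers
# (write iops, read iops, write latency, read latency) to Datadog.
require 'json'
require 'net/http'
require 'uri'

# Build the Datadog "series" payload for the four benchmark values.
def build_payload(w_iops, r_iops, w_lat, r_lat, envname, now = Time.now.to_i)
  metrics = {
    'ceph.bench.write_iops'    => w_iops,
    'ceph.bench.read_iops'     => r_iops,
    'ceph.bench.write_latency' => w_lat,
    'ceph.bench.read_latency'  => r_lat
  }
  { 'series' => metrics.map { |name, value|
      { 'metric' => name,
        'points' => [[now, value.to_f]],
        'tags'   => ["env:#{envname}", 'app:ceph'] }
    } }
end

# Only submit when an API key is configured.
if __FILE__ == $PROGRAM_NAME && ENV['DD_API_KEY']
  uri = URI("https://app.datadoghq.com/api/v1/series?api_key=#{ENV['DD_API_KEY']}")
  body = JSON.dump(build_payload(*ARGV[0, 4], ENV.fetch('ENVNAME', 'dev')))
  Net::HTTP.post(uri, body, 'Content-Type' => 'application/json')
end
```

The argument order matches the bash wrapper's invocation: write iops, read iops, write latency, read latency.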
Monitors
Tune to your workload!
• This is unique to your specific workloads
But… in general.....
Neuter the default recovery priority:
[osd]
osd_max_backfills = 1
osd_recovery_priority = 1
osd_client_op_priority = 63
osd_recovery_max_active = 1
osd_recovery_max_single_start = 1

Limit the impact of deep scrubbing:
osd_scrub_max_interval = 1209600
osd_scrub_min_interval = 604800
osd_scrub_sleep = .05
osd_snap_trim_sleep = .05
osd_scrub_chunk_max = 5
osd_scrub_chunk_min = 1
osd_deep_scrub_stride = 1048576
osd_deep_scrub_interval = 2592000
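The same throttles can be pushed into a running cluster without restarting OSDs. A sketch using injectargs (option names mirror the [osd] settings above; injected values do not survive an OSD restart, so keep ceph.conf as the source of truth):

```shell
# Sketch: apply the recovery throttles to every live OSD via injectargs.
apply_recovery_throttles() {
  ceph tell 'osd.*' injectargs \
    '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_client_op_priority 63'
}
```

Useful when a recovery storm is already hammering clients and you can't wait for a rolling restart.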
Tuning
Get Closer to Your Users! Don't Wall Them Off With Process!
• Chatops!
• Ditch tools like Lync / Sametime.
• '1 to 1' enterprise chat apps are dead men walking.
• Consider Slack / Hipchat
• Foster an enterprise community around your tech with available tools
• REST API integrations allow far more robust notifications of issues in a 'stream of consciousness' fashion.
Quick Feedback
Improved Hardware
• OSD nodes – Cisco C240M4 SFF
  • 20 x 10k Seagate SAS 1.1TB
  • 6 x 480GB Intel S3500 SSD
    • We have tried the 'high-durability' Toshiba SSD – they seem to work pretty well.
  • Journal partition on SSD with 5:1 OSD/journal ratio
  • 90 OSD total = ~100 TB
• Improved LSI 'MegaRaid' controller – SAS-9271-8i
  • Supercap
  • Writeback capability
  • 18 x RAID0
  • Writethru on journals, writeback on spinning OSDs
• Based on "Hammer" Ceph release
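The 5:1 OSD-to-journal mapping can be expressed with ceph-deploy's host:data-disk:journal-device syntax. A sketch with invented hostnames and device names:

```shell
# Sketch: N spinning OSDs sharing numbered journal partitions on one SSD.
# CEPH_DEPLOY can be overridden for testing; defaults to the real tool.
CEPH_DEPLOY=${CEPH_DEPLOY:-ceph-deploy}

prepare_osds() {
  local host=$1 ssd=$2
  shift 2
  local part=1
  for disk in "$@"; do
    # e.g. osd01:sdb:/dev/sdu1 -- data disk sdb, journal on SSD partition 1
    "$CEPH_DEPLOY" osd create "${host}:${disk}:/dev/${ssd}${part}"
    part=$((part + 1))
  done
}

# Usage (illustrative): prepare_osds osd01 sdu sdb sdc sdd sde sdf
```

One SSD per five spinners keeps the journal from becoming the write bottleneck while bounding the blast radius of an SSD failure.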
After understanding that slower, high capacity disk wouldn't meet our needs for an Openstack general purpose block storage solution – we rebuilt.
Current State
• Obtaining metrics from our design change was nearly immediate due to having effective monitors in place
  – Latency improvements have been extreme
  – IOWait% within Openstack instances has been greatly reduced
  – Raw IOPS throughput has skyrocketed
  – Throughput testing with RADOS bench and FIO shows an approx. 10-fold increase
  – User feedback has been extremely positive; the general Openstack experience at Target is much improved. Feedback enhanced by group chat tools.
  – Performance within Openstack instances has increased about 10x
Results
Before (initial deployment):
test: (groupid=0, jobs=1): err= 0: pid=1914
read : io=1542.5MB, bw=452383 B/s, iops=110, runt=3575104msec
write: io=527036KB, bw=150956 B/s, iops=36, runt=3575104msec

After (rebuilt cluster):
test: (groupid=0, jobs=1): err= 0: pid=2131
read : io=2046.6MB, bw=11649KB/s, iops=2912, runt=179853msec
write: io=2049.1MB, bw=11671KB/s, iops=2917, runt=179853msec
Current State
• Forcing the physical world to bend to our will – getting datacenter techs to understand the importance of rack placements in modern 'scale-out' IT
  – To date our server placement has been 'what's the next open slot?'
  – Create a 'rack' unit of cloud expansion
  – More effectively utilize CRUSH for data placement and availability
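Teaching CRUSH about racks is a matter of declaring rack buckets and moving hosts into them. A sketch with invented rack and host names, using the standard ceph CLI:

```shell
# Sketch: describe two racks to CRUSH and place one host in each, so
# replica placement can be made rack-aware. Names are illustrative.
crush_rack_layout() {
  ceph osd crush add-bucket rack1 rack
  ceph osd crush add-bucket rack2 rack
  ceph osd crush move rack1 root=default
  ceph osd crush move rack2 root=default
  ceph osd crush move osd01 rack=rack1
  ceph osd crush move osd02 rack=rack2
}
```

With a CRUSH rule that chooses leaves across rack buckets, losing a whole rack then costs one replica instead of all of them.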
• Normalizing our Private Cloud storage performance offerings
  – ENABLE IOPS LIMITS IN NOVA! QEMU supports this natively. Avoid the all-you-can-eat IO buffet.

$ nova-manage flavor set_key --name m1.small --key quota:disk_read_iops_sec --value 300
$ nova-manage flavor set_key --name m1.small --key quota:disk_write_iops_sec --value 300
– Leverage Cinder as the storage ‘menu’ beyond the default offering.
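One way to build that menu is chaining the standard cinder qos-create / type-create / qos-associate commands. A sketch; the type name, QoS limits, and table-parsing awk are assumptions about your deployment and client output:

```shell
# Sketch: create an IOPS-capped Cinder volume type as a 'standard' tier.
# The awk lines pull the new IDs out of the cinderclient table output.
create_limited_tier() {
  local qos_id type_id
  qos_id=$(cinder qos-create standard-iops consumer=front-end \
             read_iops_sec=300 write_iops_sec=300 | awk '$2 == "id" {print $4}')
  type_id=$(cinder type-create standard | awk '$4 == "standard" {print $2}')
  cinder qos-associate "$qos_id" "$type_id"
}
```

Tenants then pick performance by volume type instead of everyone drawing from one unthrottled pool.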
• Experiment with NVME for journal disks – greater journal density.
• Currently testing all-SSD pool performance
  – All-SSD in Ceph has been maturing rapidly – Jewel sounds very promising.
  – We have needs for an 'ultra' Cinder tier for workloads that require high IOPS / low latency, for use cases such as Apache Cassandra
  – Considering Solidfire for this use case
Next Steps
• Repurposing legacy SATA hardware into a dedicated object pool
  – High capacity, low performance drives should work well in an object use case
  – Jewel has per-tenant namespaces for RADOSGW (!)
• Automate deployment with Chef to bring parity with our Openstack automation. ceph-deploy still seems to be working for us.
  – Use TDD to enforce base server configurations
• Broadening Ceph beyond its cloud niche use case, especially with the improved object offering.
Next Steps
• Before embarking on creating a Ceph environment, have a good idea of what your objectives are for the environment.
  – Capacity?
  – Performance?
• If you make wrong decisions, it can lead to a negative user perception of Ceph and the technologies that depend on it, like Openstack
• Once you understand your objective, understand that your hardware selection is crucial to your success
• Unless you are architecting for raw capacity, use SSDs for your journal volumes without exception
  – If you must co-locate journals, use a RAID adapter with BBU + writeback cache
• A hybrid approach may be feasible with SATA 'capacity' disks and SSD or NVME journals. I've yet to try this; I'd be interested in seeing some benchmark data on a setup like this
• Research, experiment, break stuff, consult with Red Hat / Inktank
• Monitor, monitor, monitor – and provide a very short feedback loop for your users to engage you with their concerns
Conclusion
Thanks For Your Time!
Questions?