Build a High-Performance and High-Durable Block Storage Service Based on Ceph

TRANSCRIPT

1. Build a High-Performance and High-Durable Block Storage Service Based on Ceph. Rongze & Haomai, [ rongze | haomai ]@unitedstack.com

2. CONTENTS: 1 Block Storage Service, 2 High Performance, 3 High Durability, 4 Operation Experience
3. THE FIRST PART: Block Storage Service
4. Block Storage Service highlights: 6000 IOPS, 170 MB/s, 95% of requests under 2 ms (SLA); 3 copies, strong consistency, 99.99999999% durability; all management operations complete in seconds; real-time snapshots; a performance volume type and a capacity volume type.
5. Software used
6. Software versions over time (columns: OpenStack Essex / Folsom / Havana / Icehouse-Juno / Juno, the last column being the current deployment):
   Ceph: 0.42 / 0.67.2 / based on 0.67.5 / based on 0.67.5 / based on 0.80.7
   CentOS: 6.4 / 6.5 / 6.5 / 6.6
   Qemu: 0.12 / 0.12 / based on 1.2 / based on 1.2 / 2.0
   Kernel: 2.6.32 / 3.12.21 / 3.12.21 / ?
7. Deployment architecture
8. Minimum deployment: 12 OSD nodes and 3 monitor nodes. The Compute/Storage nodes are connected by two 40 Gb switches.
9. Scale-out
10. The minimum-scale deployment: 12 OSD nodes (Compute/Storage nodes) behind a pair of 40 Gb switches.
11. Scaled-out deployment: up to 144 OSD nodes, built from the same Compute/Storage node and 40 Gb switch building blocks.
12. OpenStack
13. Traditional image path: Nova boots VMs from a 20 GB image that Glance serves over HTTP (from LocalFS, Swift, Ceph, or GlusterFS) onto Cinder backends (LVM, SAN, Ceph, LocalFS, NFS, GlusterFS). On a 1 Gb network: 20 GB / 100 MB/s = 200 s, about 3 minutes; on a 10 Gb network: 20 GB / 1000 MB/s = 20 s. A boot storm multiplies this cost.
14. Nova, Glance, and Cinder use the same Ceph pool: all actions complete in seconds and there is no boot storm.
15. QoS: Nova, Libvirt, Qemu (throttle). Two volume types: Cinder multi-backend with a Ceph SSD pool and a Ceph SATA pool. Shared volumes: read-only and multi-attach. (Sketches of the volume-type backends and the QoS limits follow this slide.)
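The two volume types map naturally onto Cinder's multi-backend support. The deck does not show the actual configuration, so the cinder.conf fragment below is only a sketch; the pool names, rbd user, and backend names are assumptions:

    # cinder.conf sketch: one backend per Ceph pool (names are assumptions)
    [DEFAULT]
    enabled_backends = ceph-ssd,ceph-sata

    [ceph-ssd]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes-ssd
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    volume_backend_name = ceph-ssd

    [ceph-sata]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes-sata
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    volume_backend_name = ceph-sata

Each volume type is then bound to one backend, for example: cinder type-create performance; cinder type-key performance set volume_backend_name=ceph-ssd.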
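The QoS chain (Nova to libvirt to Qemu throttling) can be expressed through flavor extra specs, which Nova translates into libvirt iotune limits enforced by Qemu. The deck's smooth/burst limiting is a Qemu-side backport, so this only illustrates the stock mechanism; the flavor name is hypothetical and the numbers are assumptions that roughly match the advertised 6000 IOPS / 170 MB/s figures:

    # Sketch: cap a flavor's disk I/O via extra specs (values illustrative)
    nova flavor-key m1.performance set quota:disk_read_iops_sec=6000
    nova flavor-key m1.performance set quota:disk_write_iops_sec=6000
    nova flavor-key m1.performance set quota:disk_read_bytes_sec=178257920
    nova flavor-key m1.performance set quota:disk_write_bytes_sec=178257920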
16. THE SECOND PART: High Performance
17. OS configuration
18. CPU: take the CPUs out of power-save mode: echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null. Cgroup: bind ceph-osd processes to fixed cores (1-2 cores per OSD). Memory: turn NUMA off in /etc/grub.conf if the hardware supports it; set vm.swappiness = 0. Block: echo deadline > /sys/block/sd[x]/queue/scheduler. Filesystem: mount with noatime and nobarrier.
19. Qemu
20. Throttle: smooth I/O limit algorithm (backported). RBD enhancements: discard and flush enhancements (backported). Burst support. Virtio-scsi: multi-queue support.
21. Ceph IO Stack
22. [Diagram: the I/O path from Qemu to the OSD. Qemu side: VCPU thread, Qemu main thread, RadosClient finisher. Network: Pipe::Writer, Pipe::Reader, DispatchThreader. OSD side: OSD::OpWQ, FileJournal::Writer, FileJournal finisher, FileStore::OpWQ, FileStore::SyncThread, FileStore op_finisher and ondisk_finisher. An RBD image is striped over RADOS objects, and each object is a file in the FileStore.]
23. Ceph Optimization
24. Rule 1: Keep FD. Facts: the FileStore is the bottleneck; performance degrades remarkably when the FD cache misses. One 480 GB SSD holds 122880 objects at 4 MB, or 30720 objects at 16 MB, in theory. Action: increase the FDCache/OMap header cache so it can hold all objects; increase the object size to 16 MB instead of the 4 MB default; raise the default OS fd limits. Configuration: filestore_omap_header_cache_size, filestore_fd_cache_size, rbd_chunk_size (OpenStack Cinder).
25. Rule 2: Sparse read/write. Facts: only a few KB of an object actually hold data in typical RBD usage; clone and recovery copy the full object, which is harmful to performance and capacity. Action: use sparse read/write. Problem: XFS and other local filesystems have existing fiemap bugs. Configuration: filestore_fiemap = true.
26. Rule 3: Drop the default limits. Facts: the default configuration values are tuned for HDD backends. Action: change all throttle-related configuration values. Configuration: filestore_wbthrottle_*, filestore_queue_*, journal_queue_*, plus related recovery and scrub settings.
27. Rule 4: Use the RBD cache. Facts: the RBD cache gives a remarkable performance improvement for sequential read/write. Action: enable the RBD cache. Configuration: rbd_cache = true.
28. Rule 5: Speed up the cache. Facts: the default cache container implementation is not suited to large cache capacities. Temporary action: switch the cache container to RandomCache (out of the master branch) for FDCache, the OMap header cache, and ObjectCacher. Next: RandomCache is not suitable for generic situations; implement an effective ARC to replace it.
29. Rule 6: Keep threads running. Facts: thread wakeup is ineffective. Action: keep the OSD queue running. Configuration: still in a pull request (https://github.com/ceph/ceph/pull/2727), osd_op_worker_wake_time.
30. Rule 7: AsyncMessenger (experimental). Facts: each client needs two threads on the OSD side, with painful context-switch latency. Action: use the async messenger. Configuration: ms_type = async, ms_async_op_threads = 5. (A ceph.conf sketch collecting the options from Rules 1-7 follows the results slides below.)
31. Result: IOPS. Based on Ceph 0.67.5, single OSD.
32. Result: latency. 4K random write on a 1 TB RBD image: 1.05 ms per IO. 4K random read on a 1 TB RBD image: 0.5 ms per IO. A 1.5x latency improvement and outstanding performance on large datasets.
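Collecting the options named in Rules 1 to 7 into one place, a ceph.conf fragment could look roughly like the sketch below. The option names come from the slides; the values are illustrative assumptions, and a few options (the async messenger, osd_op_worker_wake_time) lived in UnitedStack's own branch or an open pull request rather than the stock 0.67 release:

    # ceph.conf sketch for the tuning rules above (values are assumptions)
    [osd]
    # Rule 1: hold all object FDs and omap headers in cache
    filestore_fd_cache_size = 32768
    filestore_omap_header_cache_size = 32768
    # Rule 2: sparse read/write (watch for local-filesystem fiemap bugs)
    filestore_fiemap = true
    # Rule 3: raise the HDD-oriented throttles, for example
    filestore_queue_max_ops = 5000
    journal_queue_max_ops = 5000
    # Rule 6: only available in the pull request, not stock 0.67
    # osd_op_worker_wake_time = ...
    # Rule 7 (experimental): async messenger
    ms_type = async
    ms_async_op_threads = 5

    [client]
    # Rule 4: client-side RBD cache
    rbd_cache = true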
33. THE THIRD PART: High Durability
34. Ceph reliability model: https://wiki.ceph.com/Development/Reliability_model; CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data; Copysets: Reducing the Frequency of Data Loss in Cloud Storage; the Ceph CRUSH code. Durability formula: P = func(N, R, S, AFR), with P = Pr * M / C(R, N).
35. Where do we need to optimize?
36. Data placement decides durability; the CRUSH map decides data placement; therefore the CRUSH map decides durability.
37. What do we need to optimize?
38. Durability depends on the OSD recovery time and on the number of copy sets in the Ceph pool. Every possible PG OSD set is a copy set, and data loss in Ceph is the loss of a PG, which is really the loss of a copy set. If the replication number is 3 and we lose 3 OSDs, the probability of data loss depends on the number of copy sets, because those 3 OSDs may not form a copy set.
39. The shorter the recovery time, the higher the durability. The fewer copy sets, the higher the durability.
40. How do we optimize it?
41. The CRUSH map settings decide the OSD recovery time and the number of copy sets.
42. Default CRUSH map setting
43. [Diagram: root, rack-01/rack-02/rack-03, server-01 through server-24; 3 racks, 24 nodes, 72 OSDs, 24 OSDs per rack. If R = 3, M = 24 * 24 * 24 = 13824.]
44. Default CRUSH setting
45. N = 72, S = 3:
    R = 1: C(R, N) = 72, M = 72, Pr = 0.99, P = 0.99
    R = 2: C(R, N) = 2556, M = 1728, Pr = 2.1e-4, P = 1.4e-4, 3 nines
    R = 3: C(R, N) = 119280, M = 13824, Pr = 4.6e-8, P = 5.4e-9, 8 nines
46. Reduce recovery time
47. With the default CRUSH map setting (e.g. server-08), if one OSD goes out, only two OSDs can take part in the data recovery, so the recovery time is too long. We need more OSDs doing recovery to reduce the recovery time.
48. Use an osd-domain bucket instead of the host bucket to reduce recovery time.
49. [Diagram: the 24 servers across rack-01/02/03 regrouped into six osd-domain buckets. If R = 3, M = 24 * 24 * 24 = 13824.]
50. New CRUSH map
51. N = 72, S = 12:
    R = 1: C(R, N) = 72, M = 72, Pr = 0.99, P = 0.99
    R = 2: C(R, N) = 2556, M = 1728, Pr = 7.8e-5, P = 5.4e-5, 4 nines
    R = 3: C(R, N) = 119280, M = 13824, Pr = 6.7e-9, P = 7.7e-10, 9 nines
52. Reduce the number of copy sets
53. Use a replica-domain bucket instead of the rack bucket.
54. [Diagram: the root (failure-domain) contains two replica-domains, each built from three osd-domains spanning server-01 through server-24. If R = 3, M = (12 * 12 * 12) * 2 = 3456.]
55. New CRUSH map
56. N = 72, S = 12:
    R = 1: C(R, N) = 72, M = 72, Pr = 0.99, P = 0.99, 0 nines
    R = 2: C(R, N) = 2556, M = 864, Pr = 7.8e-5, P = 2.7e-5, 4 nines
    R = 3: C(R, N) = 119280, M = 3456, Pr = 6.7e-9, P = 1.9e-10, 10 nines
57. THE FOURTH PART: Operation Experience
58. Deploy: eNovance puppet-ceph, Stackforge puppet-ceph, UnitedStack puppet-ceph; shorter deploy time, support for all Ceph options, support for multiple disk types, wwn-id instead of disk labels, hieradata.
59. Operation goal: availability. Reduce unnecessary data migration and reduce slow requests.
60. Upgrade Ceph: 1. noout: ceph osd set noout; 2. mark down: ceph osd down x; 3. restart: service ceph restart osd.x
61. Reboot a host: 1. migrate the VMs; 2. mark the OSDs down; 3. reboot the host.
62. Expand Ceph capacity: 1. set the CRUSH map; 2. set the recovery options; 3. trigger the data migration; 4. observe the data recovery rate; 5. observe slow requests. (A sketch of the recovery throttling follows the accidents slide below.)
63. Replace a disk: be careful to keep the replica-domain weights unchanged, otherwise data (PGs) will migrate to another replica-domain.
64. Monitoring: diamond with new collectors (ceph perf dump, ceph status); graphite stores the data; grafana displays it; alerting via zabbix and the ceph health command. (A sketch of the underlying perf dump command follows the accidents slide.)
65. Throttle model
66. Add a new collector in diamond and redefine the metric names in graphite: [process].[what].[component].[attr]
67. Components: osd_client_messenger, osd_dispatch_client, osd_dispatch_cluster, osd_pg, osd_pg_client_w, osd_pg_client_r, osd_pg_client_rw, osd_pg_cluster_w, filestore_op_queue, filestore_journal_queue, filestore_journal, filestore_wb, filestore_leveldb, filestore_commit. Attributes: max_bytes, max_ops, ops, bytes, op/s, in_b/s, out_b/s, lat.
68. Accidents seen in production: SSD GC, network failure, Ceph bugs, XFS bugs, SSD corruption, inconsistent PGs, recovery data filling the network bandwidth.
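For the capacity-expansion procedure above, recovery has to be throttled so that backfill does not crowd out client I/O. The deck does not give its values, so the following is only a sketch using standard Ceph options with illustrative settings:

    # Throttle recovery/backfill before triggering the data migration.
    # The option names are standard Ceph options; the values are assumptions.
    ceph tell osd.* injectargs '--osd-max-backfills 1'
    ceph tell osd.* injectargs '--osd-recovery-max-active 1'
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
    # Then watch the recovery rate and slow requests while data migrates.
    ceph -w
    ceph health detail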
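The diamond collectors described above poll the OSD performance counters through the admin socket. A minimal sketch of the underlying commands, assuming the default socket path and OSD id 0:

    # Per-OSD performance counters, polled by the diamond collector.
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
    # Cluster-wide state for the status collector.
    ceph status --format json

The collector would then republish these counters under the [process].[what].[component].[attr] naming shown on the metric slide.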
69. THANK YOU FOR WATCHING. @ UnitedStack, 2014/11/05
70. https://www.ustack.com/jobs/
71. About UnitedStack
72. UnitedStack: the leading OpenStack cloud service solution provider in China.
73. VM deployment in seconds; WYSIWYG network topology; high-performance cloud storage; billing by the second.
74. Unified cloud service platform (U Center): public cloud (devel, Beijing1, Guangdong1), managed cloud, cloud storage; mobile social, financial, IDC, test; unified ops and unified SLA.
75. The detail of the durability formula
76. Data placement decides durability; the CRUSH map decides data placement; therefore the CRUSH map decides durability.
77. [Diagram: default CRUSH setting, root, rack-01/rack-02/rack-03, server-01 through server-24.]
78. 3 racks, 24 nodes, 72 OSDs. How do we compute durability?
79. Ceph reliability model
80. https://wiki.ceph.com/Development/Reliability_model; CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data; Copysets: Reducing the Frequency of Data Loss in Cloud Storage; the Ceph CRUSH code.
81. Durability formula
82. P = func(N, R, S, AFR), where P is the probability of losing all copies, N is the number of OSDs in the Ceph pool, R is the number of copies, S is the number of OSDs in a bucket (it decides the recovery time), and AFR is the disk annualized failure rate.
83. Failure events are considered to be Poisson. Failure rates are characterized in units of failures per billion hours (FITs), so all periodicities are represented in FITs and all times in hours: fit = failures in time = 1/MTTF ~= 1/MTBF = AFR / (24 * 365). Event probabilities, where lambda is the failure rate: the probability of n failure events during time t is Pn(lambda, t) = (lambda * t)^n * e^(-lambda * t) / n!
84. The probability of data loss: an OSD set that any PG resides in is a copy set; data loss means the loss of any such OSD set. We ignore non-recoverable errors (assume NREs never happen), which might be true on scrubbed OSDs.
85. Non-recoverable errors: NREs are read errors that cannot be corrected by retries or ECC, such as media noise, high-fly writes, or off-track writes.
86. The probability of losing R OSDs: 1. the probability of an initial OSD loss incident; 2. having suffered this loss, the probability of losing R-1 further OSDs within the recovery time; 3. multiply the two together; the result is Pr.
87. The probability of copy set loss: 1. M = the number of copy sets in the Ceph pool; 2. the number of ways to pick any R OSDs is C(R, N); 3. the probability of copy set loss is Pr * M / C(R, N).
88. P = Pr * M / C(R, N). If R = 3, one PG maps to (osd.x, osd.y, osd.z), and (osd.x, osd.y, osd.z) is a copy set. All copy sets are in line with the rules of the CRUSH map. M is the number of copy sets in the Ceph pool, Pr is the probability of losing R OSDs, C(R, N) counts any R OSDs out of N, and P is the probability of losing any copy set (data loss).
89. [Diagram: default CRUSH setting, 3 racks of 24 OSDs each, server-01 through server-24. If R = 3, M = 24 * 24 * 24 = 13824.]
90. Default CRUSH setting
91. AFR = 0.04; one-disk recovery rate = 100 MB/s; mark-out time = 10 minutes.
92. N = 72, S = 3:
    R = 1: C(R, N) = 72, M = 72, Pr = 0.99, P = 0.99
    R = 2: C(R, N) = 2556, M = 1728, Pr = 2.1e-4, P = 1.4e-4, 3 nines
    R = 3: C(R, N) = 119280, M = 13824, Pr = 4.6e-8, P = 5.4e-9, 8 nines
93. Trade-off
94. Trade-off between durability and availability with the new CRUSH map (N = 72, R = 3):
    S = 3: 11 nines, recovery time 31 mins
    S = 6: 10 nines, recovery time 13 mins
    S = 12: 10 nines, recovery time 6 mins
    S = 24: 9 nines, recovery time 3 mins
95. Shorter recovery time minimizes the impact on the SLA.
96. Final CRUSH map. Old map: root, rack, host, osd. New map: failure-domain, replica-domain, osd-domain, osd. (A CRUSH map sketch follows.)
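The deck does not include the text of the final CRUSH map, so the following is only a minimal sketch of what the new hierarchy and placement rule could look like; the type ids, bucket name, and ruleset number are assumptions:

    # Bucket types for the new map (replacing root/rack/host/osd):
    type 0 osd
    type 1 osd-domain
    type 2 replica-domain
    type 3 failure-domain

    # Placement rule: keep all R replicas inside one replica-domain,
    # but spread them across distinct osd-domains.
    rule replicated_new {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take failure-domain-0          # assumed name of the top bucket
        step choose firstn 1 type replica-domain
        step chooseleaf firstn 0 type osd-domain
        step emit
    }

With two replica-domains of three osd-domains (12 OSDs each), this placement yields M = (12 * 12 * 12) * 2 = 3456 copy sets, matching the figure on the earlier slide.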