High Performance Ceph
Jun Park (Adobe) and Dan …
Source: schd.ws/.../4b/ceph_deep_dive-s1.pdf
TRANSCRIPT
Caveats
§ Relative meaning of “High Performance”
§ Still on a long journey; Evolving
§ Not large scale; unlike CERN
§ Possibly opinionated in some cases
How To Evaluate Storage?
Axes for evaluating storage: Capacity, IOPS (Bandwidth), Durability, Availability
Typically, you hit IOPS as the first bottleneck with HDDs
* IOPS: I/O operations per second, regardless of block size
Path For IOPS
I/O path: VM -> Compute Host -> Network -> Physical Store (HDD or SSD)
IOPS (almost bandwidth :)
Bandwidth (Throughput) = IOPS × Block Size
Relationship Between IOPS and Bandwidth
Bandwidth (Throughput) = IOPS × Block Size
As the block size doubles (4KB -> 8KB -> 16KB -> … -> 4096KB), IOPS stays flat in some cases; then you get double the bandwidth.
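A minimal Python sketch of that relationship (the 10,000 IOPS figure is made up for illustration, not from the talk):

def bandwidth_mb_per_s(iops: int, block_size_kb: int) -> float:
    # Throughput in MB/s = IOPS × block size (KB -> MB).
    return iops * block_size_kb / 1024

# If IOPS stays flat while the block size doubles, bandwidth doubles:
for bs_kb in (4, 8, 16):
    print(bs_kb, "KB:", bandwidth_mb_per_s(10_000, bs_kb), "MB/s")
# 4 KB: 39.0625 MB/s; 8 KB: 78.125 MB/s; 16 KB: 156.25 MB/s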
Ceph With Replication
§ Distribution Algorithm: Straw -> Tree
Diagram: Rack1..Rack4 with data nodes in each rack; 3 replicas placed across racks.
16 OSDs × 4 data nodes × 4 racks × 2TB / 3 replicas ≈ 170 TB (effective disk capacity for users)
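A quick check of that arithmetic in Python (numbers taken straight from the slide):

# Raw capacity divided by the replication factor gives usable capacity.
osds_per_node = 16
data_nodes_per_rack = 4
racks = 4
tb_per_osd = 2
replicas = 3

raw_tb = osds_per_node * data_nodes_per_rack * racks * tb_per_osd  # 512 TB raw
effective_tb = raw_tb / replicas                                   # ~170.7 TB for users
print(raw_tb, round(effective_tb, 1))  # 512 170.7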
Ceph Architecture
Diagram: Compute nodes, Ceph Monitor nodes, and Ceph Data nodes, each connected with 2 × 10G links.
VLAN100: Ceph Public network
VLAN200: Ceph Cluster network
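This VLAN split corresponds to Ceph's standard public/cluster network settings; a minimal ceph.conf sketch is below, with placeholder subnets (the talk does not give the actual addresses):

# Hypothetical ceph.conf excerpt; subnets are placeholders.
[global]
public network  = 10.100.0.0/24   # VLAN100: client/compute <-> Ceph traffic
cluster network = 10.200.0.0/24   # VLAN200: replication and recovery between data nodes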
Write Operation With Journaling
Journal writes go to SSDs; data lives on drives, e.g. SAS 2TB drives. With journaling, each client write is first committed to the SSD journal and then flushed to the data drive.
Data node CPU layout, for reference:
<Ceph1>:~# lscpu | egrep 'Thread|Core|Socket|^CPU\('
CPU(s):              48
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2
Consistent Hashing In Ceph
§ Advantages
  § No need to store metadata explicitly
  § Fast
§ Disadvantages
  § Overhead of rebalance
  § Operational difficulties in dealing with edge cases
§ E.g., Swift, Cassandra, Amazon Dynamo, and so on (a generic sketch of the idea follows below)
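To make the trade-offs concrete, here is a minimal, generic consistent-hash ring in Python. It illustrates the general idea shared by systems like Swift or Dynamo, not Ceph's CRUSH algorithm, and the node names are made up:

import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a string onto the ring with a stable hash.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    # Generic consistent-hash ring with virtual nodes.
    def __init__(self, nodes, vnodes=100):
        self._ring = sorted((_hash(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node1", "node2", "node3"])
print(ring.node_for("object-42"))
# Advantage: placement is computed, so no explicit metadata lookup.
# Disadvantage: adding/removing a node remaps a share of keys (rebalance).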
Network Bandwidth Impact (rados bench write)
rados bench write bandwidth (MB/s):
  10 Gbps interface: 1165
  20 Gbps interface: 1953
Same as in our lab with 20G (due to a smaller number of data nodes).
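As a rough sanity check (a sketch, not from the talk), converting line rate to MB/s shows the 10G result sits close to wire speed, which is why doubling the interface roughly doubles write bandwidth:

# Theoretical line rate vs. measured rados bench write bandwidth.
def gbps_to_mb_per_s(gbps: float) -> float:
    return gbps * 1000 / 8  # 1 Gbps ≈ 125 MB/s, ignoring protocol overhead

for gbps, measured_mb_s in ((10, 1165), (20, 1953)):
    line_rate = gbps_to_mb_per_s(gbps)
    print(f"{gbps} Gbps ≈ {line_rate:.0f} MB/s line rate, measured {measured_mb_s} MB/s")
# 10 Gbps ≈ 1250 MB/s line rate, measured 1165 MB/s
# 20 Gbps ≈ 2500 MB/s line rate, measured 1953 MB/s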
Different I/O Patterns With 128K Block Size
Block Size: 128K

Pattern     IOPS    Bandwidth (MB/s)
RandWrite   4828    604
SeqWrite    2913    364
SeqRead     7827    978
W25R75      1864    233    <- for write
Random Writes of 3 VMs On The Same Compute
• Block Size: 64KB
• System-wide max performance with other traffic: max 40,000 IOPS, 1367 MB/s write

       Bandwidth (MB/s)   IOPS
VM1    318                2547
VM2    313                2505
VM3    325                2600
Pain Points In Production
§ Computes vs. data nodes ratio?
§ Upgrading? E.g., DecaPod, separated from OpenStack
§ Operational overheads? E.g., adding more data nodes -> creating internal traffic; deep scrubbing (see the example settings below)
§ Pinpointing bottlenecks?
§ QoS?
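As one concrete illustration of the deep scrubbing and rebalance traffic pain points, these are commonly tuned OSD settings in ceph.conf; the values are placeholders for illustration, not recommendations from the talk:

# Hypothetical ceph.conf excerpt; values are illustrative placeholders.
[osd]
osd max scrubs = 1                 # concurrent scrubs per OSD
osd scrub begin hour = 22          # confine (deep) scrubbing to off-peak hours
osd scrub end hour = 6
osd deep scrub interval = 1209600  # seconds between deep scrubs (14 days)
osd max backfills = 1              # throttle backfill when adding data nodes
osd recovery max active = 1        # throttle concurrent recovery ops per OSD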
Pleasure Points
§ Generic architecture
  § With high tech such as NVMe SSDs, immediately improved
  § Various use cases
§ Good community
  § Open & stable
  § Works well with OpenStack
§ Truly scale out
  § High performance with low cost
Next Generation Ceph: BlueStore
Diagram: BlueStore keeps metadata in an optimal key-value store on SSDs and writes data directly to drives, e.g. SAS 2TB drives; no POSIX filesystem layer, etc.
Next Steps?
§ NVMe (Non-Volatile Memory Express) SSDs
§ Ceph Caching Tier
§ RDMA (Remote Direct Memory Access)
§ BlueStore