Ceph: A Scalable, High-Performance Distributed File System
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long
Contents
• Goals
• System Overview
• Client Operation
• Dynamically Distributed Metadata
• Distributed Object Storage
• Performance
Goals
• Scalability – storage capacity, throughput, and client performance; emphasis on HPC
• Reliability – “…failures are the norm rather than the exception…”
• Performance – dynamic workloads
(Slides 4 and 5, system diagrams taken from Brandt’s presentation, are not transcribed here.)
System Overview
Key Features
• Decoupled data and metadata – CRUSH
  – Files are striped onto predictably named objects
  – CRUSH maps objects to storage devices
• Dynamic distributed metadata management – dynamic subtree partitioning
  – Distributes metadata among MDSs
• Object-based storage – OSDs handle migration, replication, failure detection, and recovery
Client Operation
• Ceph interface
  – Nearly POSIX
  – Decoupled data and metadata operation
• User-space implementation
  – FUSE or directly linked

FUSE (Filesystem in Userspace) is a framework that allows a file system to be implemented in user space.
Client Access Example
1. Client sends an open request to the MDS
2. MDS returns the capability, file inode, file size, and stripe information
3. Client reads/writes directly from/to the OSDs
4. MDS manages the capability
5. Client sends a close request, relinquishes the capability, and provides details to the MDS
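The five steps above can be sketched as a message exchange. This is a minimal illustration, not Ceph's client code; the `Capability` fields and the `mds`/`osds` stubs are assumptions for the sketch:

```python
from dataclasses import dataclass

@dataclass
class Capability:
    """What the MDS hands back on open (hypothetical, simplified)."""
    inode: int
    size: int
    stripe_unit: int  # bytes per stripe object
    modes: set        # e.g. {"read", "write"}

class Client:
    """Sketch of the open / direct-I/O / close flow; mds and osds are stubs."""
    def __init__(self, mds, osds):
        self.mds, self.osds = mds, osds

    def open(self, path, mode):
        # Steps 1-2: open request to the MDS, which returns the capability
        # plus inode, size, and stripe information.
        self.cap = self.mds.open(path, mode)
        return self.cap

    def read(self, offset, length):
        # Step 3: data I/O goes straight to the OSDs, bypassing the MDS.
        ono = offset // self.cap.stripe_unit        # which stripe object
        osd = self.osds[ono % len(self.osds)]       # placeholder placement
        return osd.read(self.cap.inode, ono, offset % self.cap.stripe_unit, length)

    def close(self):
        # Step 5: relinquish the capability and report details to the MDS.
        self.mds.close(self.cap)
```

The MDS stays on the control path only (steps 1, 2, 4, 5); the data path in step 3 never touches it, which is the point of decoupling data from metadata.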
Synchronization
• Adheres to POSIX
• Includes HPC-oriented extensions
  – Consistency and correctness by default
  – Optionally relax constraints via extensions
  – Extensions for both data and metadata
• Synchronous I/O is used with multiple writers or a mix of readers and writers
Distributed Metadata
• “Metadata operations often make up as much as half of file system workloads…”
• MDSs use journaling
  – Repetitive metadata updates are handled in memory
  – Optimizes on-disk layout for read access
• Adaptively distributes cached metadata across a set of nodes
Dynamic Subtree Partitioning
Distributed Object Storage
• Files are split across objects
• Objects are members of placement groups
• Placement groups are distributed across OSDs.
CRUSH
• CRUSH(x) → (osd1, osd2, osd3)
  – Inputs
    • x is the placement group
    • Hierarchical cluster map
    • Placement rules
  – Output: a list of OSDs
• Advantages
  – Anyone can calculate object location
  – Cluster map is infrequently updated
Data distribution
(not a part of the original PowerPoint presentation)
1. Files are striped into many objects: (ino, ono) → oid
2. Ceph maps objects into placement groups (PGs): hash(oid) & mask → pgid
3. CRUSH assigns placement groups to OSDs: CRUSH(pgid) → (osd1, osd2)
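The three mapping steps above can be sketched in Python. This is a simplified illustration, not Ceph's implementation: the object-name format, the MD5-based hash, the `PG_MASK` value, and the rendezvous-style `crush` stand-in are all assumptions; real CRUSH walks a hierarchical cluster map under placement rules.

```python
import hashlib

PG_MASK = 0xFF  # assume 256 placement groups (illustrative)

def file_to_oid(ino: int, ono: int) -> str:
    """Step 1: a stripe unit of a file is named by (inode number, object number)."""
    return f"{ino:x}.{ono:08x}"

def oid_to_pgid(oid: str) -> int:
    """Step 2: hash the object name and mask it down to a placement group id."""
    h = int(hashlib.md5(oid.encode()).hexdigest(), 16)
    return h & PG_MASK

def crush(pgid: int, osds: list[int], replicas: int = 2) -> list[int]:
    """Step 3: stand-in for CRUSH. Deterministically pick `replicas` distinct
    OSDs for a PG, so any client with the same OSD list computes the same
    placement without asking a central server."""
    picked = []
    for r in range(replicas):
        candidates = [o for o in osds if o not in picked]
        # Rendezvous-style choice: highest hash wins, deterministic per (pgid, r).
        best = max(candidates,
                   key=lambda o: hashlib.md5(f"{pgid}.{r}.{o}".encode()).hexdigest())
        picked.append(best)
    return picked

oid = file_to_oid(ino=0x1234, ono=0)
pgid = oid_to_pgid(oid)
print(oid, pgid, crush(pgid, osds=list(range(8))))
```

The property being demonstrated is the "anyone can calculate object location" advantage: placement is a pure function of the object name and the (rarely changing) cluster map, so no lookup table is needed.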
Replication
• Objects are replicated on OSDs within the same PG
  – The client is oblivious to replication
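A minimal sketch of primary-copy replication within a placement group (illustrative only, not RADOS's actual replication protocol; the class and method names are invented):

```python
class OSD:
    """Toy OSD holding objects in a dict."""
    def __init__(self, name: str):
        self.name = name
        self.store = {}

    def write(self, oid: str, data: bytes, replicas=()):
        # The primary applies the write, then forwards it to the other
        # OSDs in the placement group before acknowledging.
        self.store[oid] = data
        for r in replicas:
            r.store[oid] = data
        return "ack"

# The client talks only to the primary; replication happens behind it,
# which is why the client can stay oblivious to it.
pg = [OSD("osd0"), OSD("osd1"), OSD("osd2")]
primary, *others = pg
primary.write("obj.0", b"hello", replicas=others)
```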
Failure Detection and Recovery
• Failed OSDs are marked down; if they do not recover, they are marked out and their data is re-replicated
• Monitors check for intermittent problems
• New or recovered OSDs peer with the other OSDs in their PGs
Conclusion
• Scalability, reliability, performance
• Separation of data and metadata
  – CRUSH data distribution function
• Object-based storage
Per-OSD Write Performance
EBOFS Performance
Write Latency
OSD Write Performance
Diskless vs. Local Disk
Compares latencies of (a) an MDS whose metadata is stored entirely in a shared OSD cluster and (b) an MDS with a local disk holding its journal.
Per-MDS Throughput
Average Latency
Lessons learned
(not a part of the original PowerPoint presentation)
1. Replacing file allocation metadata with a globally known distribution function was a good idea
  • Simplified our design
2. We were right not to use an existing kernel file system for local object storage
3. The MDS load balancer has an important impact on overall system scalability, but deciding which metadata to migrate where is a difficult task
4. Implementing the client interface was more difficult than expected
  • Idiosyncrasies of FUSE
Related Links
• OBFS: A File System for Object-based Storage Devices
  – ssrc.cse.ucsc.edu/Papers/wang-mss04b.pdf
• OSD
  – www.snia.org/tech_activities/workgroups/osd/
• Ceph Presentation
  – http://institutes.lanl.gov/science/institutes/current/ComputerScience/ISSDM-07-26-2006-Brandt-Talk.pdf
  – Slides 4 and 5 from Brandt’s presentation
Acronyms
• CRUSH: Controlled Replication Under Scalable Hashing
• EBOFS: Extent and B-tree based Object File System
• HPC: High-Performance Computing
• MDS: Metadata Server
• OSD: Object Storage Device
• PG: Placement Group
• POSIX: Portable Operating System Interface
• RADOS: Reliable Autonomic Distributed Object Store