virtual machine block storage with the ceph distributed storage system
sage weil
xensummit – august 28, 2012
outline
● why you should care
● what is it, what it does
● how it works, how you can use it
  ● architecture
  ● objects, recovery
  ● rados block device
  ● integration
  ● path forward
● who we are, why we do this
why should you care about another storage system?
requirements, time, cost
requirements
● diverse storage needs
  ● object storage
  ● block devices (for VMs) with snapshots, cloning
  ● shared file system with POSIX, coherent caches
  ● structured data... files, block devices, or objects?
● scale
  ● terabytes, petabytes, exabytes
  ● heterogeneous hardware
  ● reliability and fault tolerance
time
● ease of administration
  ● no manual data migration, load balancing
  ● painless scaling
    ● expansion and contraction
    ● seamless migration
cost
● low cost per gigabyte
● no vendor lock-in
● software solution
  ● run on commodity hardware
● open source
what is ceph?
● RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
● LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
open source
● LGPLv2
  ● copyleft
  ● free to link to proprietary code
● no copyright assignment
  ● no dual licensing
  ● no “enterprise-only” feature set
● active community
● commercial support available
distributed storage system
● data center (not geo) scale
  ● 10s to 10,000s of machines
  ● terabytes to exabytes
● fault tolerant
  ● no SPoF
  ● commodity hardware
    – ethernet, SATA/SAS, HDD/SSD
    – RAID, SAN probably a waste of time, power, and money
object storage model
● pools
  ● 1s to 100s
  ● independent namespaces or object collections
  ● replication level, placement policy
● objects
  ● trillions
  ● blob of data (bytes to gigabytes)
  ● attributes (e.g., “version=12”; bytes to kilobytes)
  ● key/value bundle (bytes to gigabytes)
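A minimal sketch of that model through the python-rados binding (the pool name `data`, the object name, and the attribute are illustrative; exact call signatures can vary slightly between Ceph releases):

```python
import rados

# connect using the local ceph.conf and default keyring
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# a pool is an independent namespace with its own replication and placement policy
ioctx = cluster.open_ioctx('data')

# an object is a named blob of bytes...
ioctx.write_full('greeting', b'hello ceph')

# ...plus small named attributes (and, separately, a key/value bundle)
ioctx.set_xattr('greeting', 'version', b'12')

print(ioctx.read('greeting'))                    # b'hello ceph'
print(ioctx.get_xattr('greeting', 'version'))    # b'12'

ioctx.close()
cluster.shutdown()
```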
object storage cluster
● conventional client/server model doesn't scale
  ● server(s) become bottlenecks; proxies are inefficient
  ● if storage devices don't coordinate, clients must
● ceph-osds are intelligent storage daemons
  ● coordinate with peers
  ● sensible, cluster-aware protocols
  ● sit on local file system
    – btrfs, xfs, ext4, etc.
    – leveldb
[diagram: each OSD daemon sits on a local file system (btrfs, xfs, or ext4) on its own disk; a small set of monitors (M) sits alongside the OSDs]
Monitors:
● Maintain cluster state
● Provide consensus for distributed decision-making
● Small, odd number
● These do not serve stored objects to clients

OSDs:
● One per disk or RAID group
● At least three in a cluster
● Serve stored objects to clients
● Intelligently peer to perform replication tasks
[diagram: a human contacts the cluster through the three monitors (M)]
data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
  ● ceph-osds on hosts in racks in rows in data centers
● three approaches
  ● pick a spot; remember where you put it
  ● pick a spot; write down where you put it
  ● calculate where to put it, where to find it
CRUSH:
● Pseudo-random placement algorithm
● Fast calculation, no lookup
● Ensures even distribution
● Repeatable, deterministic
● Rule-based configuration
  ● specifiable replication
  ● infrastructure topology aware
  ● allows weighting
● Stable mapping
  ● limited data migration
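The key property is that every client computes placement from a compact cluster description instead of asking a directory service. The sketch below is not the real CRUSH algorithm (which descends a weighted hierarchy of data centers, rows, racks, and hosts according to placement rules); it is only a toy illustration of how a deterministic, weighted, hash-based choice gives a repeatable mapping with no lookup table:

```python
import hashlib
import math

def uniform(*parts):
    """Deterministic pseudo-random value in (0, 1] derived from the inputs."""
    digest = hashlib.md5("/".join(map(str, parts)).encode()).hexdigest()
    return (int(digest, 16) % 2**32 + 1) / 2**32

def place(obj, osds, replicas=3):
    """Pick `replicas` distinct OSDs for an object.

    `osds` maps osd id -> weight (roughly proportional to capacity).
    Any client holding the same map computes the same answer, so there
    is no central table to consult and the mapping is stable.
    """
    chosen = []
    for r in range(replicas):
        # weighted "longest straw wins": log(u)/weight is an exponential
        # race, so each OSD is selected in proportion to its weight
        best = max(
            (osd for osd in osds if osd not in chosen),
            key=lambda osd: math.log(uniform(obj, r, osd)) / osds[osd],
        )
        chosen.append(best)
    return chosen

osd_map = {0: 1.0, 1: 1.0, 2: 2.0, 3: 1.0}             # osd 2 has twice the capacity
print(place("rbd_data.1.0000000000000005", osd_map))   # same result on every client
```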
distributed object storage
● CRUSH tells us where data should go
● small “osd map” records cluster state at point in time
  ● ceph-osd node status (up/down, weight, IP)
  ● CRUSH function specifying desired data distribution
● object storage daemons (RADOS)
  ● store it there
  ● migrate it as the cluster changes
● decentralized, distributed approach allows
  ● massive scales (10,000s of servers or more)
  ● efficient data access
  ● the illusion of a single copy with consistent behavior
large clusters aren't static
● dynamic cluster
  ● nodes are added, removed; nodes reboot, fail, recover
  ● recovery is the norm
● osd maps are versioned
  ● shared via gossip
● any map update potentially triggers data migration
  ● ceph-osds monitor peers for failure
  ● new nodes register with a monitor
  ● administrator adjusts weights, marks out old hardware, etc.
what does this mean for my cloud?
● virtual disks
  ● reliable
  ● accessible from many hosts
● appliances
  ● great for small clouds
  ● not viable for public or (large) private clouds
● avoid single server bottlenecks
● efficient management
[diagram: a VM inside a virtualization container; the container uses LIBRBD on top of LIBRADOS to reach the monitors (M) and OSDs]
[diagram: two containers, each with LIBRBD on LIBRADOS, attached to the same cluster; the VM can run from either container, since its disk lives in the cluster]
[diagram: a host maps an image with KRBD (kernel module), talking directly to the monitors (M) and OSDs]
RBD: RADOS Block Device
● Replicated, reliable, high-performance virtual disk
● Allows decoupling of VMs and containers
  ● Live migration!
● Images are striped across the cluster
● Snapshots!
● Native support in the Linux kernel
  ● /dev/rbd1
● librbd allows easy integration
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
instant copy
[diagram: a 144-object master image plus four instant copies that store 0 objects of their own = 144 objects in total]
[diagram: a client writes to one of the copies; only the 4 modified objects are written into that copy, 144 + 4 = 148 objects in total]
[diagram: a client reads from the copy; unmodified data is read from the master image's objects, modified data from the copy's own 4 objects, still 148 objects in total]
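The trick behind those pictures is copy-on-write layering: snapshot a golden image once, then clone it many times; each clone shares the parent's objects and only stores the objects it actually overwrites. A hedged sketch with the python-rbd binding (pool and image names are made up, and the parent is assumed to be a format-2 image with layering enabled):

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
r = rbd.RBD()

# snapshot the golden image and protect the snapshot so clones can reference it
with rbd.Image(ioctx, 'golden-image') as parent:
    parent.create_snap('base')
    parent.protect_snap('base')

# "instant copy": each clone starts out storing no objects of its own
for i in range(1000):
    r.clone(ioctx, 'golden-image', 'base', ioctx, 'vm-%d' % i,
            features=rbd.RBD_FEATURE_LAYERING)

# a write allocates only the modified objects in the clone;
# reads of untouched ranges fall through to the parent's objects
with rbd.Image(ioctx, 'vm-0') as disk:
    disk.write(b'\x00' * 4096, 0)

ioctx.close()
cluster.shutdown()
```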
current RBD integration
● native Linux kernel support
  ● /dev/rbd0, /dev/rbd/<poolname>/<imagename>
● librbd
  ● user-level library
● Qemu/KVM
  ● links to librbd user-level library
● libvirt
  ● librbd-based storage pool
  ● understands RBD images
  ● can only start KVM VMs... :-(
● CloudStack, OpenStack
what about Xen?
● Linux kernel driver (i.e. /dev/rbd0)
  ● easy fit into existing stacks
  ● works today
  ● need recent Linux kernel for dom0
● blktap
  ● generic kernel driver, userland process
  ● easy integration with librbd
  ● more featureful (cloning, caching), maybe faster
  ● doesn't exist yet!
● rbd-fuse
  ● coming soon!
libvirt
● CloudStack, OpenStack
● libvirt understands rbd images, storage pools
  ● xml specifies cluster, pool, image name, auth
● currently only usable with KVM
  ● could configure /dev/rbd devices for VMs
librbd
● management
  ● create, destroy, list, describe images
  ● resize, snapshot, clone
● I/O
  ● open, read, write, discard, close
● C, C++, Python bindings
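For example, the Python binding exposes those management and I/O calls roughly as follows (pool and image names are illustrative; this is a sketch, not the exact API of every release):

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
r = rbd.RBD()

# management: create, list, resize, destroy
r.create(ioctx, 'test-image', 1 * 1024**3)      # 1 GiB virtual disk
print(r.list(ioctx))                            # ['test-image', ...]

with rbd.Image(ioctx, 'test-image') as img:
    img.resize(2 * 1024**3)

    # I/O: write, read, discard
    img.write(b'hello block device', 0)
    print(img.read(0, 18))
    img.discard(0, 4096)                        # punch a hole in the image

r.remove(ioctx, 'test-image')                   # destroy the image
ioctx.close()
cluster.shutdown()
```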
RBD roadmap
● locking
  ● fence failed VM hosts
● clone performance
  ● KSM (kernel same-page merging) hints
● caching
  ● improved librbd caching
  ● kernel RBD + bcache to local SSD/disk
why
● limited options for scalable open source storage
● proprietary solutions
  ● marry hardware and software
  ● expensive
  ● don't scale (out)
● industry needs to change
who we are
● Ceph created at UC Santa Cruz (2007)
● supported by DreamHost (2008-2011)
● Inktank (2012)
● growing user and developer community
● we are hiring
  ● C/C++/Python developers
  ● sysadmins, testing engineers
  ● Los Angeles, San Francisco, Sunnyvale, remote
http://ceph.com/