Download - vBACD July 2012 - Scaling Storage with Ceph
SCALING STORAGE WITH CEPH
Ross Turk, Inktank
RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
IN THE BEGINNING Magic Madzik, Flickr / CC BY 2.0
EARLY INFORMATION STORAGE Chico.Ferreira, Flickr / CC BY 2.0
WRITING > CAVE PAINTINGS kevingessner, Flickr / CC BY-SA 2.0
x1000
== x1
PEOPLE BEGIN WRITING A LOT Moyan_Brenn, Flickr / CC BY-ND 2.0
WRITING IS TIME-CONSUMING trekkyandy, Flickr / CC BY 2.0
THE INDUSTRIALIZATION OF WRITING FateDenied, Flickr / CC BY 2.0
x1000
== x1
+ magnet = tape magnetic tape
STORAGE BECOMES MECHANICAL Erik Pitti, Wikipedia / CC BY-ND 2.0
HUMAN COMPUTER TAPE
HUMAN ROCK
HUMAN INK PAPER
COMPUTERS NEED PEOPLE TO WORK USDAgov, Flickr / CC BY 2.0
HUMAN COMPUTER TAPE
11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010 01010110 01010011
==
THROUGHPUT BECOMES IMPORTANT Zane Luke, Flickr / CC BY-ND 2.0
LAZ0R B3AMS CHANGE EVERYTHING!! Jeff Kubina, Flickr / CC-BY-SA 2.0
HARD DRIVES ARE TOTALLY BETTER
amazing spinny hard drives sucky stupid tape slow
EVERYTHING GETS MESSY Rob!, Flickr / CC BY 2.0
(diagram: a directory tree with entries aa, ab, ac, ba, bb, bc, da, db, dc pointing at blocks of bits)
file
owner: rturk
created: aug12
last viewed: aug17
size: 42025
perms: 644
11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010
WE OUTGROW THE HARD DRIVE Mr. T in DC, Flickr / CC BY 2.0
(diagram: one HUMAN, one COMPUTER, many DISKs)
PEOPLE NEED SIMULTANEOUS ACCESS wFourier, Flickr / CC BY 2.0
(diagram: several HUMANs sharing one (COMPUTER) with many DISKs)
(diagram: a crowd of HUMANs; actually more like this…)
(diagram: three HUMANs in front of twelve DISK COMPUTER nodes)
(diagram: the same directory tree, marked X)
object
pace: quick
driver: frog
license: expired
expression: agog
11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010
(diagram: an APP in front of twelve DISK COMPUTER nodes)
(diagram: a cluster of DISK COMPUTER nodes alongside three VMs)
STORAGE THROUGHOUT HISTORY Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is.
Painting
Writing
Computers
Shared storage
Distributed storage
Cloud computing
Ceph
(diagram: three HUMANs in front of a growing cluster of DISK COMPUTER nodes, abbreviated D C in later diagrams)
STORAGE APPLIANCES Michael Moll, Wikipedia / CC BY-SA 2.0
6.4 MILLION SQFT OF FACTORIES Dude94111, Flickr / CC BY 2.0
STORAGE VENDORS HAVE BIG BILLS CarbonNYC, Flickr / CC BY 2.0
STORAGE APPLIANCES ARE EXPENSIVE 401K 2012, Flickr / CC BY-SA 2.0
TECHNOLOGY IS A COMMODITY RaeAllen, Flickr / CC-BY 2.0
COMMODITY PRICES FLUCTUATE (chart: May-07 to May-12)
GROWING WITH HARDWARE APPLIANCES
§ First PB
§ Proprietary storage hardware
§ Well-known storage vendor
§ $14 b’zillion
§ Second PB
§ Proprietary storage hardware
§ Same storage vendor
§ Another $14 b’zillion
(diagram: 24 DC nodes)
APPLIANCES ARE OLD TECHNOLOGY Paul Keller, Flickr / CC BY 2.0
Source: http://www.cpubenchmark.net/high_end_cpus.html
FLAGSHIP HARDWARE APPLIANCE
Hardware Appliances are Mysterious Black Boxes Abode of Chaos, Flickr / CC BY 2.0
(diagram: twelve DC nodes and a block of C++, marked X)
HUMAN [DEVELOPER]
!!
THE WORLD NEEDS
A STORAGE TECHNOLOGY THAT
SCALES INFINITELY
THE WORLD NEEDS
A STORAGE TECHNOLOGY THAT DOESN’T REQUIRE
AN INDUSTRIAL
MANUFACTURING PROCESS
SAGE WEIL
§ Co-founder of DreamHost
§ Inventor of Ceph
§ CEO of Inktank
OPEN SOURCE
philosophy design
OPEN SOURCE SPREADS IDEAS orchidgalore, Flickr / CC BY 2.0
OPEN SOURCE
COMMUNITY-FOCUSED
philosophy design
WE ARE SMARTER TOGETHER rturk, Linkedin Inmap
CEPH BELONGS TO ALL OF US wackybadger, Flickr / CC BY 2.0
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
philosophy design
CEPH IS BUILT TO SCALE
Too much for a book
Too much for a drive
Too much for a computer
Too much for a room
Ceph
Too much for a cave
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
philosophy design
ARIOLIMAX CALIFORNICUS aroid, Flickr / CC BY 2.0
THE OCTOPUS (A METAPHOR) I love speaking in metaphors.
single point of failure
highly-available replicated
THE BEEHIVE (ANOTHER METAPHOR) blumenbiene, Flickr / CC BY 2.0
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
SOFTWARE BASED
philosophy design
(diagram: twelve DC nodes and a block of C++, marked ✔)
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
SOFTWARE BASED
SELF-MANAGING
philosophy design
DISKS = JUST TINY RECORD PLAYERS jon_a_ross, Flickr / CC BY 2.0
(diagram: D x 1 MILLION = 55 times / day)
IT ALL STARTED WITH A DREAM
+
NEW MONTHLY CODE COMMITS (chart: y-axis 0 to 700, x-axis 2004-06 to 2011-07)
CEPH STARTS POPPING UP!
(sorry about all the logo tampering)
RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
(diagram: each OSD sits on an FS (btrfs, xfs, ext4) on top of a DISK; monitors are marked M; a HUMAN looks on)
Monitors:
§ Maintain cluster map
§ Provide consensus for distributed decision-making
§ Must have an odd number
§ These do not serve stored objects to clients
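The odd-number advice follows from majority voting. As a minimal sketch (the function names are hypothetical, and this is plain arithmetic rather than anything from Ceph's code), the snippet below shows how many monitors must agree and how many can be lost:

```python
def quorum_size(num_monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors: int) -> int:
    """How many monitors can fail while a majority is still reachable."""
    return num_monitors - quorum_size(num_monitors)


# Print: monitor count, quorum size, failures tolerated.
for n in range(1, 6):
    print(n, quorum_size(n), tolerated_failures(n))
```

Going from three monitors to four raises the quorum from two to three but still tolerates only one failure, so the extra even-numbered monitor buys nothing.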
M
OSDs:
§ One per disk (recommended)
§ At least three in a cluster
§ Serve stored objects to clients
§ Intelligently peer to perform replication tasks
§ Supports object classes
(diagram: an APP linked with LIBRADOS talks natively to the cluster and its monitors, M)
LIBRADOS
§ Provides direct access to RADOS for applications
§ C, C++, Python, PHP, Java
§ No HTTP overhead
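To make "direct access" concrete, here is a small sketch using the Python bindings. The config path, the pool name "data", and the object name are assumptions for illustration, and `main()` only works on a host that has the `rados` module and can reach a cluster:

```python
def store_and_fetch(ioctx, name: str, payload: bytes) -> bytes:
    """Write one object and read it back through an open I/O context."""
    ioctx.write_full(name, payload)  # replace the whole object
    return ioctx.read(name, length=len(payload))


def main() -> None:
    import rados  # Python bindings shipped with Ceph

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumed path
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data")  # assumed pool name
        try:
            print(store_and_fetch(ioctx, "greeting", b"hello rados"))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Call `main()` by hand on a machine with cluster access. No HTTP server sits between the application and the storage nodes.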
(diagram: APPs speak REST to RADOSGW, which uses LIBRADOS to talk natively to the cluster and its monitors, M)
RADOS Gateway:
§ REST-based interface to RADOS
§ Supports buckets, accounting
§ Compatible with S3 and Swift applications
(diagram: a VM in a VIRTUALIZATION CONTAINER using LIBRBD and LIBRADOS; a VM moving between two CONTAINERs; a HOST using KRBD (KERNEL MODULE))
RADOS Block Device:
§ Storage of virtual disks in RADOS
§ Allows decoupling of VMs and containers
§ Live migration!
§ Images are striped across the cluster
§ Boot support in QEMU, KVM, and OpenStack Nova
§ Mount support in the Linux kernel
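"Striped across the cluster" means a virtual disk is cut into fixed-size objects that land on different storage nodes. A rough sketch of the offset arithmetic, assuming a 4 MiB object size (the slides do not state one) and hypothetical function names:

```python
from typing import List, Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # assumed: 4 MiB per object


def locate(offset: int, object_size: int = OBJECT_SIZE) -> Tuple[int, int]:
    """Map a byte offset in the image to (object index, offset inside object)."""
    return divmod(offset, object_size)


def objects_touched(offset: int, length: int,
                    object_size: int = OBJECT_SIZE) -> List[int]:
    """Indices of every object that an I/O of `length` bytes touches."""
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))
```

A large sequential read therefore fans out over many objects, and so over many disks, instead of queuing on one.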
(diagram: a CLIENT with separate data and metadata paths into the cluster; monitors are marked M)
Metadata Server
§ Manages metadata for a POSIX-compliant shared filesystem
§ Directory hierarchy
§ File metadata (owner, timestamps, mode, etc.)
§ Stores metadata in RADOS
§ Does not serve file data to clients
§ Only required for shared filesystem
WHAT MAKES CEPH UNIQUE?
HOW DO YOU FIND YOUR KEYS? azmeen, Flickr / CC BY 2.0
(diagram: an APP facing twelve D C nodes: ??)
(diagram: an APP looks up F* in a table of ranges A-G, H-N, O-T, U-Z spread over twelve D C nodes)
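The range table on this slide can be sketched as a sorted lookup. The server names are made up for illustration; the ranges are the ones on the slide:

```python
import bisect

# First letter of each range from the slide: A-G, H-N, O-T, U-Z.
RANGE_STARTS = ["A", "H", "O", "U"]
SERVERS = ["server-1", "server-2", "server-3", "server-4"]  # hypothetical names


def lookup(name: str) -> str:
    """Return the server whose letter range contains the name's first letter."""
    key = name[:1].upper()
    idx = bisect.bisect_right(RANGE_STARTS, key) - 1
    if idx < 0:
        raise KeyError(name)
    return SERVERS[idx]
```

Every client needs this table, and the table has to change whenever a range is split or a server is added.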
I ALWAYS PUT MY KEYS ON THE HOOK vitamindave, Flickr / CC BY 2.0
(diagram: an APP and twelve D C nodes)
DEAR DIARY: KEYS = IN THE KITCHEN Barnaby, Flickr / CC BY 2.0
HOW DO YOU FIND YOUR KEYS
WHEN YOUR HOUSE IS
INFINITELY BIG AND
ALWAYS CHANGING?
THE ANSWER: CRUSH!! pasukaru76, Flickr / CC SA 2.0
(diagram: a row of objects, 10 10 01 01 10 10 01 11 01 10, is mapped in two steps)
hash(object name) % num pg
CRUSH(pg, cluster state, rule set)
CRUSH
§ Pseudo-random placement algorithm
§ Ensures even distribution
§ Repeatable, deterministic
§ Rule-based configuration
§ Replica count
§ Infrastructure topology
§ Weighting
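The two steps on the CRUSH slides can be sketched in a few lines. This is not the real CRUSH algorithm: step 2 swaps in weighted rendezvous hashing as a stand-in, and it ignores rule sets and infrastructure topology. All names and the example cluster are hypothetical.

```python
import hashlib
import math
from typing import Dict, List


def stable_hash(*parts) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    digest = hashlib.sha256("/".join(str(p) for p in parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, num_pg: int) -> int:
    """Step 1 from the slide: hash(object name) % num pg."""
    return stable_hash(object_name) % num_pg


def place_pg(pg: int, osd_weights: Dict[str, float], replicas: int) -> List[str]:
    """Step 2 stand-in: choose `replicas` distinct OSDs for a placement group."""

    def score(osd: str) -> float:
        # Map the hash into the open interval (0, 1), then scale by weight
        # so that heavier OSDs win proportionally more often.
        u = ((stable_hash(pg, osd) >> 12) + 0.5) / 2**52
        return -osd_weights[osd] / math.log(u)

    return sorted(osd_weights, key=score, reverse=True)[:replicas]


cluster = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0, "osd.3": 1.0}
pg = object_to_pg("my-object", num_pg=128)
print(pg, place_pg(pg, cluster, replicas=2))
```

Any client holding the same cluster description computes the same answer, so nobody has to ask a central directory where an object lives.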
(diagram: CLIENT ??)
(diagram: a VM in a VIRTUALIZATION CONTAINER using LIBRBD and LIBRADOS, with monitors M)
HOW DO YOU SPIN UP
THOUSANDS OF VMs INSTANTLY
AND EFFICIENTLY?
(diagram: an image of 144 and four instant copies of 0 each; instant copy = 144)
(diagram: a CLIENT writes to a copy; 4 + 144 = 148)
(diagram: a CLIENT reads from the copy and the original; 4 + 144 = 148)
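The 144 / 148 arithmetic is copy-on-write. Below is a toy model, assuming the numbers count stored objects (the slides give no unit) and writing whole objects only. The class and function names are hypothetical and this is not the RBD implementation.

```python
class Image:
    """Toy image: a dict of object index -> bytes, with an optional parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.objects = {}

    def clone(self):
        """Instant copy: the child starts empty and points at its parent."""
        return Image(parent=self)

    def write(self, index, data):
        """Writes land in this image only; the parent is never modified."""
        self.objects[index] = data

    def read(self, index):
        """Read locally if present, otherwise fall through to the parent."""
        if index in self.objects:
            return self.objects[index]
        if self.parent is not None:
            return self.parent.read(index)
        return None


def stored_objects(*images):
    """Total number of objects actually stored across the given images."""
    return sum(len(image.objects) for image in images)
```

Filling a base image with 144 objects, cloning it four times, and writing 4 objects to one clone leaves 148 objects stored in total, while reads of untouched objects still come from the base.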
HOW DO YOU MANAGE
DIRECTORY HIERARCHY WITHOUT
A SINGLE POINT OF FAILURE?
FILESYSTEMS REQUIRE METADATA Barnaby, Flickr / CC BY 2.0
(diagram: a CLIENT sends data (01 10) to the cluster and metadata to three metadata servers)
one tree
three metadata servers
??
DYNAMIC SUBTREE PARTITIONING
AND NOW BACKPEDALING
ALMOST EVERYTHING
WORKS
RADOS: AWESOME
LIBRADOS: AWESOME
RBD: AWESOME
RADOSGW: AWESOME
CEPH FS: NEARLY AWESOME
LAN SCALE!! *
* OR REALLY REALLY SCARY FAST WAN
CEPH AND CLOUDSTACK tableatny, Flickr / CC BY 2.0
RBD SUPPORT IN CLOUDSTACK
§ Just announced two weeks ago!
§ Allows storage of virtual disks inside RADOS
§ Works with KVM only right now
§ No volume snapshots yet
§ Requires the latest version of, um, everything
§ More information can be found on the mailing list:
§ ceph-devel / incubator-cloudstack-dev: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505