A BigData Tour – HDFS, Ceph and MapReduce
These slides are possible thanks to these sources: Jonathan Dursi – SciNet Toronto – Hadoop Tutorial; Amir Payberah – Course in Data Intensive Computing – SICS; Yahoo! Developer Network MapReduce Tutorial
EXTRA MATERIAL
CEPH – AN HDFS REPLACEMENT
What is Ceph?
• Ceph is a distributed, highly available, unified object, block and file storage system with no single point of failure (SPOF), running on commodity hardware
ARCHITECTURAL COMPONENTS
[Diagram: applications, hosts/VMs and clients accessing the Ceph storage cluster through its different interfaces]
Ceph Architecture – Host Level
• At the host level we have Object Storage Devices (OSDs) and Monitors
• Monitors keep track of the components of the Ceph cluster (i.e. where the OSDs are)
• The device, host, rack, row, and room are stored by the Monitors and used to compute a failure domain
• OSDs store the Ceph data objects
• A host can run multiple OSDs, but it needs to be appropriately provisioned
OBJECT STORAGE DAEMONS
[Diagram: each OSD daemon sits on top of a local filesystem (btrfs, xfs or ext4) on its own disk]
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – Block Level
• At the block device level, an Object Storage Device (OSD) can be an entire drive, a partition, or a folder
• OSDs must be formatted in ext4, XFS, or btrfs (experimental)
connect.linaro.org
Lightning Introduction to Ceph Architecture (2)
[Diagram: several drives/partitions, each with a filesystem and an OSD on top, grouped into pools]
https://hkg15.pathable.com/static/attachments/112267/1423597913.pdf?1423597913
Ceph Architecture – Data Organization Level
• At the data organization level, data are partitioned into pools
• Pools contain a number of Placement Groups (PGs)
• Ceph data objects map to PGs (via a modulo of a hash of the object name)
• PGs then map to multiple OSDs (a placement sketch follows the diagram below)
connect.linaro.org
Lightning Introduction to Ceph Architecture (3)
[Diagram: a pool "mydata" whose objects are grouped into PG #1 and PG #2, each PG mapped onto several OSDs]
https://hkg15.pathable.com/static/attachments/112267/1423597913.pdf?1423597913
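A minimal sketch of the two mapping steps above. Ceph's real implementation uses its own hash function and CRUSH, so the hash, the PG count and the OSD selection here are purely illustrative:

import hashlib

PG_NUM = 8                    # placement groups in the pool (illustrative)
OSDS = list(range(6))         # a toy cluster with six OSDs
REPLICAS = 3

def object_to_pg(name: str) -> int:
    # Ceph-style idea: hash the object name, then take it modulo the PG count
    h = int.from_bytes(hashlib.sha1(name.encode()).digest()[:4], "little")
    return h % PG_NUM

def pg_to_osds(pg: int) -> list:
    # Stand-in for CRUSH: deterministically pick REPLICAS distinct OSDs for a PG
    ranked = sorted(OSDS, key=lambda osd: hashlib.sha1(f"{pg}-{osd}".encode()).digest())
    return ranked[:REPLICAS]

pg = object_to_pg("mydata/object-42")
print(pg, pg_to_osds(pg))     # the PG the object lands in, and its acting OSDs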
Ceph Placement Groups
• Ceph shards a pool into placement groups, distributed evenly and pseudo-randomly across the cluster
• The CRUSH algorithm dynamically assigns each object to a placement group, and each placement group to a set of OSDs, creating a layer of indirection between the Ceph client and the OSDs storing the copies of an object
• This layer of indirection allows the Ceph storage cluster to re-balance dynamically when new Ceph OSDs come online or when Ceph OSDs fail, so the storage cluster can grow, shrink and recover from failure efficiently
[Diagram: CRUSH assigning objects to placement groups, and placement groups to OSDs]
• If a pool has too few placement groups relative to the overall cluster size, Ceph will have too much data per placement group and won't perform well; if it has too many, Ceph OSDs will use too much RAM and CPU and won't perform well
• Setting an appropriate number of placement groups per pool, and an upper limit on the number of placement groups assigned to each OSD in the cluster, is critical to Ceph performance (a rough sizing sketch follows below)
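A commonly cited community rule of thumb (not an official formula) targets on the order of 100 placement groups per OSD, divided by the replica count and rounded up to a power of two. A minimal sketch:

def suggested_pg_count(num_osds: int, replica_size: int, target_pgs_per_osd: int = 100) -> int:
    # Heuristic: ~target PGs per OSD, shared across replicas, rounded up to a power of two
    raw = (num_osds * target_pgs_per_osd) / replica_size
    power = 1
    while power < raw:
        power *= 2
    return power

print(suggested_pg_count(num_osds=12, replica_size=3))   # -> 512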
CRUSH
Ceph assigns a CRUSH ruleset to a pool. When a Ceph client stores or retrieves data in a pool, Ceph identifies the CRUSH ruleset, a rule and the top-level bucket in the rule for storing and retrieving data. As Ceph processes the CRUSH rule, it identifies the primary OSD that contains the placement group for an object. That enables the client to connect directly to the OSD and read or write data.
To map placement groups to OSDs, a CRUSH map defines a hierarchical list of bucket types (i.e., under types in the generated CRUSH map). The purpose of creating a bucket hierarchy is to segregate the leaf nodes by their failure domains and/or performance domains, such as drive type, hosts, chassis, racks, power distribution units, pods, rows, rooms, and data centers. With the exception of the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and you may define it according to your own needs if the default types don't suit your requirements.
CRUSH supports a directed acyclic graph that models your Ceph OSD nodes, typically in a hierarchy, so you can support multiple hierarchies with multiple root nodes in a single CRUSH map. For example, you can create a hierarchy of SSDs for a cache tier, a hierarchy of hard drives with SSD journals, etc. (a toy hierarchy sketch follows below).
RedHat Ceph Architecture v1.2.3
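As a hedged illustration of the hierarchy idea (all bucket, host and OSD names below are invented), a CRUSH map with two roots can be pictured as nested containers whose leaves are OSDs:

# Toy model of a CRUSH bucket hierarchy with two roots: one for an SSD cache
# tier and one for an HDD capacity tier (names invented for illustration).
crush_hierarchy = {
    "ssd-root": {
        "rack-a": {
            "host-ssd-1": ["osd.0", "osd.1"],
            "host-ssd-2": ["osd.2", "osd.3"],
        },
    },
    "hdd-root": {
        "rack-a": {"host-hdd-1": ["osd.4", "osd.5"]},
        "rack-b": {"host-hdd-2": ["osd.6", "osd.7"]},
    },
}
# Only the leaves are OSDs; the intermediate levels (host, rack, root) exist so
# that CRUSH rules can spread replicas across separate failure domains.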
Ceph Architecture – Overall View
[Diagram: the RADOS object store spans cluster nodes running OSDs, coordinated by Monitors (MON.1 … MON.n) and, for CephFS, Metadata Servers (MDS.1 … MDS.n). Data is organized into pools (Pool 1 … Pool n), each split into placement groups (PG 1 … PG n) and mapped onto the cluster nodes via the CRUSH map. On top of RADOS, librados gives applications direct access, RadosGW exposes S3/Swift object APIs, RBD provides block devices to hosts/VMs, and CephFS provides a filesystem to clients]
https://www.terena.org/activities/tf-storage/ws16/slides/140210-low_cost_storage_ceph-openstack_swift.pdf
Ceph Architecture – RADOS
• An application interacts with a RADOS cluster
• RADOS (Reliable Autonomic Distributed Object Store) is a distributed object service that manages the distribution, replication, and migration of objects
• On top of that reliable storage abstraction, Ceph builds a range of services, including a block storage abstraction (RBD, or RADOS Block Device) and a cache-coherent distributed file system (CephFS)
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – RADOS Components
OSDs:
• 10s to 10,000s in a cluster
• One per disk (or one per SSD, RAID group, …)
• Serve stored objects to clients
• Intelligently peer for replication & recovery
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• Do not serve stored objects to clients
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – Where Do Objects Live?
[Diagram: a client holding an object, with a question mark over which node in the cluster should store it]
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – Where Do Objects Live?
• Option 1: contact a metadata server?
[Diagram: the client (1) asks a central metadata server where the object lives, then (2) reads or writes it on the indicated node]
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – Where Do Objects Live?
• Option 2: calculate the placement via a static mapping?
[Diagram: object names statically partitioned across nodes by name range, e.g. A–G, H–N, O–T, U–Z]
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – CRUSH Maps: Even Better – CRUSH!*
[Diagram: the client computes object placement itself and talks directly to the RADOS cluster]
*) Controlled Replication Under Scalable Hashing
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – CRUSH Maps
• Data objects are distributed across Object Storage Devices (OSDs), which refer to either physical or logical storage units, using CRUSH (Controlled Replication Under Scalable Hashing)
• CRUSH is a deterministic hashing function that allows administrators to define flexible placement policies over a hierarchical cluster structure (e.g., disks, hosts, racks, rows, datacenters)
• The location of objects can be calculated from the object identifier and the cluster layout (similar to consistent hashing), so there is no need for a metadata index or server for the RADOS object store (see the placement sketch below)
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
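CRUSH itself also takes the bucket hierarchy, rules and weights into account, but the core "calculate, don't look up" idea can be sketched with rendezvous (highest-random-weight) hashing. This is an illustration of the principle, not the real CRUSH algorithm:

import hashlib

def score(obj: str, osd: str) -> int:
    # Deterministic pseudo-random score for an (object, OSD) pair
    return int.from_bytes(hashlib.sha1(f"{obj}:{osd}".encode()).digest()[:8], "big")

def place(obj: str, osds: list, replicas: int = 3) -> list:
    # Pick the top-scoring OSDs for the object: no lookup table, fully repeatable
    return sorted(osds, key=lambda osd: score(obj, osd), reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(8)]
before = place("rbd_data.1234", osds)
after = place("rbd_data.1234", osds + ["osd.8"])   # adding an OSD changes few placements
print(before, after)

Like CRUSH, the mapping is repeatable and statistically uniform, and adding or removing a device only moves the objects whose top-scoring set actually changes.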
Ceph Architecture – CRUSH – 1/2: CRUSH is a Quick Calculation
[Diagram: the client hashes the object name and runs CRUSH locally to determine which OSDs in the RADOS cluster hold it, with no lookup service involved]
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – CRUSH – 2/2: CRUSH – Dynamic Data Placement
CRUSH:
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
• Limited data migration on change
• Rule-based configuration
• Infrastructure topology aware
• Adjustable replication
• Weighting
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – librados: RADOS Access for Apps
LIBRADOS:
• Direct access to RADOS for applications
• Bindings for C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes over a native socket, with no HTTP overhead
[Diagram: an application linking librados and talking to the RADOS cluster directly over a socket]
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
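As a hedged illustration of what "direct access to RADOS" looks like from an application, here is a minimal python-rados sketch; the configuration path, pool name and object name are assumptions for the example:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")     # config path is an assumption
cluster.connect()

ioctx = cluster.open_ioctx("mypool")                      # I/O context bound to one pool
ioctx.write_full("hello_object", b"hello from librados")  # store an object
print(ioctx.read("hello_object"))                         # read it back, no HTTP in between

ioctx.close()
cluster.shutdown()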
Ceph Architecture – RADOS Gateway
[Diagram: applications speak REST to RADOSGW, which in turn talks to the RADOS cluster over the native socket]
RADOSGW:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets and accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
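Because RADOSGW speaks the S3 API, a generic S3 client can talk to it. A hedged sketch using boto3, where the endpoint URL, credentials and bucket name are placeholders for your own gateway setup:

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",   # placeholder RADOSGW endpoint
    aws_access_key_id="ACCESS_KEY",               # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"stored via RADOSGW")
obj = s3.get_object(Bucket="demo-bucket", Key="hello.txt")
print(obj["Body"].read())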
Ceph Architecture – RADOS Block Device (RBD) – 1/3: RBD Stores Virtual Disks
RADOS BLOCK DEVICE:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in: mainline Linux kernel (2.6.39+); Qemu/KVM, native Xen coming soon; OpenStack, CloudStack, Nebula, Proxmox
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
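To make the block-device abstraction concrete, a small python-rbd sketch (pool and image names are illustrative); in practice VMs usually attach images through Qemu/KVM or the kernel module rather than this API:

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")                     # pool holding the disk images

rbd.RBD().create(ioctx, "vm-disk-01", 10 * 1024**3)   # create a 10 GiB image

image = rbd.Image(ioctx, "vm-disk-01")
image.write(b"\x00" * 512, 0)                         # write the first 512-byte sector
first_sector = image.read(0, 512)                     # read it back
image.close()

ioctx.close()
cluster.shutdown()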
Ceph Architecture – RADOS Block Device (RBD) – 2/3
• Virtual machine storage using RBD
• Live migration using RBD
[Diagrams: VM disk images stored in the RADOS cluster, separating compute from storage so a VM can be live-migrated between hosts while its disk stays in the cluster]
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – RADOS Block Device (RBD) – 3/3
• Direct host access from Linux via the RBD kernel module
[Diagram: a Linux host mapping an RBD image directly from the RADOS cluster through the kernel module]
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – CephFS – POSIX Filesystem
[Diagram: clients send file data to the RADOS cluster and metadata operations to a separate, scalable Metadata Server (MDS) tier]
METADATA SERVER:
• Manages metadata for a POSIX-compliant shared filesystem: directory hierarchy, file metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for the shared filesystem
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph – Read/Write Flows
https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction
Ceph Replicated I/O
• Ceph OSD daemons perform data replication on behalf of Ceph clients, relieving clients of that duty while ensuring high data availability and data safety (a toy sketch of this flow follows below)
• Note: the primary OSD and the secondary OSDs are typically configured to be in separate failure domains (i.e., rows, racks, nodes, etc.). CRUSH computes the ID(s) of the secondary OSD(s) with consideration for the failure domains
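As a toy model (not Ceph code) of the flow described above, the sketch below has a "primary OSD" persist a write and fan it out to the secondaries before acknowledging the client; the OSD names and in-memory store are invented for illustration:

def replicated_write(obj: str, data: bytes, acting_set: list, store: dict) -> str:
    # The client writes only to the primary OSD, which forwards the object to
    # the secondary OSDs on the client's behalf before acknowledging
    primary, *secondaries = acting_set            # e.g. ["osd.3", "osd.7", "osd.1"]
    store.setdefault(primary, {})[obj] = data     # primary persists its copy
    for osd in secondaries:
        store.setdefault(osd, {})[obj] = data     # replicas persisted by the primary
    return "ack"                                  # client is acked once all copies are safe

cluster_store = {}
print(replicated_write("NYAN", b"ABCDEFGHI", ["osd.3", "osd.7", "osd.1"], cluster_store))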
Erasure-coded I/O
Like replicated pools, in an erasure-coded pool the primary OSD in the up set receives all write operations. In replicated pools, Ceph makes a deep copy of each object in the placement group on the secondary OSD(s) in the set. For erasure coding, the process is a bit different: an erasure-coded pool stores each object as K+M chunks, divided into K data chunks and M coding chunks. The pool is configured to have a size of K+M so that each chunk is stored on an OSD in the acting set. The rank of the chunk is stored as an attribute of the object. The primary OSD is responsible for encoding the payload into K+M chunks and sending them to the other OSDs; it is also responsible for maintaining an authoritative version of the placement group logs.
For instance, an erasure-coded pool is created to use five OSDs (K+M = 5) and sustain the loss of two of them (M = 2). When the object NYAN containing ABCDEFGHI is written to the pool, the erasure encoding function splits the content into three data chunks simply by dividing the content in three: the first contains ABC, the second DEF and the last GHI. The content will be padded if the content length is not a multiple of K.
RedHat Ceph Architecture v1.2.3
Ceph – Erasure Coding – 1/5
• Erasure coding is a theory that started in the 1960s. The most famous algorithm is Reed-Solomon; many variations came out later, such as Fountain Codes, Pyramid Codes and Locally Repairable Codes
• An erasure code is usually defined by the total number of disks (N) and the number of data disks (K); it can tolerate N – K failures with a storage overhead of N/K
• E.g. a typical Reed-Solomon scheme is RS(8, 5), where 8 is the total number of disks and 5 the number of data disks, giving 5 data chunks and 3 parity chunks per stripe
• RS(8, 5) can tolerate 3 arbitrary failures: if some data chunks are missing, the rest of the available data can be used to restore the original content (see the sketch below)
https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction
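A tiny sketch of the arithmetic in the bullets above, using the N and K from the examples:

def ec_profile(n_total: int, k_data: int) -> dict:
    # Failures tolerated and raw-storage overhead for an (N, K) erasure code
    return {"tolerated_failures": n_total - k_data, "overhead": n_total / k_data}

print(ec_profile(8, 5))   # RS(8, 5): 3 failures tolerated at 1.6x raw storage
print(ec_profile(5, 3))   # the K=3, M=2 Ceph example below: 2 failures at ~1.67x
# Compare with 3-way replication: 2 failures tolerated at 3.0x raw storage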
Ceph – Erasure Coding – 2/5
• Like replicated pools, in an erasure-coded pool the primary OSD in the up set receives all write operations
• In replicated pools, Ceph makes a deep copy of each object in the placement group on the secondary OSD(s) in the set
• For erasure coding, the process is a bit different: an erasure-coded pool stores each object as K+M chunks, divided into K data chunks and M coding chunks; the pool is configured to have a size of K+M so that each chunk is stored in an OSD in the acting set
• The rank of the chunk is stored as an attribute of the object; the primary OSD is responsible for encoding the payload into K+M chunks, sending them to the other OSDs, and maintaining an authoritative version of the placement group logs
https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction
Ceph – Erasure Coding – 3/5
• 5 OSDs (K+M = 5); sustain the loss of 2 (M = 2)
• Object NYAN with data "ABCDEFGHI" is split into 3 data chunks (ABC, DEF and GHI); the content is padded if its length is not a multiple of K
• The encoding function also creates two coding chunks: the fourth (YXY) and the fifth (QGC)
• Each chunk is stored in an OSD in the acting set. The chunks are stored in objects that have the same name (NYAN) but reside on different OSDs; the order in which the chunks were created must be preserved and is stored as an attribute of the object (shard_t), in addition to its name. For example, chunk 1 containing ABC is stored on OSD5, while chunk 4 containing YXY is stored on OSD3 (see the encode/decode sketch after the next slide)
Ceph – Erasure Coding – 4/5
• When the object NYAN is read from the erasure-coded pool, the decoding function reads three chunks: chunk 1 containing ABC, chunk 3 containing GHI and chunk 4 containing YXY, and rebuilds the original content of the object, ABCDEFGHI
• The decoding function is informed that chunks 2 and 5 are missing (they are called erasures): chunk 5 could not be read because OSD4 is out, and the decoding function can be called as soon as three chunks are read, so OSD2, which was the slowest, is simply not taken into account
• More generally, as long as any K of the K+M chunks are readable, the decoding function can rebuild the original content (see the sketch below)
RedHat Ceph Architecture v1.2.3
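To illustrate the chunk/pad/repair idea, here is a simplified sketch with K = 3 data chunks and a single XOR parity chunk. This is not Ceph's actual Reed-Solomon plugin (which, with M = 2, survives two losses); XOR parity can repair only one erasure:

from functools import reduce

def xor_bytes(*chunks):
    # Byte-wise XOR of equal-length chunks
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def encode(payload: bytes, k: int = 3):
    # Split payload into k padded data chunks plus one XOR parity chunk (M = 1)
    chunk_len = -(-len(payload) // k)             # ceiling division
    padded = payload.ljust(k * chunk_len, b"\0")  # pad to a multiple of k, as Ceph does
    data = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    return data + [xor_bytes(*data)]              # chunks 1..k are data, chunk k+1 is parity

def rebuild(chunks):
    # Rebuild a single missing chunk (marked None) from the surviving ones
    missing = [i for i, c in enumerate(chunks) if c is None]
    assert len(missing) == 1, "XOR parity can repair at most one erasure"
    survivors = [c for c in chunks if c is not None]
    repaired = list(chunks)
    repaired[missing[0]] = xor_bytes(*survivors)
    return repaired

chunks = encode(b"ABCDEFGHI")     # -> [b"ABC", b"DEF", b"GHI", parity]
chunks[1] = None                  # simulate losing chunk 2 (an OSD failure)
restored = rebuild(chunks)
assert b"".join(restored[:3]) == b"ABCDEFGHI"   # original content recovered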
Ceph – Cache-Tier I/O
• A cache tier provides Ceph clients with better I/O performance for a subset of the data stored in a backing storage tier
• Cache tiering involves creating a pool of relatively fast/expensive storage devices (e.g., solid state drives) configured to act as a cache tier, and a backing pool of either erasure-coded or relatively slower/cheaper devices configured to act as an economical storage tier
RedHat Ceph Architecture v1.2.3