using openstack swift for extreme data durability
Post on 13-Jul-2015
Using OpenStack Swift for extreme data durability
Florent Flament, Cloudwatt
Christian Schwede, eNovance
OpenStack Summit Paris, November 2014
Intro - Cloudwatt
● Florent Flament
● Dev & Fireman @ Cloudwatt
● Fixing & tuning of OpenStack (Cinder, Keystone, Nova, Swift)
● Email: florent.flament@cloudwatt.com
● IRC: florentflament on #openstack-dev (Freenode)
● Twitter: @florentflament_
● Blogs: http://dev.cloudwatt.com / http://www.florentflament.com
Intro - eNovance
● Christian Schwede
● Developer @ eNovance / Red Hat
● Mostly working on Swift, testing, automation and developer tools
● Swift Core
● IRC: cschwede in #openstack-swift
● christian@enovance.com / cschwede@redhat.com
● Twitter: @cschwede_de
[Diagram: proxy nodes connected through the network to the storage disks]
[Diagram: the same cluster with the disks grouped into Zone 0, Zone 1 and Zone 2]
[Diagram: the same cluster split into regions: Zone 0 and Zone 1 in Region 0 (⅔ of the data), Zone 2 in Region 1 (⅓ of the data)]
Ring: the Map of data
● One file per type of data. Ring files map each copy of a piece of
data to a physical device through partitions.
● An object’s partition number is computed from the hash
of the object’s name.
● A Ring file is: a (replica, partition) to device ID table, a
devices table and a number of hash bits.
● Visualize a Ring: https://github.com/victorlin/swiftsense
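The hash-to-partition step described above can be sketched in a few lines of Python. This is a simplified illustration, not Swift's actual code: Swift also mixes a cluster-wide secret prefix/suffix into the hashed path, which is omitted here, and the object names and the `partition_for` helper are made up for the example.

```python
import hashlib
import struct

PART_POWER = 3  # number of hash bits kept; 2**3 = 8 partitions


def partition_for(name: str, part_power: int = PART_POWER) -> int:
    """Map a name to a partition number using the top bits of its MD5 hash."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    # Read the first 4 bytes as a big-endian unsigned 32-bit integer,
    # then keep only the `part_power` most significant bits.
    top32 = struct.unpack_from(">I", digest)[0]
    return top32 >> (32 - part_power)


if __name__ == "__main__":
    # Hypothetical object paths, for illustration only.
    for name in ("/AUTH_demo/photos/cat.jpg", "/AUTH_demo/photos/dog.jpg"):
        print(name, "-> partition", partition_for(name))
```

Because only the top bits of the hash are used, the same name always lands in the same partition, and names spread roughly evenly over the 2^part_power partitions.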
Concrete example of Ring
Replica & Partition to Device ID table

Partition number   0  1  2  3  4  5  6  7
Replica 0          0  1  2  3  0  1  2  3
Replica 1          1  2  3  0  1  2  3  0
Replica 2          2  3  0  1  2  3  0  1

Devices table

ID  Host          Port  Device
0   192.168.0.10  6000  sdb1
1   192.168.0.10  6000  sdc1
2   192.168.0.11  6000  sdb1
3   192.168.0.11  6000  sdc1

Bit count (partition power) = 3 → 2^3 = 8 partitions
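As a sketch of how this example Ring is read, the two tables can be written as plain Python structures and queried for the devices holding a given partition. The variable and function names are illustrative, not Swift's internal ones.

```python
# Replica & partition to device ID table from the example:
# one row per replica, one column per partition.
REPLICA_PART_TO_DEV = [
    [0, 1, 2, 3, 0, 1, 2, 3],  # replica 0
    [1, 2, 3, 0, 1, 2, 3, 0],  # replica 1
    [2, 3, 0, 1, 2, 3, 0, 1],  # replica 2
]

# Devices table from the example, indexed by device ID.
DEVICES = [
    {"id": 0, "host": "192.168.0.10", "port": 6000, "device": "sdb1"},
    {"id": 1, "host": "192.168.0.10", "port": 6000, "device": "sdc1"},
    {"id": 2, "host": "192.168.0.11", "port": 6000, "device": "sdb1"},
    {"id": 3, "host": "192.168.0.11", "port": 6000, "device": "sdc1"},
]

PART_POWER = 3  # 2**3 = 8 partitions


def devices_for_partition(partition: int) -> list:
    """Return one device per replica for the given partition number."""
    if not 0 <= partition < 2 ** PART_POWER:
        raise ValueError(f"partition {partition} out of range")
    return [DEVICES[row[partition]] for row in REPLICA_PART_TO_DEV]


if __name__ == "__main__":
    for dev in devices_for_partition(5):
        print(dev["host"], dev["port"], dev["device"])
```

In this table every partition is placed on three different devices, so losing a single disk never removes more than one replica of any object.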
Storage policies
● Included in the Juno release (Swift ≥ 2.0.0)
● Applied on a per-container basis
● Flexibility to use multiple rings, for example:
○ Basic: 2 replicas on spinning disks, single datacenter
○ Strong: 3 replicas in three different datacenters around the globe
○ Fast: 3 replicas on SSDs and much more powerful proxies
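A minimal sketch of how the three example policies could be declared and mapped to their rings. It assumes the `[storage-policy:N]` section format of `swift.conf` and the convention that policy 0 uses `object.ring.gz` while policy N uses `object-N.ring.gz`; the sample text and the `load_policies` helper are illustrative and do not describe a real cluster.

```python
import configparser

# Illustrative swift.conf fragment declaring the three example policies.
SAMPLE_SWIFT_CONF = """
[storage-policy:0]
name = basic
default = yes

[storage-policy:1]
name = strong

[storage-policy:2]
name = fast
"""


def load_policies(conf_text: str) -> dict:
    """Parse storage policy sections and derive each policy's object ring file."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    policies = {}
    for section in parser.sections():
        if not section.startswith("storage-policy:"):
            continue
        index = int(section.split(":", 1)[1])
        # Policy 0 keeps the legacy ring name; other policies get a suffix.
        ring_file = "object.ring.gz" if index == 0 else f"object-{index}.ring.gz"
        policies[parser[section]["name"]] = {
            "index": index,
            "default": parser[section].getboolean("default", fallback=False),
            "ring_file": ring_file,
        }
    return policies


if __name__ == "__main__":
    for name, info in load_policies(SAMPLE_SWIFT_CONF).items():
        print(name, info)
```

A container would then select one of these policies at creation time through the `X-Storage-Policy` request header, which is what makes the choice per-container.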
Object durability
● Disk failures: p_d ≈ 2-5% per year
● Unrecoverable bit read errors: p_b = 10^-15 ⋅ 8 ⋅ objectsize
[Diagram: 3 replicas → 2 replicas → 1 replica → data loss. Each failure removes one replica; replication restores it.]
● Durability in the range of 10 to 11 nines with 3 replicas (99.99999999%)
● http://enovance.github.io/swift-durability-calculator/
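To give a feel for the orders of magnitude, here is a deliberately rough sketch. It is not the model used by the calculator linked above. It assumes the object size is given in bytes, that disk failures are independent, and that a lost replica is restored within a fixed repair window (`repair_hours`, an assumed parameter); it also ignores which replica fails first.

```python
import math

HOURS_PER_YEAR = 24 * 365


def bit_error_probability(object_size_bytes: float, bit_error_rate: float = 1e-15) -> float:
    """p_b = bit_error_rate * 8 * object size (8 bits per byte)."""
    return bit_error_rate * 8 * object_size_bytes


def annual_loss_probability(p_disk: float, replicas: int, repair_hours: float) -> float:
    """Rough yearly chance that every replica of an object is lost."""
    # Chance that one given disk fails during a single repair window.
    p_window = p_disk * repair_hours / HOURS_PER_YEAR
    # One replica fails at some point in the year; each remaining replica
    # must then fail before replication has restored the lost copy.
    return p_disk * p_window ** (replicas - 1)


def nines(loss_probability: float) -> float:
    """Express durability as a number of nines (1e-10 loss -> 10 nines)."""
    return -math.log10(loss_probability)


if __name__ == "__main__":
    for p_disk in (0.02, 0.05):
        loss = annual_loss_probability(p_disk, replicas=3, repair_hours=24)
        print(f"p_d={p_disk:.0%}: about {nines(loss):.1f} nines")
    print("p_b for a 1 MB object:", bit_error_probability(1_000_000))
```

The point of the sketch is the shape of the formula: each extra replica multiplies the loss probability by the small chance of a failure inside the repair window, which is why fast replication matters as much as the replica count.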
Object availability & durability
[Diagram: Zone 0 and Zone 1 in Region 0 (⅔ of the data), Zone 2 in Region 1 (⅓ of the data), each zone holding several disks]
Maintainability by Simplicity
● Standalone `swift-ring-builder` tool to manipulate the Ring
○ Uses builder files to keep architectural information on the cluster
○ Smartly assigns partitions to devices
○ Generates Ring files that are easy to check
● Processes on Swift nodes focus on ensuring that files are stored
uncorrupted at the appropriate location
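As an illustration of the tool's workflow, the sketch below only prints the `swift-ring-builder` commands (`create`, `add`, `rebalance`) that would build a small ring with four devices; it does not run them. The builder file name, region and zone numbers, weights and the `min_part_hours` value are assumptions chosen for the example.

```python
import shlex

BUILDER = "object.builder"  # assumed builder file name
PART_POWER = 3              # 2**3 = 8 partitions
REPLICAS = 3
MIN_PART_HOURS = 1          # assumed: minimum hours between moves of a partition

# (region, zone, ip, port, device, weight); region, zone and weight are assumptions.
DEVICES = [
    (1, 1, "192.168.0.10", 6000, "sdb1", 100),
    (1, 1, "192.168.0.10", 6000, "sdc1", 100),
    (1, 2, "192.168.0.11", 6000, "sdb1", 100),
    (1, 2, "192.168.0.11", 6000, "sdc1", 100),
]


def ring_builder_commands() -> list:
    """Return the command lines, as argument lists, without executing them."""
    cmds = [["swift-ring-builder", BUILDER, "create",
             str(PART_POWER), str(REPLICAS), str(MIN_PART_HOURS)]]
    for region, zone, ip, port, dev, weight in DEVICES:
        cmds.append(["swift-ring-builder", BUILDER, "add",
                     f"r{region}z{zone}-{ip}:{port}/{dev}", str(weight)])
    cmds.append(["swift-ring-builder", BUILDER, "rebalance"])
    return cmds


if __name__ == "__main__":
    for cmd in ring_builder_commands():
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review a planned change to the builder file before anything is applied to the cluster.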
Splitting a running Swift Cluster
● Ensuring no data is lost
○ Move only 1 replica at a time
○ Small steps to limit the impact
○ Check for data corruption
○ Check data location
○ Rollback in case of failure
● Limiting the impact on performance
○ Availability of cluster resources
○ Load incurred by cluster being split
○ Small steps to limit the impact
○ Control nodes accessed by users
Natively available in Swift
Small steps: new in Swift 2.2!
Adding a new region
Add a new region smoothly by limiting the amount of data moved
● Really possible since Swift 2.2
● Final weight in new region should be at least ⅓ of the total cluster weight
Example of process:
1. Add devices to new region with a very low weight
2. Increase devices’ weights to store 5% of data in the new region
3. Progressively increase by steps of 5% the amount of data in the new region
More details: http://www.florentflament.com/blog/splitting-swift-cluster.html
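The 5% steps of this process can be turned into target weights with a little arithmetic. The sketch assumes that the share of data held by a region is proportional to its share of the total cluster weight, which is only an approximation of what the ring builder does; the function names and the example weight of 1900 are hypothetical.

```python
def new_region_weight(existing_weight: float, target_fraction: float) -> float:
    """Total weight for the new region so that it holds `target_fraction` of the data."""
    if not 0 <= target_fraction < 1:
        raise ValueError("target_fraction must be in [0, 1)")
    # Solve w / (existing_weight + w) = target_fraction for w.
    return existing_weight * target_fraction / (1 - target_fraction)


def weight_plan(existing_weight: float, step_percent: int = 5,
                final_fraction: float = 1 / 3) -> list:
    """List (fraction, new region weight) pairs, one per step, ending at final_fraction."""
    last_percent = int(final_fraction * 100)
    fractions = [p / 100 for p in range(step_percent, last_percent + 1, step_percent)]
    if not fractions or fractions[-1] < final_fraction:
        fractions.append(final_fraction)
    return [(f, new_region_weight(existing_weight, f)) for f in fractions]


if __name__ == "__main__":
    # Hypothetical existing cluster with a total weight of 1900.
    for fraction, weight in weight_plan(1900):
        print(f"{fraction:6.1%} of data -> new region weight {weight:7.1f}")
```

After each step the ring is rebalanced and distributed, and the next step starts only once replication has settled and the data has been checked.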
Erasure coding
● Coming real soon now
● Instead of N copies of each object:
○ apply EC to object, split into multiple fragments, for example 14
○ store them on different disks/nodes
○ objects can be rebuilt from 10 fragments
■ Tolerates loss of 4 fragments: higher durability
■ Only ~40% overhead (compared to 200%): much cheaper
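The overhead figures above follow directly from the fragment counts. A small worked check, assuming the 14 fragments are 10 data fragments plus 4 extra ones and that "overhead" means extra stored data relative to the original object size:

```python
from fractions import Fraction


def storage_overhead(data_units: int, extra_units: int) -> Fraction:
    """Extra storage relative to the original object size."""
    return Fraction(extra_units, data_units)


# 3 replicas: 1 original copy plus 2 extra copies, survives the loss of 2 copies.
replication = storage_overhead(data_units=1, extra_units=2)

# 14 fragments, any 10 of which rebuild the object: survives the loss of 4 fragments.
erasure_coding = storage_overhead(data_units=10, extra_units=4)

if __name__ == "__main__":
    print(f"3 replicas:   {float(replication):.0%} overhead")
    print(f"10+4 EC:      {float(erasure_coding):.0%} overhead")
```

So the erasure-coded layout tolerates more simultaneous losses (4 instead of 2) while storing 1.4 times the object size instead of 3 times.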
Durability calculation
● More detailed calculation
○ Number of disks, servers, partitions
● Add erasure coding
● Include in Swift documentation?
● Community effort
○ Discussion started last Swift hackathon
■ NTT, SwiftStack, IBM, Seagate, Red Hat / eNovance
○ Ad-Hoc session on Thursday/Friday - join us!
Summary
● High availability, even if large parts of the cluster are not accessible
● Automatic failure correction ensures high durability and, depending on
your cluster configuration, exceeds known industry standards
● Swift 2.2 (Juno release)
○ Even smoother and more predictable cluster upgrades
○ Storage Policies allow fine-grained data placement control
● Erasure Coding increases durability even more while lowering costs