TRANSCRIPT
2012 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.
Erasure Coding in Windows Azure Storage
Cheng Huang Microsoft Corporation
Joint work with Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin
Outline
Introduction to Windows Azure Storage (WAS)
Conventional Erasure Coding in WAS
Local Reconstruction Coding in WAS
[World map: major datacenters and CDN PoPs across the North America, Europe, and Asia Pacific regions]
Windows Azure Storage
Abstractions
• Blobs – file store in the cloud
• CDN – high-performance file delivery through proximity caching
• Drives – durable NTFS volumes for Windows Azure applications
• Tables – massively scalable NoSQL storage
• Queues – reliable storage and delivery of messages
Easy client access
• Easy-to-use REST APIs and client libraries
• Existing NTFS APIs for Windows Azure Drives
Massive Distributed Storage Systems in Cloud
• Failures are the norm rather than the exception: as the number of components increases, so does the probability of failure
• Redundancy is necessary to cope with failures
• Replication vs. erasure coding?
MTTF_First = MTTF_One / n (the mean time to the first failure among n components)
Replication vs. Erasure Coding
[Figure: data values a=2 and b=3. Replication stores a second copy of each value; erasure coding stores a=2, b=3, and a single parity a+b=5, so any one lost value can be recomputed from the other two]
[Figure continued: adding a second parity 2a+b=7 gives four stored values (a=2, b=3, a+b=5, 2a+b=7); any two of them can be lost, and a and b are still recoverable by solving the surviving equations]
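The example above can be run as a toy code over the integers (a sketch for illustration only; real erasure codes work over finite fields so that stored values do not grow):

```python
# Toy erasure code over the integers (real codes use finite-field
# arithmetic): store a and b plus parities a+b and 2a+b, so that
# any two of the four stored values suffice to recover (a, b).
def encode(a, b):
    return {"a": a, "b": b, "a+b": a + b, "2a+b": 2 * a + b}

def decode(s):
    """Recover (a, b) from a dict of any two surviving values."""
    if "a" in s and "b" in s:
        return s["a"], s["b"]
    if "a" in s:
        a = s["a"]
        b = s["a+b"] - a if "a+b" in s else s["2a+b"] - 2 * a
        return a, b
    if "b" in s:
        b = s["b"]
        a = s["a+b"] - b if "a+b" in s else (s["2a+b"] - b) // 2
        return a, b
    # only the two parities survive: solve the 2x2 linear system
    a = s["2a+b"] - s["a+b"]
    return a, s["a+b"] - a

coded = encode(2, 3)
# lose both original data values; recover them from the parities
print(decode({"a+b": coded["a+b"], "2a+b": coded["2a+b"]}))  # (2, 3)
```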
WAS Stream Layer
• Append-only distributed file system providing intra-stamp redundancy
• Streams are very large files with a file-system-like namespace: an ordered list of pointers to extents
• Extents are the unit of replication, each a sequence of blocks
[Diagram: a Storage Location Service sits above multiple storage stamps; within each stamp, the Partition Layer runs on top of the Stream Layer, which performs intra-stamp replication; inter-stamp (geo) replication runs between stamps]
Replication and Lazy Erasure Coding
Extents are triple-replicated when first created and while being appended
Replica Placement in a Storage Stamp
Replication and Lazy Erasure Coding
Extents are triple-replicated when first created and while being appended
Extents are lazily erasure-coded in the background:
• extents are sealed at around 3 GB
• when erasure coding finishes, the full replicas are deleted
• policies choose between replication, erasure coding, or a mix
Conventional Erasure Coding – RS 6+3
Reed-Solomon 6+3
[Diagram: a sealed extent (~3 GB) is split into six data fragments d0–d5, and Reed-Solomon 6+3 adds three parity fragments p1–p3]
Erasure Coding Flow
[Diagram: three Stream Managers (SM) coordinated via Paxos direct the Extent Nodes (EN 1, EN 2, EN 3, …): the SM tells an EN to start erasure coding an extent, and the EN acks with metadata]
Space Savings (RS 6+3 over 3-replication)
[Chart: storing 6 data fragments plus 3 code fragments (1.5x) instead of 3 full replicas (3x) halves the raw storage]
How to Further Reduce Storage Cost?
[Diagram: a sealed extent (~3 GB) as six data fragments d0–d5 plus three parities p0–p2; overhead (6+3)/6 = 1.5x]
How to Further Reduce Storage Cost?
[Diagram: moving to twelve data fragments d0–d11 plus four parities p0–p3 lowers the overhead from (6+3)/6 = 1.5x to (12+4)/12 ≈ 1.33x]
Challenge
[Diagram: under RS 12+4 (d0–d11, p0–p3), reconstructing one lost fragment requires reading twelve of the surviving fragments]
Reconstruction Read
[Diagram: the Partition Layer issues a data read to an Extent Node; when the fragment is unavailable, a reconstruction read fans out to the other Extent Nodes holding the surviving fragments]
Challenge
A reconstruction read is expensive: reading d0 during unavailability requires 12 fragments (12 disk I/Os and 12 network transfers)
Reconstruction Read – When?
• Load balancing: avoid a hot storage node by serving reads via reconstruction
• Rolling upgrade
• Transient unavailability and permanent failures
Can we achieve 1.33x overhead while requiring only 6 fragments for reconstruction?
Opportunity
• Conventional erasure coding treats all failures as equal: reconstruction costs the same regardless of how many fragments failed
• In cloud storage, Prob(1 failure) >> Prob(2 or more failures)
Optimize erasure coding for cloud storage by making single-failure reconstruction the most efficient case
Local Reconstruction Code
[Diagram: a sealed extent (~3 GB) split into two local groups of six data fragments each, x0–x5 and y0–y5]
• LRC 12+2+2: 12 data fragments, 2 local parities, and 2 global parities
• storage overhead: (12 + 2 + 2) / 12 ≈ 1.33x
• local parity: single-fragment reconstruction requires only 6 fragments
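A minimal sketch of the local-parity idea in Python, using plain XOR for the local groups (the real LRC global parities q0 and q1 use Galois-field coefficients, which are omitted here):

```python
import os

def xor_blocks(blocks):
    """XOR a list of equal-length byte fragments together."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, v in enumerate(blk):
            out[i] ^= v
    return bytes(out)

# 12 data fragments in two local groups of six
x = [os.urandom(64) for _ in range(6)]
y = [os.urandom(64) for _ in range(6)]
px = xor_blocks(x)  # local parity for group x
py = xor_blocks(y)  # local parity for group y

# Single failure: rebuilding x2 reads only the 5 surviving fragments
# of group x plus px -- 6 fragments instead of 12 under RS 12+4.
rebuilt = xor_blocks([x[i] for i in range(6) if i != 2] + [px])
assert rebuilt == x[2]
```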
One More Thing – Ensuring Reliability in LRC
LRC 12+2+2 needs to recover any 3 failures, and as many 4-failure patterns as possible
Recover 3 Failures
1st example
[Diagram: data fragments x0–x5 with local parity px, y0–y5 with local parity py, and global parities q0, q1; three fragments have failed: y1, x0, and x2]
recover y1 from py (group y)
1st example, continued: after recovering y1 from py (group y), recover x0 and x2 from the global parities q0 and q1
Recover 3 Failures
2nd example
[Diagram: the same layout (x0–x5, y0–y5, px, py, q0, q1) with three failures falling inside group x]
2nd example, continued: cancel all the yi's out of the global parities, yielding q'0 and q'1 that involve only the x fragments, then solve for the three failed x fragments from px, q'0, and q'1
Recover 4 Failures
1st example
[Diagram: four failures, all within group x; py covers only intact fragments and is redundant, so just 3 useful parities (px, q0, q1) remain, and 3 parities cannot recover 4 failures]
2nd example
[Diagram: four failures spread across both groups]
Maximal recoverability check: swap each failed data fragment with its group's local parity, then count the remaining failures against the available global parities
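The swap-and-count test above can be sketched as a small predicate (an illustrative simplification of the paper's argument; the names `recoverable` and `groups` are mine, and failures of the global parities themselves would simply reduce `n_global`):

```python
def recoverable(failed, groups, n_global):
    """Heuristic decodability test: within each local group, one data
    failure can be 'swapped' with the group's local parity; all
    remaining failures must be covered by the global parities.
    `failed` is a set of fragment names; `groups` maps each
    local-parity name to the data fragments it covers."""
    remaining = 0
    for parity, members in groups.items():
        data_failures = sum(1 for m in members if m in failed)
        if parity in failed:
            remaining += data_failures      # local parity itself is gone
        elif data_failures > 0:
            remaining += data_failures - 1  # one failure absorbed locally
    return remaining <= n_global

groups = {"px": [f"x{i}" for i in range(6)],
          "py": [f"y{i}" for i in range(6)]}

# the 3-failure example from earlier slides is recoverable
assert recoverable({"y1", "x0", "x2"}, groups, n_global=2)
# four data failures in one group exceed the 3 usable parities
assert not recoverable({"x0", "x1", "x2", "x3"}, groups, n_global=2)
```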
Properties of LRC
• Achieves maximal recoverability: LRC 12+2+2 recovers any 3 failures and 86% of 4-failure patterns; reliability: RS 12+4 > LRC 12+2+2 > RS 6+3
  technical paper published at USENIX ATC 2012
• Requires the minimum storage overhead for a given reconstruction cost and fault tolerance
  technical paper to appear in IEEE Trans. on Information Theory
Cost & Performance Analysis
• Varying the LRC parameters yields trade-off points in a 3D space: storage overhead, reconstruction cost, and reliability (MTTDL)
• Reliability is a hard requirement: setting the MTTDL of 3-replication as the target reduces the trade-off space to 2D
LRC vs. Reed-Solomon
[Chart: reconstruction read cost (0–12) vs. storage overhead (1.2x–2x). Points shown: RS 6+3 (GFS II / Colossus at Google), RS 10+4 (HDFS-RAID at Facebook), RS 12+4, LRC (12+2+2), and LRC (12+4+2). At the same 1.5x overhead, LRC halves the reconstruction cost (6 to 3); at the same reconstruction cost, LRC cuts the overhead from 1.5x to 1.33x]
Choice of Windows Azure Storage
RS (6+3): reconstruction cost = 6
LRC (14+2+2): reconstruction cost = 7
RS (14+4): reconstruction cost = 14
LRC (14+2+2) delivers 14% storage savings over RS (6+3) at a comparable reconstruction cost
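These trade-off numbers follow directly from the code parameters; a tiny helper (the function names are mine) reproduces them, assuming a single failure is rebuilt within its local group:

```python
def rs(k, m):
    # Reed-Solomon k+m: overhead and single-failure reconstruction cost
    return {"name": f"RS {k}+{m}", "overhead": (k + m) / k, "recon_cost": k}

def lrc(k, l, g):
    # LRC k+l+g: k data fragments in l local groups plus g global
    # parities; a single failure is rebuilt from its local group,
    # reading k/l surviving fragments plus the local parity
    return {"name": f"LRC {k}+{l}+{g}", "overhead": (k + l + g) / k,
            "recon_cost": k // l}

for c in (rs(6, 3), lrc(14, 2, 2), rs(14, 4)):
    print(f"{c['name']:14s} overhead {c['overhead']:.2f}x  "
          f"reconstruction reads {c['recon_cost']}")
```

With these formulas, LRC 14+2+2 has overhead 18/14 ≈ 1.29x versus 1.5x for RS 6+3, which is where the 14% savings comes from.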
Encoding/Decoding Complexity
• Optimized erasure coding implementation: a homomorphic transformation converts complex finite-field matrix operations into XORs over large data blocks
• XOR scheduling further improves throughput
• Encoding/decoding overhead is negligible compared with network and disk overheads
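The reduction to XOR can be illustrated with Python's arbitrary-precision integers, which XOR an entire block in one operation (a sketch only; production implementations use SIMD XOR over cache-sized chunks):

```python
def xor_large(a: bytes, b: bytes) -> bytes:
    # one wide XOR across the whole block, instead of per-symbol
    # finite-field multiply-and-add
    n = len(a)
    return (int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).to_bytes(n, "big")

block1 = bytes(range(256)) * 16            # 4 KB data block
block2 = bytes(reversed(range(256))) * 16  # another 4 KB block
parity = xor_large(block1, block2)

# XOR is its own inverse: parity ^ block2 recovers block1
assert xor_large(parity, block2) == block1
```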
Additional Lessons
• I/O scheduling: reconstruction, recovery, and on-demand traffic are prioritized and throttled carefully; the erasure-encoding rate keeps up with the data-injection rate
• Data consistency: checksums are handed off and verified between all levels, and data is scrubbed periodically
Summary
Erasure coding enables significant storage cost savings in Windows Azure Storage with higher reliability than 3-replication
LRC achieves additional 14% savings without compromising performance
Windows Azure Storage Team Blog http://blogs.msdn.com/b/windowsazurestorage/