A Tale of Two Erasure Codes in HDFS
Mingyuan Xia*, Mohit Saxena+,
Mario Blaum+, and David A. Pease+
*McGill University, +IBM Research Almaden
FAST’15
何军权 2015-04-30
Outline
Introduction & Motivation
Design
Evaluation
Conclusions
Related work
Introduction & Motivation
Big Data Storage
Reliability and Availability
Replication: 3-way replication
Erasure Code: Reed-Solomon(RS), LRC
GFS: 3-way replication, 3x, 2003
FB HDFS: RS, 1.4x, 2011
GFS v2: RS, 1.5x, 2012
Azure: LRC, 1.33x, 2012
FB HDFS: LRC, 1.66x, 2013
Popular Erasure Code Families
Product Code(PC)
Local Reconstruction Code(LRC)
Other
Reed-Solomon(RS)
LRC layout:
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1 L2 L3 L4 L5
PC layout:
a0 a1 a2 a3 a4 ha
b0 b1 b2 b3 b4 hb
P0 P1 P2 P3 P4 h
Erasure Code
Facebook HDFS RS(10,4)
Compute 4 parities per 10 data blocks
All blocks are stored on different storage nodes
Storage Overhead: 1.4x
Data blocks: D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
Parity blocks: P1 P2 P3 P4
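The 1.4x figure follows directly from the stripe shape. A minimal sketch of the arithmetic, assuming storage overhead means raw blocks stored per data block; the function name is illustrative and not taken from HDFS:

```python
def rs_storage_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw blocks stored per data block for an RS(data, parity) stripe."""
    return (data_blocks + parity_blocks) / data_blocks


# RS(10,4): 10 data blocks plus 4 parities, as on the slide above.
print(rs_storage_overhead(10, 4))  # 1.4
# 3-way replication, for comparison: 1 data block plus 2 extra copies.
print(rs_storage_overhead(1, 2))   # 3.0
```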
Erasure Code
High Degraded Read Latency
A read of an unavailable block requires multiple disk reads, network transfers, and compute cycles to decode
[Figure: a client read against HDFS raises a read exception]
Erasure Code
Long Reconstruction Time
Facebook's Cluster:
100K blocks lost per day
50 machine-unavailability events per day
Reconstruction traffic: 180TB per day
[Figure: HDFS reconstruction job]
Erasure Code
Recovery Cost: the total number of blocks required to reconstruct a data block after a failure
[Figure: Recovery Cost linked to Degraded Read Latency and Reconstruction Time]
Recovery Cost vs. Storage Overhead
Conclusion
With a single erasure code, storage overhead and reconstruction cost trade off against each other.
[Figure: recovery cost vs. storage overhead for FB HDFS RS, GFS v2 RS, Azure LRC, FB HDFS LRC, and GFS 3-way replication]
How to balance?
Storage Overhead vs. Recovery Cost
Data Access Skew
Conclusions
Only a small fraction of the data is "hot": P(freq > 10) ~= 1%
Most of the data is "cold": P(freq <= 10) ~= 99%
Data Access Skew
Hot data
High access frequency
A small fraction of data
Cold data
Low access frequency
A major fraction of data
A small improvement in reads of hot data yields a large gain in read performance
A small reduction in the storage used per unit of cold data saves a huge amount of space
Hot Data: Decrease the Recovery Cost
Cold Data: High Storage Efficiency
HACFS
System State
Tracks file states
File size, last mTime
Read count and coding state
Adaptive Coding
Tracks system states
Chooses the coding scheme based on read count and mTime
Erasure Coding
Provides four coding interfaces
Encode/Decode
Upcode/Downcode
Erasure Coding Algorithms
Two different erasure codes
Fast code:
Encodes the frequently accessed blocks to reduce read latency and reconstruction time
Provides overall low recovery cost
Compact code:
Encodes the less frequently accessed blocks to get low storage overhead
Maintains a low and bounded storage overhead
State Transition
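[Figure: state transition diagram in HACFS. Recently created files start in 3-way replication; once write cold, they move to the Fast Code or the Compact Code, with transitions labeled COND and COND'.]
COND : Read Hot and Bounded
COND': Read Cold or Not Bounded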
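The two conditions can be read as a small decision function. The sketch below is an interpretation of the slide, not HACFS code: the names `Coding` and `next_coding` are hypothetical, and it assumes that COND selects the fast code, COND' selects the compact code, and a file leaves replication only once it is write cold.

```python
from enum import Enum


class Coding(Enum):
    REPLICATED = "3-way replication"
    FAST = "fast code"
    COMPACT = "compact code"


def next_coding(current: Coding, write_cold: bool,
                read_hot: bool, bounded: bool) -> Coding:
    """Pick the next coding state for a file (assumed reading of the diagram)."""
    # Assumption: recently created files stay replicated until write cold.
    if current is Coding.REPLICATED and not write_cold:
        return Coding.REPLICATED
    # COND: read hot AND storage overhead still within the bound.
    if read_hot and bounded:
        return Coding.FAST
    # COND': read cold OR not bounded.
    return Coding.COMPACT


print(next_coding(Coding.REPLICATED, write_cold=True, read_hot=True, bounded=True))
print(next_coding(Coding.FAST, write_cold=True, read_hot=False, bounded=True))
```

Here `bounded` stands for "the system-wide storage overhead is still under its configured limit"; how that limit is measured is not specified on the slide.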
Fast and Compact Product Codes
Fast Code (Product Code 2x5)
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
Pa0 Pa1 Pa2 Pa3 Pa4 Pha
Storage overhead: 1.8x
Recovery Cost: 2
Compact Code (Product Code 6x5)
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
P0 P1 P2 P3 P4 Ph
Storage overhead: 1.4x
Recovery Cost: 5
• P0=XOR(a0,a5,b0,b5,c0,c5)
• ha1=RS(a0,a1,a2,a3,a4)
• Pa0=XOR(a0,a5)
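The overhead numbers and the fast code's recovery cost of 2 can be checked with a few lines. This is a minimal sketch: `product_code_overhead` and `xor_blocks` are illustrative helpers, the block contents are made-up sample bytes, and the overhead formula assumes one extra parity row and one extra parity column as drawn in the grids.

```python
def product_code_overhead(rows: int, cols: int) -> float:
    """rows x cols data blocks plus one parity row and one parity column."""
    return (rows + 1) * (cols + 1) / (rows * cols)


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


print(product_code_overhead(2, 5))  # 1.8
print(product_code_overhead(6, 5))  # 1.4

# Fast code column 0: Pa0 = XOR(a0, a5).
a0 = b"hot block a0"
a5 = b"hot block a5"
pa0 = xor_blocks(a0, a5)

# If a0 is lost, two blocks (a5 and Pa0) are enough to rebuild it.
recovered_a0 = xor_blocks(a5, pa0)
print(recovered_a0 == a0)  # True
```

In the compact code the same column holds six data blocks, so the vertical XOR path would read six blocks; the recovery cost of 5 quoted above corresponds to the horizontal path through a row of five.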
Fast and Compact LRC
Fast Code (LRC(12,6,2))
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1 L2 L3 L4 L5
Storage overhead: 20/12=1.67x
Recovery Cost: 2
{G1,G2}=RS(a0,a1,..,a11)
Li=XOR(ai, ai+6)
Compact Code (LRC(12,2,2))
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1
Storage overhead: 16/12=1.33x
Recovery Cost: 6
{G1,G2}=RS(a0,a1,..,a11)
Li=RS'(a0, a1, a2, a6, a7, a8)
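Both LRC figures follow from the (data, local, global) parameters. A minimal sketch, assuming data blocks are split evenly across the local parities and that a lost data block is rebuilt from its local group; the function names are illustrative:

```python
def lrc_storage_overhead(data: int, local: int, global_: int) -> float:
    """Raw blocks stored per data block for LRC(data, local, global)."""
    return (data + local + global_) / data


def lrc_data_recovery_cost(data: int, local: int) -> int:
    """Blocks read to rebuild one data block from its local group.

    Each local group holds data // local data blocks. Rebuilding one of
    them reads the remaining group members plus the local parity, which
    adds up to data // local blocks.
    """
    return data // local


for name, (k, l, g) in {"fast": (12, 6, 2), "compact": (12, 2, 2)}.items():
    print(name,
          round(lrc_storage_overhead(k, l, g), 2),
          lrc_data_recovery_cost(k, l))
```

Under these assumptions the fast code gives 1.67x with cost 2 and the compact code gives 1.33x with cost 6, matching the slide.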
Upcoding for Product Codes
Fast Code PC(2x5), three stripes:
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
Pa0 Pa1 Pa2 Pa3 Pa4 Pha
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
Pb0 Pb1 Pb2 Pb3 Pb4 Phb
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
Pc0 Pc1 Pc2 Pc3 Pc4 Phc
Compact Code PC(6x5), one stripe:
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
P0 P1 P2 P3 P4 Ph
• Parities h require no reconstruction
• Parities P require no data block transfer
• All parity updates can be done in parallel
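The "no data block transfer" point follows from XOR being associative: the compact vertical parity is the XOR of the three fast vertical parities in the same column. A minimal sketch for column 0, with made-up block contents and an illustrative `xor_blocks` helper:

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Column 0 of the three fast stripes: (a0, a5), (b0, b5), (c0, c5).
column0 = {
    "a": (b"block-a0", b"block-a5"),
    "b": (b"block-b0", b"block-b5"),
    "c": (b"block-c0", b"block-c5"),
}

# Existing fast-code parities Pa0, Pb0, Pc0.
fast_parities = [xor_blocks(*pair) for pair in column0.values()]

# Upcode: build the compact parity P0 from the three parities only.
p0_upcoded = xor_blocks(*fast_parities)

# Reference: P0 = XOR(a0, a5, b0, b5, c0, c5) computed from the data blocks.
p0_direct = xor_blocks(*(blk for pair in column0.values() for blk in pair))

print(p0_upcoded == p0_direct)  # True
```

Only three parity blocks are read per column, and each column is independent, which is consistent with the parallel-update bullet above.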
Downcoding for Product Codes
Compact Code PC(6x5), one stripe:
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
P0 P1 P2 P3 P4 Ph
Fast Code PC(2x5), three stripes:
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
Pa0 Pa1 Pa2 Pa3 Pa4 Pha
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
Pb0 Pb1 Pb2 Pb3 Pb4 Phb
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
Pc0 Pc1 Pc2 Pc3 Pc4 Phc
• Pa0=XOR(a0,a5)
• Pc0=XOR(P0,Pa0,Pb0)
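The two formulas above mean that only two of the three fast parities per column are computed from data blocks; the third is derived from the old compact parity. A minimal sketch for column 0, with made-up block contents and an illustrative `xor_blocks` helper:

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


a0, a5 = b"block-a0", b"block-a5"
b0, b5 = b"block-b0", b"block-b5"
c0, c5 = b"block-c0", b"block-c5"

# Existing compact-code parity for column 0.
p0 = xor_blocks(a0, a5, b0, b5, c0, c5)

# Downcode: Pa0 and Pb0 are computed from their data blocks...
pa0 = xor_blocks(a0, a5)
pb0 = xor_blocks(b0, b5)
# ...while Pc0 is derived from parities alone, without reading c0 or c5.
pc0 = xor_blocks(p0, pa0, pb0)

print(pc0 == xor_blocks(c0, c5))  # True
```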
Evaluation Platform
CPU: Intel Xeon E5645 24 cores, 2.4GHz
Disk: 7.2K RPM, 6*2TB
Memory: 96GB
Network: 1Gbps NIC
Cluster size: 11 nodes
Workload
[Table: workload traces; CC: Cloudera Customer, FB: Facebook]
Evaluation Metrics
Degraded read latency
Foreground read request latency
Reconstruction time
Background recovery for failures
Storage overhead
Degraded Read Latency
Production systems: 16-21 seconds
HACFS: 10-14 seconds
The storage overhead of HACFS-LRC and HACFS-PC is bounded to 1.4x and 1.5x, respectively
Reconstruction Time
Scenario: a disk holding 100GB of data fails
HACFS-PC takes about 10-35 minutes less than the production systems
HACFS-LRC is worse than RS(6,3) in GFS v2
To reconstruct global parities, HACFS-LRC needs to read 12 blocks, but GFS v2 only 6 blocks
System Comparison
Colossus FS: RS(6,3)-1.5x
HDFS-RAID: RS(10,4)-1.4x
Azure: LRC(12,2,2)-1.33x
HACFS-PC:
PC(2x5)-1.8x
PC(6x5)-1.4x
HACFS-LRC:
LRC(12,6,2)-1.67x
LRC(12,2,2)-1.33x
lost block type | HACFS-PC | HACFS-LRC | Colossus FS | HDFS-RAID | Azure
data block | fast: 2, comp: 5 | fast: 2, comp: 6 | 6 | 10 | 6
global parity | fast: 5, comp: 6 | fast: 12, comp: 12 | 6 | 10 | 12
Conclusions
By using erasure codes, a lot of storage space can be saved.
Production systems that use a single erasure code cannot balance the tradeoff between recovery cost and storage overhead very well.
By using dynamically adaptive coding, HACFS can provide both low recovery cost and low storage overhead.
Related Work
f4, OSDI'14
Divides data into cold and hot by data age.
XOR-based Erasure Code, FAST’12
Combines RS with XOR.
Minimum-Storage-Regeneration (MSR)
Minimizes network transfers during reconstruction.
Product-Matrix-Reconstruct-By-Transfer (PM-RBT), FAST’15
Optimal in terms of I/O, storage, and network bandwidth.
Thank You!
Acknowledgment
Prof. Xiong
Zigang Zhang
Biao Ma
CAS - ICT - Storage System Group