Native Erasure Coding Support Inside HDFS
TRANSCRIPT
Replication is Expensive
§ HDFS inherits 3-way replication from Google File System - Simple, scalable and robust
§ 200% storage overhead
§ Secondary replicas rarely accessed
[Diagram: NameNode tracks a Block stored as three Replicas, one each on DataNode0, DataNode1, DataNode2]
Erasure Coding Saves Storage
§ Simplified Example: storing 2 bits
Replication: 1 0 → store 1 0 twice more (2 extra bits)
XOR Coding: 1 0 → store 1 ⊕ 0 = 1 (1 extra bit)
§ Same data durability - can lose any 1 bit
§ Half the storage overhead
§ Slower recovery
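The XOR trick above can be sketched in a few lines (illustrative Python, not HDFS code):

```python
# Minimal sketch of the XOR example above: store 2 data bits plus
# 1 parity bit, and recover any single lost bit from the other two.

def xor_encode(b0, b1):
    """Return the parity bit for two data bits."""
    return b0 ^ b1

def xor_recover(known_bit, parity):
    """Recover a lost data bit from the surviving bit and the parity."""
    return known_bit ^ parity

data = (1, 0)
parity = xor_encode(*data)          # 1 ^ 0 = 1

# Simulate losing the second data bit and recovering it:
recovered = xor_recover(data[0], parity)
assert recovered == data[1]
```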
Erasure Coding Saves Storage
§ Facebook - f4 stores 65PB of BLOBs in EC
§ Windows Azure Storage (WAS) - A PB of new data every 1~2 days - All “sealed” data stored in EC
§ Google File System - Large portion of data stored in EC
Roadmap
§ Background of EC - Redundancy Theory - EC in Distributed Storage Systems
§ HDFS-EC architecture - Choosing Block Layout - NameNode — Generalizing the Block Concept - Client — Parallel I/O - DataNode — Background Reconstruction
§ Hardware-accelerated Codec Framework
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage holds useful data?

3-way Replication: Data Durability = 2, Storage Efficiency = 1/3 (33%)
[Diagram: NameNode tracks a Block stored as one useful Replica plus two redundant Replicas across DataNode0, DataNode1, DataNode2]
XOR: Data Durability = 1, Storage Efficiency = 2/3 (67%)

X Y X ⊕ Y
0 0   0
0 1   1
1 0   1
1 1   0

Recovery example: if Y is lost, Y = X ⊕ (X ⊕ Y) = 0 ⊕ 1 = 1
(useful data: X, Y; redundant data: X ⊕ Y)
Reed-Solomon (RS): Data Durability = 2, Storage Efficiency = 4/6 (67%)
Very flexible!
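To show why RS is "very flexible", here is a toy Reed-Solomon-style code over the prime field GF(257). Real coders (e.g. ISA-L) work over GF(2^8); this sketch only illustrates the core idea of encoding k data symbols and decoding from any k surviving symbols:

```python
# Illustrative Reed-Solomon-style erasure code over the prime field GF(257).
# Encode k data symbols as evaluations of a degree-(k-1) polynomial at k+m
# points; any k surviving points recover the polynomial, hence the data.

P = 257  # field modulus (prime)

def interpolate(points, x):
    """Lagrange interpolation at x over GF(P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat).
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, m):
    """Encode k data symbols into k + m codeword symbols (points x = 0, 1, ...)."""
    k = len(data)
    return [interpolate(list(enumerate(data)), x) for x in range(k + m)]

def decode(survivors, k):
    """Recover the k data symbols from any k surviving (index, value) pairs."""
    pts = survivors[:k]
    return [interpolate(pts, x) for x in range(k)]

data = [10, 20, 30, 40]                 # k = 4 data symbols
codeword = encode(data, 2)              # RS(4,2): tolerates any 2 erasures
# Lose symbols 0 and 3, decode from the remaining four:
survivors = [(i, v) for i, v in enumerate(codeword) if i not in (0, 3)]
assert decode(survivors, 4) == data
```

Note the code is systematic: the first k codeword symbols equal the data, just as the RS schemes in the deck keep data cells intact and append parity cells.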
Comparison:

                        Data Durability   Storage Efficiency
Single Replica          0                 100%
3-way Replication       2                 33%
XOR with 6 data cells   1                 86%
RS (6,3)                3                 67%
RS (10,4)               4                 71%
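Every row of this table follows from two small formulas, sketched here (the helper names are illustrative):

```python
# Durability/efficiency formulas behind the comparison table:
# n-way replication tolerates n-1 failures at efficiency 1/n;
# a (k data, m parity) erasure code tolerates m failures at efficiency k/(k+m).

def replication(n):
    return n - 1, 1 / n

def erasure_code(k, m):
    return m, k / (k + m)

schemes = {
    "Single Replica":        replication(1),
    "3-way Replication":     replication(3),
    "XOR with 6 data cells": erasure_code(6, 1),
    "RS (6,3)":              erasure_code(6, 3),
    "RS (10,4)":             erasure_code(10, 4),
}

for name, (durability, eff) in schemes.items():
    print(f"{name:22s} durability={durability} efficiency={eff:.0%}")
```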
EC in Distributed Storage
Block Layout:

Contiguous Layout: the file is split into full 128MB blocks (block 0 = 0~128M on DataNode 0, block 1 = 128~256M on DataNode 1, …, block 5 = 640~768M on DataNode 5), with parity blocks on DataNode 6 onward.
Data Locality ✓
Small Files ✗

Striped Layout: the file is split into small cells written round-robin across the blocks (0~1M to block 0, 1~2M to block 1, …, 5~6M to block 5, then 6~7M wraps back to block 0), with parity cells on the parity DataNodes.
Data Locality ✗
Small Files ✓
Parallel I/O ✓
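The striped layout's offset arithmetic can be sketched as follows (the 1MB cell size and 6 data blocks match the example above; the function name is illustrative):

```python
# Sketch of the striped layout: map a logical file offset to the internal
# block and stripe that hold it, assuming 6 data blocks and 1MB cells.

CELL_SIZE = 1 << 20      # 1 MB cells, as in the 0~1M, 1~2M, ... example
DATA_BLOCKS = 6

def locate(offset):
    """Return (block_index, stripe_index, offset_in_cell) for a file offset."""
    cell = offset // CELL_SIZE            # global cell number
    block_index = cell % DATA_BLOCKS      # round-robin across data blocks
    stripe = cell // DATA_BLOCKS          # which stripe (row) holds the cell
    return block_index, stripe, offset % CELL_SIZE

# Byte 0 lands in block 0, stripe 0; byte 6M wraps back to block 0, stripe 1:
assert locate(0) == (0, 0, 0)
assert locate(6 * CELL_SIZE) == (0, 1, 0)
```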
EC in Distributed Storage
Spectrum (Replication ↔ Erasure Coding, Contiguous ↔ Striping):
- Contiguous + Replication: HDFS
- Contiguous + Erasure Coding: Facebook f4, Windows Azure
- Striping + Replication: Ceph, Quantcast File System
- Striping + Erasure Coding: Ceph, Quantcast File System
Roadmap
§ Background of EC - Redundancy Theory - EC in Distributed Storage Systems
§ HDFS-EC architecture - Choosing Block Layout - NameNode — Generalizing the Block Concept - Client — Parallel I/O - DataNode — Background Reconstruction
§ Hardware-accelerated Codec Framework
Choosing Block Layout
• Assuming (6,3) coding
• Small files: < 1 block
• Medium: 1~6 blocks
• Large: > 6 blocks (1 group)

Cluster A Profile (top 2% of files occupy ~65% of space):
              small     medium    large
file count    96.29%    1.86%     1.85%
space usage   26.06%    9.33%     64.61%

Cluster B Profile (top 2% of files occupy ~40% of space):
              small     medium    large
file count    86.59%    11.38%    2.03%
space usage   23.89%    36.03%    40.08%

Cluster C Profile (dominated by small files):
              small     medium    large
file count    99.64%    0.36%     0.00%
space usage   76.05%    20.75%    3.20%
Choosing Block Layout

                 Contiguous              Striping
Replication      Current HDFS            Phase 3 (future work)
Erasure Coding   Phase 2 (future work)   Phase 1.1, Phase 1.2
NameNode: Generalizing the Block Concept
Mapping Logical and Storage Blocks
Too Many Storage Blocks? → Hierarchical Naming Protocol
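One way such a hierarchical naming protocol can work (a sketch with an assumed bit layout, not necessarily HDFS's exact one) is to reserve the low bits of each block ID for the index within its block group, so the NameNode only tracks one logical block group per stripe set:

```python
# Sketch of hierarchical block naming: each internal storage block derives
# its ID from the group ID plus its index, so no per-block mapping table is
# needed. (Bit widths here are illustrative assumptions.)

INDEX_BITS = 4                      # low bits reserved for the in-group index
INDEX_MASK = (1 << INDEX_BITS) - 1

def internal_block_id(group_id, index):
    """Derive a storage block's ID from its group ID and position."""
    assert group_id & INDEX_MASK == 0, "group IDs end in zero index bits"
    assert 0 <= index <= INDEX_MASK
    return group_id | index

def group_of(block_id):
    """Recover the owning group ID from any internal block ID."""
    return block_id & ~INDEX_MASK

group = 0x1A2B0               # example group ID with low bits zeroed
blk3 = internal_block_id(group, 3)
assert group_of(blk3) == group
```

With this scheme a DataNode block report needs no lookup table: masking off the index bits of any reported block ID yields the logical block group it belongs to.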
Client Parallel Writing
§ One streamer per target DataNode, each draining its own packet queue, so data and parity streams are written in parallel
§ A Coordinator manages the streamers
[Diagram: queue → streamer → DataNode, repeated for each DataNode, with the Coordinator above the streamers]
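A minimal sketch of this write path, using XOR parity in place of RS and toy sizes (the cell size, block count, and function names are assumptions for illustration):

```python
# Sketch of striped parallel writing: chop the byte stream into cells,
# round-robin them onto per-streamer queues, and append a parity cell per
# stripe. XOR parity stands in for RS to keep the sketch short.

CELL = 4                      # toy cell size in bytes
DATA_BLOCKS = 3               # toy (3,1) layout instead of (6,3)

def stripe_write(data):
    queues = [[] for _ in range(DATA_BLOCKS + 1)]   # data queues + 1 parity
    stripe_bytes = CELL * DATA_BLOCKS
    # Pad to a whole number of stripes with zero bytes (toy simplification).
    padded = (len(data) + stripe_bytes - 1) // stripe_bytes * stripe_bytes
    data = data.ljust(padded, b"\0")
    for s in range(0, len(data), stripe_bytes):
        cells = [data[s + i * CELL: s + (i + 1) * CELL]
                 for i in range(DATA_BLOCKS)]
        for i, cell in enumerate(cells):
            queues[i].append(cell)                  # one queue per streamer
        parity = bytes(a ^ b ^ c for a, b, c in zip(*cells))
        queues[DATA_BLOCKS].append(parity)          # parity streamer's queue
    return queues

q = stripe_write(b"abcdefghijkl")   # exactly one stripe of three cells
```

In the real client each queue would be drained concurrently by its own streamer thread; here the queues just make the cell-to-streamer assignment visible.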
Client Parallel Reading
§ The client reads data blocks from multiple DataNodes in parallel
§ Parity DataNodes are read only when recovery is needed
[Diagram: client fanning reads out across DataNode … DataNode, plus a parity DataNode]
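The recovery-read path can be sketched the same way (again with XOR parity standing in for RS, and illustrative names):

```python
# Sketch of striped reading with a recovery read: request the data cells in
# parallel; if one read fails, fetch the parity cell and rebuild the missing
# cell from the survivors.

def read_stripe(data_cells, parity_cell, failed_index=None):
    """data_cells holds one cell per data block (None at failed_index);
    returns the full stripe of data cells."""
    if failed_index is None:
        return data_cells                      # fast path: no recovery read
    # Recovery read: XOR the surviving cells with the parity cell.
    rebuilt = parity_cell
    for i, cell in enumerate(data_cells):
        if i != failed_index:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, cell))
    out = list(data_cells)
    out[failed_index] = rebuilt
    return out

cells = [b"abcd", b"efgh", b"ijkl"]
parity = bytes(a ^ b ^ c for a, b, c in zip(*cells))
damaged = [cells[0], None, cells[2]]           # DataNode 1 failed
assert read_stripe(damaged, parity, failed_index=1) == cells
```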
Reconstruction on DataNode
§ Important to avoid delay on the critical path - Especially if original data is lost
§ Integrated with Replication Monitor - Under-protected EC blocks scheduled together with under-replicated blocks - New priority algorithms
§ New ErasureCodingWorker component on DataNode
Roadmap
§ Background of EC - Redundancy Theory - EC in Distributed Storage Systems
§ HDFS-EC architecture - Choosing Block Layout - NameNode — Generalizing the Block Concept - Client — Parallel I/O - DataNode — Background Reconstruction
§ Hardware-accelerated Codec Framework
Acceleration with Intel ISA-L
§ 1 legacy coder - From Facebook’s HDFS-RAID project
§ 2 new coders - Pure Java — code improvement over HDFS-RAID - Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)
Microbenchmark: Codec Calculation [chart]
Microbenchmark: HDFS I/O [chart]
Conclusion
§ Erasure coding expands effective storage space by ~50%!
§ HDFS-EC phase I implements erasure coding in striped block layout
§ Upstream effort (HDFS-7285): - Design finalized Nov. 2014 - Development started Jan. 2015 - 218 commits, ~25k LoC change - Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo (Japan)
§ Phase II will support contiguous block layout for better locality
Acknowledgements
§ Cloudera - Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
§ Intel - Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
§ Hortonworks - Jing Zhao, Tsz Wo Nicholas Sze
§ Huawei - Walter Su, Rakesh R, Xinwei Qin
§ Yahoo (Japan) - Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
Just merged to trunk!
Questions?
Erasure Coding: A type of Error Correction Coding
Client Parallel Writing
[Diagram: DFSStripedOutputStream holds dataQueue 0 ~ dataQueue 4 feeding DataStreamer 0 ~ DataStreamer 4, which write blk_1009 ~ blk_1013 of one blockGroup; the Coordinator allocates each new blockGroup]
Client Parallel Reading
[Diagram: Stripe 0 ~ Stripe 2 laid out across the DataNodes holding data blocks and parity blocks; normal reads request the data cells, failed requests trigger recovery reads of the parity cells, and all-zero cells are skipped]