Debunking the Myths of HDFS Erasure Coding Performance
TRANSCRIPT
Replication is Expensive

- HDFS inherits 3-way replication from the Google File System: simple, scalable, and robust
- 200% storage overhead
- Secondary replicas are rarely accessed
Erasure Coding Saves Storage

Simplified example, storing 2 bits:

- Replication: store each bit twice, costing 2 extra bits
- XOR coding: store both bits plus one parity bit (1 ⊕ 0 = 1), costing 1 extra bit
- Same data durability: either scheme can lose any 1 bit
- Half the storage overhead, but slower recovery
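The 2-bit example can be sketched in a few lines of Python (a toy illustration, not HDFS code): XOR coding stores one parity bit instead of two extra replicas, and any single lost bit is recoverable as the XOR of the two survivors.

```python
def xor_encode(d0, d1):
    """Store 2 data bits plus 1 parity bit (d0 XOR d1)."""
    return [d0, d1, d0 ^ d1]

def xor_recover(cells, lost_index):
    """Recover a lost cell as the XOR of the two surviving cells."""
    survivors = [c for i, c in enumerate(cells) if i != lost_index]
    return survivors[0] ^ survivors[1]

stored = xor_encode(1, 0)           # [1, 0, 1] -> only 1 extra bit
assert xor_recover(stored, 0) == 1  # lose d0, rebuild from d1 and parity
assert xor_recover(stored, 1) == 0  # lose d1
assert xor_recover(stored, 2) == 1  # the parity bit itself can be rebuilt
```

Replication of the same 2 bits would need 2 extra bits for the same single-failure tolerance.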
Erasure Coding Saves Storage (in production)

- Facebook: f4 stores 65PB of BLOBs in EC
- Windows Azure Storage (WAS): a PB of new data every 1~2 days; all "sealed" data stored in EC
- Google File System: a large portion of data stored in EC
Roadmap

- Background of EC: redundancy theory; EC in distributed storage systems
- HDFS-EC architecture
- Hardware-accelerated codec framework
- Performance evaluation
Durability and Efficiency

Data durability = how many simultaneous failures can be tolerated.
Storage efficiency = how much of the raw storage holds useful data.

3-way replication: durability = 2, efficiency = 1/3 (33%). One copy is useful data, two are redundant.
XOR coding: durability = 1, efficiency = 2/3 (67%).

X  Y  X ⊕ Y
0  0    0
0  1    1
1  0    1
1  1    0

A lost cell is the XOR of the survivors, e.g. Y = 0 ⊕ 1 = 1.
Reed-Solomon (RS), here with 4 data and 2 parity cells: durability = 2, efficiency = 4/6 (67%). Very flexible: the number of data and parity cells can be tuned.
Scheme                   Data Durability   Storage Efficiency
Single replica           0                 100%
3-way replication        2                 33%
XOR with 6 data cells    1                 86%
RS (6,3)                 3                 67%
RS (10,4)                4                 71%
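For MDS codes such as Reed-Solomon, the rows of the table above follow from one rule: durability equals the number of parity cells, and efficiency is data / (data + parity). A toy sketch (not HDFS code):

```python
def ec_profile(data_cells, parity_cells):
    """For an MDS code like Reed-Solomon: durability equals the number
    of parity cells; efficiency is the useful fraction of raw storage."""
    durability = parity_cells
    efficiency = data_cells / (data_cells + parity_cells)
    return durability, round(efficiency * 100)

assert ec_profile(6, 1) == (1, 86)    # XOR with 6 data cells
assert ec_profile(6, 3) == (3, 67)    # RS (6,3)
assert ec_profile(10, 4) == (4, 71)   # RS (10,4)
```

3-way replication fits the same efficiency formula (1 useful cell, 2 redundant: 33%) but pays far more storage per unit of durability.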
EC in Distributed Storage: Block Layout

Contiguous layout: the file is split into full 128MB blocks (0~128M on block 0 / DataNode 0, 128~256M on block 1 / DataNode 1, ..., 640~768M on block 5 / DataNode 5), with parity blocks on further DataNodes.

Data Locality 👍🏻  Small Files 👎🏻
Striped layout: the file is split into small cells (0~1M, 1~2M, ..., 5~6M, 6~7M, ...) striped round-robin across blocks 0 through 5 on DataNode 0 through DataNode 5, with parity on further DataNodes.

Data Locality 👎🏻  Small Files 👍🏻  Parallel I/O 👍🏻
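Under striping, a logical file offset maps to a cell, and cells go round-robin across the data blocks of a group. A toy sketch of that mapping, assuming 1MB cells and 6 data blocks as in the example (the constants are illustrative, not the HDFS implementation):

```python
CELL_SIZE = 1 << 20   # 1 MB cells, as in the 0~1M, 1~2M, ... example
DATA_BLOCKS = 6       # assuming a (6,3) group with 6 data blocks

def locate(offset):
    """Map a logical file offset to (block index within the group,
    offset inside that storage block) under striping."""
    cell = offset // CELL_SIZE
    block_index = cell % DATA_BLOCKS        # cells round-robin across blocks
    stripe = cell // DATA_BLOCKS            # which "row" of cells
    block_offset = stripe * CELL_SIZE + offset % CELL_SIZE
    return block_index, block_offset

assert locate(0) == (0, 0)                      # first cell -> block 0
assert locate(5 * CELL_SIZE) == (5, 0)          # cell 5 -> block 5
assert locate(6 * CELL_SIZE) == (0, CELL_SIZE)  # cell 6 wraps to block 0
```

This is why small files benefit: even a sub-block file spreads across many DataNodes, at the cost of data locality for any single reader.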
EC in Distributed Storage

The spectrum (block layout × redundancy form):

              Replication                     Erasure Coding
Contiguous    HDFS                            Facebook f4, Windows Azure
Striped       Ceph, Quantcast File System     Ceph, Quantcast File System
Roadmap: HDFS-EC Architecture
Choosing Block Layout

Assuming (6,3) coding:
- Small files: < 1 block
- Medium files: 1~6 blocks
- Large files: > 6 blocks (more than 1 group)
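These size buckets can be written out directly; the thresholds below assume 128MB blocks and a (6,3) group as stated on the slide (toy code, not HDFS):

```python
BLOCK = 128 * 1024 * 1024   # 128 MB HDFS block
DATA_BLOCKS = 6             # (6,3) coding: one group holds 6 data blocks

def classify(file_size):
    """Bucket a file as the slides do: small < 1 block,
    medium = 1~6 blocks, large > one full EC group."""
    if file_size < BLOCK:
        return "small"
    if file_size <= DATA_BLOCKS * BLOCK:
        return "medium"
    return "large"

assert classify(4 * 1024 * 1024) == "small"      # 4 MB
assert classify(300 * 1024 * 1024) == "medium"   # ~2.3 blocks
assert classify(1 << 30) == "large"              # 1 GB = 8 blocks
```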
Cluster A Profile (top 2% of files occupy ~65% of space):

              small     medium   large
file count    96.29%    1.86%    1.85%
space usage   26.06%    9.33%    64.61%
Cluster B Profile (top 2% of files occupy ~40% of space):

              small     medium   large
file count    86.59%    11.38%   2.03%
space usage   23.89%    36.03%   40.08%
Cluster C Profile (dominated by small files):

              small     medium   large
file count    99.64%    0.36%    0.00%
space usage   76.05%    20.75%   3.20%
Choosing Block Layout: given the cluster profiles above, HDFS-EC phase I adopts the striped layout; current HDFS uses the contiguous layout.

Generalizing the Block on the NameNode: logical blocks must be mapped to storage blocks. To avoid tracking too many storage blocks, a hierarchical naming protocol is used:
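One way such a hierarchical scheme can work (a hypothetical sketch, not the exact HDFS-7285 encoding): derive each storage-block ID from the group ID plus an index in the low bits, so the NameNode only needs to track one ID per block group.

```python
INDEX_BITS = 4   # illustrative assumption: low bits hold the in-group index

def storage_block_id(group_id, index_in_group):
    """Derive a storage-block ID from its logical group ID, so the
    NameNode tracks one ID per group rather than one per storage block."""
    return (group_id << INDEX_BITS) | index_in_group

def parse(block_id):
    """Recover (group ID, index within group) from a storage-block ID."""
    return block_id >> INDEX_BITS, block_id & ((1 << INDEX_BITS) - 1)

gid = 42
for i in range(9):   # a (6,3) group: 6 data + 3 parity storage blocks
    assert parse(storage_block_id(gid, i)) == (gid, i)
```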
Client Parallel Writing: a Coordinator manages multiple streamers, each with its own data queue, writing the data and parity blocks of a group in parallel.
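The coordinator/streamer flow can be sketched as follows (a toy model with hypothetical sizes, not the HDFS client code): cells are dealt round-robin into per-streamer queues, and a parity cell is computed for each full stripe, with XOR standing in for Reed-Solomon.

```python
from functools import reduce

CELL = 4            # toy cell size in bytes
DATA_STREAMERS = 3  # toy (3,1) layout: 3 data streamers + 1 parity streamer

def striped_write(data):
    """Deal cells round-robin to data streamer queues; compute one
    parity cell per full stripe (XOR stands in for the real codec)."""
    queues = [[] for _ in range(DATA_STREAMERS + 1)]  # last one = parity
    cells = [data[i:i + CELL] for i in range(0, len(data), CELL)]
    for start in range(0, len(cells), DATA_STREAMERS):
        stripe = cells[start:start + DATA_STREAMERS]
        for j, cell in enumerate(stripe):
            queues[j].append(cell)
        if len(stripe) == DATA_STREAMERS:             # full stripe -> parity
            parity = bytes(reduce(lambda a, b: a ^ b, col)
                           for col in zip(*stripe))
            queues[-1].append(parity)
    return queues

q = striped_write(b"abcdefghijkl")   # 12 bytes = 3 cells = 1 full stripe
assert q[0] == [b"abcd"] and q[1] == [b"efgh"] and q[2] == [b"ijkl"]
```

Each queue is then drained by its own streamer to a different DataNode, which is what makes the writes parallel.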
Client Parallel Reading: the data blocks of a group are read in parallel; parity blocks are fetched when a missing piece must be reconstructed.
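Reconstruction-on-read can be sketched the same way (toy code, XOR standing in for the real codec): if one data block is unavailable, the client also fetches parity and rebuilds the missing piece before assembling the result.

```python
from functools import reduce

def striped_read(blocks, parity):
    """Parallel read with on-the-fly reconstruction: if one data block is
    missing (None), rebuild it from the survivors plus the XOR parity."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if missing:
        i = missing[0]   # this sketch handles a single XOR-protected loss
        survivors = [b for b in blocks if b is not None] + [parity]
        blocks[i] = bytes(reduce(lambda a, b: a ^ b, col)
                          for col in zip(*survivors))
    return b"".join(blocks)

data = [b"ab", b"cd", b"ef"]
par = bytes(x ^ y ^ z for x, y, z in zip(*data))
assert striped_read([b"ab", None, b"ef"], par) == b"abcdef"
```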
Reconstruction on DataNode

- Important to avoid delay on the critical path, especially if original data is lost
- Integrated with the Replication Monitor: under-protected EC blocks are scheduled together with under-replicated blocks, using new priority algorithms
- New ErasureCodingWorker component on the DataNode
Data Checksum Support

- Supports getFileChecksum for EC striped-mode files
- Checksums are comparable between striped files with the same content
- Checksums are not comparable between a contiguous file and a striped file
- Can reconstruct on the fly if a block is found missing while computing
- Planned: a new version of getFileChecksum, to make checksums comparable between contiguous and striped files
Roadmap: Hardware-accelerated Codec Framework; Performance Evaluation
Acceleration with Intel ISA-L

- 1 legacy coder, from Facebook's HDFS-RAID project
- 2 new coders: a pure Java coder (a code improvement over HDFS-RAID) and a native coder using Intel's Intelligent Storage Acceleration Library (ISA-L)

Why is ISA-L fast?

- Tables are pre-computed and reused
- Parallel operation (SIMD instructions)
- Direct ByteBuffer (no copying between the Java heap and native code)
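The "pre-computed and reused" point can be illustrated with a toy GF(2^8) multiplier (ISA-L itself is SIMD C code; this only shows the principle): the multiplication table is built once, so the encoding hot path becomes pure table lookups.

```python
def gf_mul(a, b, poly=0x11d):
    """Slow bitwise GF(2^8) multiply, using the common 0x11d polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

COEFF = 3                                        # one RS matrix coefficient
TABLE = [gf_mul(COEFF, x) for x in range(256)]   # computed once, reused

def scale(data):
    """Hot path: multiply every byte by COEFF via lookups only."""
    return bytes(TABLE[x] for x in data)

assert scale(b"\x01\x02") == bytes([gf_mul(3, 1), gf_mul(3, 2)])
```

ISA-L applies the same idea with vectorized table lookups over many bytes per instruction, which is where most of its speedup comes from.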
Microbenchmark: Codec Calculation [charts omitted]

Microbenchmark: HDFS I/O [charts omitted]
DFSIO / MapReduce
Hive-on-MR — locality sensitive
Hive-on-Spark — locality sensitive
Conclusion

- Erasure coding expands effective storage space by ~50%!
- HDFS-EC phase I implements erasure coding in the striped block layout
- Upstream effort (HDFS-7285): design finalized Nov. 2014; development started Jan. 2015; 218 commits, ~25k LoC changed; broad collaboration across Cloudera, Intel, Hortonworks, Huawei, Yahoo, LinkedIn
- Phase II will support the contiguous block layout for better locality
Acknowledgements

- Cloudera: Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
- Intel: Kai Zheng, Rakesh R, Yi Liu, Weihua Jiang, Rui Li
- Hortonworks: Jing Zhao, Tsz Wo Nicholas Sze
- Huawei: Vinayakumar B, Walter Su, Xinwei Qin
- Yahoo (Japan): Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
Questions?
Zhe Zhang, [email protected] | @oldcap | http://zhe-thoughts.github.io/
Uma Gangumalla, [email protected] | @UmaMaheswaraG
http://blog.cloudera.com/blog/2016/02/progress-report-bringing-erasure-coding-to-apache-hadoop/
Come See Us at Intel, Booth 305: "Amazing Analytics from Silicon to Software"

- Intel powers analytics solutions that are optimized for performance and security, from silicon to software
- Intel unleashes the potential of Big Data to enable advancement in healthcare/life sciences, retail, manufacturing, telecom, and financial services
- Intel accelerates advanced analytics and machine learning solutions

Twitter: #HS16SJ
LinkedIn Hadoop Meetup: "Dali: LinkedIn's Logical Data Access Layer for Hadoop"
Thu 6/30, 6~9PM @LinkedIn, 2nd floor, Unite room, 2025 Stierlin Ct, Mountain View

"Dr. Elephant: performance monitoring and tuning" at SFHUG in August
Backup