native erasure coding support inside hdfs presentation

116
HDFS Erasure Coding, Zhe Zhang

Upload: lin-bao

Post on 07-Jan-2017


TRANSCRIPT

Page 1: Native erasure coding support inside hdfs presentation

HDFS Erasure Coding

Zhe Zhang

Page 5: Native erasure coding support inside hdfs presentation

Replication is Expensive

§ HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
§ 200% storage overhead
§ Secondary replicas rarely accessed

[Figure: NameNode and one Block stored as three Replicas, on DataNode0, DataNode1 and DataNode2]

Page 14: Native erasure coding support inside hdfs presentation

Erasure Coding Saves Storage

§ Simplified Example: storing 2 bits
§ Same data durability
- can lose any 1 bit
§ Half the storage overhead
§ Slower recovery

Replication: 1 0 stored with a copy 1 0 (2 extra bits)
XOR Coding: 1 0 stored with 1 ⊕ 0 = 1 (1 extra bit)
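The 2-bit example can be checked with a few lines of Python. This is a minimal sketch of plain XOR parity only; the function names `xor_parity` and `recover` are made up for illustration and are not HDFS APIs.

```python
def xor_parity(bits):
    """Return the single parity bit: the XOR of all data bits."""
    parity = 0
    for b in bits:
        parity ^= b
    return parity


def recover(stored, lost_index):
    """Rebuild the cell at lost_index by XOR-ing every surviving cell.

    `stored` is the data bits followed by the parity bit.
    """
    value = 0
    for i, b in enumerate(stored):
        if i != lost_index:
            value ^= b
    return value


data = [1, 0]
stored = data + [xor_parity(data)]   # XOR coding: 2 data bits + 1 extra bit
replicated = data + data             # replication: 2 data bits + 2 extra bits
```

Recovery has to read every surviving cell, whereas replication only copies one surviving replica, which is one way to read the "Slower recovery" bullet.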

Page 18: Native erasure coding support inside hdfs presentation

Erasure Coding Saves Storage

§ Facebook
- f4 stores 65PB of BLOBs in EC
§ Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
§ Google File System
- Large portion of data stored in EC

Page 22: Native erasure coding support inside hdfs presentation

Roadmap

§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
§ HDFS-EC architecture
- Choosing Block Layout
- NameNode: Generalizing the Block Concept
- Client: Parallel I/O
- DataNode: Background Reconstruction
§ Hardware-accelerated Codec Framework

Page 29: Native erasure coding support inside hdfs presentation

Durability and Efficiency

Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage holds useful data?

3-way Replication:
Data Durability = 2
Storage Efficiency = 1/3 (33%)

[Figure: NameNode and one Block stored as three Replicas on DataNode0, DataNode1 and DataNode2, labeled "useful data" and "redundant data"]

Page 33: Native erasure coding support inside hdfs presentation

Durability and Efficiency

XOR:
Data Durability = 1

X  Y  X ⊕ Y
0  0  0
0  1  1
1  0  1
1  1  0

Recovering a lost Y from X = 0 and X ⊕ Y = 1: Y = 0 ⊕ 1 = 1

Page 35: Native erasure coding support inside hdfs presentation

Durability and Efficiency

XOR:
Data Durability = 1
Storage Efficiency = 2/3 (67%)

X and Y are useful data; X ⊕ Y is redundant data.

X  Y  X ⊕ Y
0  0  0
0  1  1
1  0  1
1  1  0

Page 39: Native erasure coding support inside hdfs presentation

Durability and Efficiency

Reed-Solomon (RS):
Data Durability = 2
Storage Efficiency = 4/6 (67%)
Very flexible!

Page 56: Native erasure coding support inside hdfs presentation

Durability and Efficiency

Scheme                   Data Durability   Storage Efficiency
Single Replica           0                 100%
3-way Replication        2                 33%
XOR with 6 data cells    1                 86%
RS (6,3)                 3                 67%
RS (10,4)                4                 71%
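The table follows directly from the two definitions. As a minimal sketch, assume each scheme is described by a count of data cells and a count of redundant cells, and that every redundant cell buys one tolerated failure (true for replication, single-parity XOR and Reed-Solomon). The names `SCHEMES` and `durability_and_efficiency` are illustrative only.

```python
from fractions import Fraction

# (data cells, redundant cells) per scheme; replication is modeled as
# one data cell plus extra copies.
SCHEMES = {
    "Single Replica": (1, 0),
    "3-way Replication": (1, 2),
    "XOR with 6 data cells": (6, 1),
    "RS (6,3)": (6, 3),
    "RS (10,4)": (10, 4),
}


def durability_and_efficiency(data_cells, redundant_cells):
    """Return (tolerated failures, useful fraction of raw storage)."""
    durability = redundant_cells
    efficiency = Fraction(data_cells, data_cells + redundant_cells)
    return durability, efficiency


def table():
    """Return rows of (scheme, durability, efficiency in whole percent)."""
    rows = []
    for name, (k, m) in SCHEMES.items():
        durability, efficiency = durability_and_efficiency(k, m)
        rows.append((name, durability, round(float(efficiency) * 100)))
    return rows
```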

Page 64: Native erasure coding support inside hdfs presentation

EC in Distributed Storage

Block Layout: Contiguous Layout
Pro: Data Locality
Con: Small Files

[Figure: the file is cut into 128M ranges (0~128M, 128~256M, …, 640~768M); each range is one block: block 0 on DataNode 0, block 1 on DataNode 1, …, block 5 on DataNode 5; parity on DataNode 6]

Page 68: Native erasure coding support inside hdfs presentation

EC in Distributed Storage

Block Layout: Striped Layout
Con: Data Locality
Pro: Small Files
Pro: Parallel I/O

[Figure: the file is cut into 1M cells (0~1M, 1~2M, …, 5~6M, 6~7M, …) spread across block 0 on DataNode 0, block 1 on DataNode 1, …, block 5 on DataNode 5; parity on DataNode 6]
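The difference between the two layouts comes down to how a file offset maps to a block. The sketch below assumes 128M blocks, 1M cells and 6 data blocks per group, as drawn on the slides, and only covers offsets inside the first block group. `locate_contiguous` and `locate_striped` are hypothetical helper names, not HDFS client code.

```python
MB = 1024 * 1024


def locate_contiguous(offset, block_size=128 * MB):
    """Return (block_index, offset_in_block) under the contiguous layout."""
    return offset // block_size, offset % block_size


def locate_striped(offset, cell_size=1 * MB, data_blocks=6):
    """Return (block_index, offset_in_block) under the striped layout.

    Cells are assigned round-robin to the data blocks of one block group,
    so consecutive cells land on different DataNodes.
    """
    cell = offset // cell_size          # which cell of the file
    stripe = cell // data_blocks        # which row of cells
    block_index = cell % data_blocks    # which data block in the group
    return block_index, stripe * cell_size + offset % cell_size
```

Under the contiguous layout the first 128M of a file sits on one DataNode (locality), while under striping even a 6M file touches all six data blocks (parallel I/O, and small files still get coded).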

Page 69: Native erasure coding support inside hdfs presentation

EC in Distributed Storage

Spectrum:

[Figure: storage systems placed on two axes, Replication vs. Erasure Coding and Striping vs. Contiguous: Ceph, Quantcast File System, HDFS, Facebook f4, Windows Azure; Ceph and Quantcast File System each appear twice]

Page 74: Native erasure coding support inside hdfs presentation

Choosing Block Layout

• Assuming (6,3) coding
• Small files: < 1 block
• Medium: 1~6 blocks
• Large: > 6 blocks (1 group)

Cluster A Profile (Top 2% files occupy ~65% space)
              small     medium    large
file count    96.29%    1.86%     1.85%
space usage   26.06%    9.33%     64.61%

Cluster B Profile (Top 2% files occupy ~40% space)
              small     medium    large
file count    86.59%    11.38%    2.03%
space usage   23.89%    36.03%    40.08%

Cluster C Profile (Dominated by small files)
              small     medium    large
file count    99.64%    0.36%     0.00%
space usage   76.05%    20.75%    3.20%
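A profile like the ones above could be computed from a list of file sizes. This is a sketch under stated assumptions: a 128M block size (the slides do not give the clusters' actual block size), (6,3) coding, and a file of exactly 1 to 6 blocks counted as medium. `classify` and `profile` are illustrative names.

```python
MB = 1024 * 1024


def classify(file_size, block_size=128 * MB, data_blocks=6):
    """Bucket a file as small (< 1 block), medium (1~6 blocks) or large."""
    if file_size < block_size:
        return "small"
    if file_size <= data_blocks * block_size:
        return "medium"
    return "large"


def profile(file_sizes, block_size=128 * MB, data_blocks=6):
    """Return {category: (share of file count, share of space usage)}."""
    counts = {"small": 0, "medium": 0, "large": 0}
    space = {"small": 0, "medium": 0, "large": 0}
    for size in file_sizes:
        category = classify(size, block_size, data_blocks)
        counts[category] += 1
        space[category] += size
    total_files = len(file_sizes) or 1   # avoid dividing by zero
    total_bytes = sum(file_sizes) or 1
    return {
        c: (counts[c] / total_files, space[c] / total_bytes) for c in counts
    }
```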

Page 75: Native erasure coding support inside hdfs presentation

Choosing Block Layout

[Figure: plan drawn on two axes, Striping vs. Contiguous and Replication vs. Erasure Coding, marking Current HDFS, Phase 1.1, Phase 1.2, Phase 2 (Future work) and Phase 3 (Future work)]

Page 79: Native erasure coding support inside hdfs presentation

Generalizing Block (NameNode)

Mapping Logical and Storage Blocks
Too Many Storage Blocks?

Hierarchical Naming Protocol:
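One way a hierarchical naming scheme can keep the NameNode from tracking every storage block separately is to derive storage block IDs from the logical block group ID. The sketch below is an assumption made for illustration, not the layout shown on the slide: it reserves the low 4 bits of an ID for the index of a storage block inside its group. `INDEX_BITS`, `storage_block_id` and `split` are hypothetical names.

```python
INDEX_BITS = 4                       # assumed: up to 16 storage blocks per group
INDEX_MASK = (1 << INDEX_BITS) - 1


def storage_block_id(group_id, index):
    """Derive a storage block ID from its group ID and index in the group."""
    if group_id & INDEX_MASK:
        raise ValueError("group IDs must leave the index bits clear")
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("index does not fit in the reserved bits")
    return group_id | index


def split(block_id):
    """Recover (group_id, index) from a storage block ID."""
    return block_id & ~INDEX_MASK, block_id & INDEX_MASK
```

With such a scheme only the group ID needs an entry in the block map; a DataNode reporting a storage block can be matched to its group by masking the ID.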

Page 87: Native erasure coding support inside hdfs presentation

Client Parallel Writing

[Figure: a queue feeding several streamers (streamer, streamer, …, streamer), each writing to its own DataNode, with a Coordinator]

Page 91: Native erasure coding support inside hdfs presentation

Client Parallel Reading

[Figure: the client reading from several DataNodes in parallel, one of them holding parity]

Page 92: Native erasure coding support inside hdfs presentation

Reconstruction on DataNode

§ Important to avoid delay on the critical path
- Especially if original data is lost
§ Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
§ New ErasureCodingWorker component on DataNode
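The slide does not spell out the priority algorithms. One plausible rule, sketched here purely as an assumption, is to put replicated blocks and EC block groups in a single queue ordered by how many further failures each can still survive. The tuple layout and the name `schedule` are made up for the example.

```python
import heapq


def schedule(blocks):
    """Order recovery work so the most endangered blocks come first.

    Each entry is (name, live_units, units_needed_to_read, target_units):
    a 3-way replicated block has needed=1, target=3; an RS (6,3) block
    group has needed=6, target=9.
    """
    heap = []
    for name, live, needed, target in blocks:
        if live < target:                      # under-replicated or under-protected
            remaining_tolerance = live - needed
            heapq.heappush(heap, (remaining_tolerance, name))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

For example, an RS (6,3) group that has lost 3 cells can tolerate no more failures, so it is scheduled ahead of a replicated block that still has 2 of its 3 replicas.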

Page 94: Native erasure coding support inside hdfs presentation

Acceleration with Intel ISA-L

§ 1 legacy coder
- From Facebook’s HDFS-RAID project
§ 2 new coders
- Pure Java: code improvement over HDFS-RAID
- Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)

Page 95: Native erasure coding support inside hdfs presentation

Microbenchmark: Codec Calculation

Page 96: Native erasure coding support inside hdfs presentation

Microbenchmark: HDFS I/O

Page 101: Native erasure coding support inside hdfs presentation

Conclusion

§ Erasure coding expands effective storage space by ~50%!
§ HDFS-EC phase I implements erasure coding in striped block layout
§ Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo (Japan)
§ Phase II will support contiguous block layout for better locality

Page 102: Native erasure coding support inside hdfs presentation

Acknowledgements

§ Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
§ Intel
- Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
§ Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
§ Huawei
- Walter Su, Rakesh R, Xinwei Qin
§ Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng

Page 105: Native erasure coding support inside hdfs presentation

Questions?

Just merged to trunk!

Erasure Coding: A type of Error Correction Coding

Page 110: Native erasure coding support inside hdfs presentation

EC in Distributed Storage

Block Layout: Contiguous
Pro: Data Locality
Con: Small Files

[Figure: file ranges 0~128M, 128~256M, …, 640~768M stored as block 0 on DataNode0, block 1 on DataNode1, …, block 5 on DataNode5 (data); DataNode6 … DataNode8 (parity)]

Page 114: Native erasure coding support inside hdfs presentation

EC in Distributed Storage

Block Layout: Striping
Con: Data Locality
Pro: Small Files
Pro: Parallel I/O

[Figure: 1M cells (0~1M, 1~2M, …, 5~6M, …, 127~128M) spread across block 0 on DataNode0 through DataNode5 (data); DataNode6 … DataNode8 (parity)]

Page 115: Native erasure coding support inside hdfs presentation

Client Parallel Writing

[Figure: DFSStripedOutputStream with dataQueue 0 to dataQueue 4 and DataStreamer 0 to DataStreamer 4, writing blk_1009 to blk_1013 of one blockGroup; the Coordinator handles "allocate new blockGroup"]
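The queue-per-streamer idea can be sketched without any HDFS classes. The code below is an illustration under assumptions: it deals cells round-robin into one queue per data streamer and uses byte-wise XOR as a stand-in for the real Reed-Solomon parity, treating missing cells in a short final stripe as all zero. `stripe_into_queues` is a hypothetical name and is not the `DFSStripedOutputStream` API.

```python
from collections import deque


def stripe_into_queues(data, cell_size, data_streamers):
    """Split data into cells and enqueue them round-robin.

    Returns (data_queues, parity_queue); each stripe contributes at most
    one cell to every data queue and exactly one cell to the parity queue.
    """
    queues = [deque() for _ in range(data_streamers)]
    parity_queue = deque()
    cells = [data[i:i + cell_size] for i in range(0, len(data), cell_size)]
    for start in range(0, len(cells), data_streamers):
        stripe = cells[start:start + data_streamers]
        parity = bytearray(cell_size)          # starts as all zero
        for i, cell in enumerate(stripe):
            queues[i].append(cell)
            for j, byte in enumerate(cell):
                parity[j] ^= byte              # XOR stand-in for RS encoding
        parity_queue.append(bytes(parity))
    return queues, parity_queue
```

Each queue would then be drained by its own streamer thread, so the client writes to all DataNodes of the block group in parallel.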

Page 116: Native erasure coding support inside hdfs presentation

Client Parallel Reading

[Figure: Stripe 0, Stripe 1 and Stripe 2 laid across DataNodes holding data blocks and parity blocks; cells are marked "requested", "recovery read" or "all zero"]
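The "requested" versus "recovery read" distinction can be expressed as a small planning function. This sketch assumes an RS (6,3) stripe in which cells 0 to 5 are data and 6 to 8 are parity, and relies on the fact that any 6 surviving cells suffice to decode. `plan_stripe_read` is an illustrative name; the actual client logic is not shown in the slides.

```python
def plan_stripe_read(requested, failed, data_cells=6, parity_cells=3):
    """Return (normal_reads, recovery_reads) as sorted lists of cell indices.

    normal_reads are requested cells that are still available;
    recovery_reads are extra cells fetched only to rebuild a lost one.
    """
    requested = set(requested)
    failed = set(failed)
    normal = sorted(requested - failed)
    if not (requested & failed):
        return normal, []                      # nothing requested is lost
    alive = [i for i in range(data_cells + parity_cells) if i not in failed]
    if len(alive) < data_cells:
        raise ValueError("too many failures to reconstruct this stripe")
    chosen = list(normal)                      # reuse cells already being read
    for i in alive:
        if len(chosen) >= data_cells:
            break
        if i not in chosen:
            chosen.append(i)
    recovery = sorted(set(chosen) - set(normal))
    return normal, recovery
```

A read that touches a lost cell therefore turns into a read of 6 cells from the stripe, which is why reconstruction on the read path is worth avoiding.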