Hierarchical Erasure Coding: Making Erasure Coding Usable, May 14, 2015


Page 1: Hierarchical Erasure Coding: Making Erasure Coding Usable ... · 3 Today’s Presenters Vishnu Vardhan (@vardhanv) Sr. Mgr, Product Management, Object Storage, NetApp Corporation


Hierarchical Erasure Coding: Making Erasure Coding Usable

May 14, 2015

Page 2

SNIA Legal Notice

•  The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted.
•  Member companies and individual members may use this material in presentations and literature under the following conditions:
    •  Any slide or slides used must be reproduced in their entirety without modification
    •  The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
•  This presentation is a project of the SNIA Education Committee.
•  Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.
•  The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

Page 3

Today’s Presenters

•  Vishnu Vardhan (@vardhanv), Sr. Mgr, Product Management, Object Storage, NetApp Corporation
•  Glyn Bowden, SNIA – CSI Board, HP Helion Professional Services

Page 4

What are we covering and not covering

•  What are we covering
    •  Hierarchical erasure codes – A combination of an outer code and an inner code
•  What are we not covering
    •  Different types of erasure codes
    •  Tradeoffs between these codes

Page 5

Industry Dynamics

•  Disk drives are increasing in size, reaching double-digit terabyte sizes in a few years
•  Bit error rates are much worse
    •  SMR drive error rates are an order of magnitude (10 times) worse than enterprise drives
•  Drive profiles are changing
    •  Some can spin up & spin down without increasing failures
    •  Some are IO limited
    •  Some are great for sequential IO
•  and the cost is lower

Page 6

What is erasure coding?

[Figure: an object is striped and erasure encoded]

Object:    A B C D E F G H I J K L
Stripe:    A B C D E F @ # $
Stripe:    G H I J K L % ^ &

Each stripe holds 6 data fragments and 3 parity fragments.
Any 6 fragments re-constitute the original object.
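As a rough illustration of the "any 6 fragments" property, the sketch below encodes six data symbols into nine fragments and rebuilds the data from every possible set of six survivors. It is a toy Reed-Solomon-style code built on Lagrange interpolation over the prime field GF(257). The field choice and the names `P`, `encode` and `decode` are assumptions made for this example, not something taken from the slides; production systems typically use GF(2^8) arithmetic and optimized libraries.

```python
from itertools import combinations

P = 257  # assumed prime field, just large enough to hold byte values 0..255


def _lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through points [(xi, yi), ...] mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, parity):
    """Systematic encode: fragments 0..k-1 are the data, the rest are parity.

    Parity symbols live in 0..256, so a real byte-oriented code would need
    a different field; this is only a sketch.
    """
    k = len(data)
    points = list(enumerate(data))
    return list(data) + [_lagrange_eval(points, x) for x in range(k, k + parity)]


def decode(survivors, k):
    """Rebuild the k data symbols from any k surviving (index, value) pairs."""
    if len(survivors) < k:
        raise ValueError("not enough fragments to reconstruct")
    points = survivors[:k]
    return [_lagrange_eval(points, x) for x in range(k)]


data = [ord(c) for c in "ABCDEF"]      # 6 data fragments
fragments = encode(data, parity=3)     # 6 + 3 = 9 fragments in the stripe

# Drop any 3 of the 9 fragments; the remaining 6 always recover the data.
for keep in combinations(range(9), 6):
    survivors = [(i, fragments[i]) for i in keep]
    assert decode(survivors, k=6) == data
```

The loop checks all 84 ways of choosing six survivors, which is what "6+3" promises: the stripe tolerates the loss of any three fragments.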

Page 7

How is this different from RAID? Example: 6+3 Reed Solomon

[Figure: an object (A B C D E F G H I J K L) is striped and erasure encoded; the stripe A B C D E F @ # $ holds 6 data fragments plus EC (3 parity), shown next to dual parity RAID]

Erasure coding = RAID*(?) over the network (with some differences)

* RAID is a form of EC. This slide implies that they are different. This is not the case. The slide is attempting to explain implications rather than clarify definitions.

Page 8

Let’s compare Apples to Apples (1/2)

•  Speed of repair is critical: Repair before we lose surviving redundant data.
•  Slower repair rate results in need for greater protection - Increase redundancy by adding parity
•  Speed of repair is dependent on rate at which
    •  Existing data can be read from surviving drives
    •  New parity data can be generated
    •  New parity data can be written back to disk(s)

Page 9

Let’s compare Apples to Apples (2/2)

Assume
•  1Gbps ethernet connection between different disks / nodes
•  4TB drives

                                      | Erasure Coding (8+2 over the network) | De-clustered RAID, dual parity (8+2)
Data read + write rate (MB/s)         | 80% of 1 Gbps ~ 100 MB/s              | 800 MB/s* (~60 disk pool)
Data reconstruction speed (MB/s)      | 10 MB/s                               | 80 MB/s*
Total data to be reconstructed        | 4TB                                   | 4TB
Time to reconstruct the data (hours)  | 111 Hrs                               | 13 Hrs (8 times better)
Mean time to data loss                | ~229,000 years                        | ~1.8 million years (8 times better)

* Based on actual data. 64 disk pool, 778 MB/s throughput, 99.64 MB/s repair speed

•  At similar parity levels, erasure coding
    •  Has a worse time to data loss
    •  Generates an immense amount of network traffic and resulting system latency
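The reconstruction times in this comparison follow from simple arithmetic, sketched below. The slide does not say how the 10 MB/s and 80 MB/s repair rates were derived; dividing the data rate by the stripe width of 10 reproduces both figures, so that model is used here as an assumption. The function names are hypothetical.

```python
def repair_rate_mb_s(data_rate_mb_s, stripe_width):
    """Assumed model: usable repair rate = data rate / stripe width."""
    return data_rate_mb_s / stripe_width


def rebuild_hours(drive_tb, repair_mb_s):
    """Hours to reconstruct one drive, using decimal units (1 TB = 1e6 MB)."""
    return drive_tb * 1e6 / repair_mb_s / 3600


ec_rate = repair_rate_mb_s(100, 10)    # 8+2 over the network, ~100 MB/s link
raid_rate = repair_rate_mb_s(800, 10)  # 8+2 de-clustered RAID, ~60 disk pool

ec_hours = rebuild_hours(4, ec_rate)
raid_hours = rebuild_hours(4, raid_rate)
mttdl_ratio = 1.8e6 / 229_000          # MTTDL figures quoted on the slide

print(f"EC:   {ec_rate:.0f} MB/s, {ec_hours:.1f} hours")
print(f"RAID: {raid_rate:.0f} MB/s, {raid_hours:.1f} hours")
print(f"MTTDL ratio: {mttdl_ratio:.1f}x")
```

Under these assumptions a 4TB drive takes roughly 111 hours to rebuild at 10 MB/s and a little under 14 hours at 80 MB/s, and the quoted MTTDL figures differ by a factor of about 8, in line with the "8 times better" annotations.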

Page 10

But, isn’t erasure coding supposed to be better than RAID?

•  It is..
    •  Increasing parity dramatically increases reliability, e.g. 8+2 to 6+3 (~ triple parity RAID)
•  And it is not..
    •  Extremely large network load impacts
        •  Latency
        •  System throughput & performance
    •  Memory & compute intensive server side processing
    •  System reliability is subject to law of diminishing returns
•  Erasure coding fits
    •  Extremely low IOPS data sets
    •  With a high bandwidth networking connection,
    •  and significant host side resources

Page 11

But this is not the full story..

•  How many copies of your data do you store?
    a)  1
    b)  2
    c)  3+

Page 12

Multi-datacenter use cases

[Figure: three layouts drawn across Site A, Site B and Site C]

•  EC within each site, full copy between sites
    •  Intra site: EC (6+3), 50% overhead per site
    •  Inter site: Full copy
    •  Overhead: 3x (1.5 + 1.5)
•  RAID within each site, full copy between sites
    •  Intra site: RAID (8+2), 25% overhead per site
    •  Inter site: Full copy
    •  Overhead: 2.5x (1.25 + 1.25)
•  EC across sites
    •  Intra site: -
    •  Inter site: EC (6+3), 50% parity data
    •  Overhead: 1.5x (0.5 + 0.5 + 0.5)

•  EC across multiple datacenters is the key use case
•  Network rebuild is orders of magnitude more expensive, data access requires WAN access and makes latencies worse, and there are significant availability challenges across multiple datacenters
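The overhead figures for the three layouts can be reproduced with the short sketch below. The helper names are hypothetical; the only inputs are the k+m spreads and copy counts shown on the slide.

```python
def ec_overhead(data, parity):
    """Raw bytes stored per byte of user data for a data+parity code."""
    return (data + parity) / data


def total_overhead(data, parity, full_copies=1):
    """Overhead when the whole protected data set is kept full_copies times."""
    return full_copies * ec_overhead(data, parity)


layouts = {
    "EC 6+3 per site, full copy at 2 sites": total_overhead(6, 3, full_copies=2),
    "RAID 8+2 per site, full copy at 2 sites": total_overhead(8, 2, full_copies=2),
    "EC 6+3 spread across 3 sites": total_overhead(6, 3),
}

for name, overhead in layouts.items():
    print(f"{name}: {overhead}x")
```

Spreading a single 6+3 code across three sites halves the raw capacity needed compared with keeping two erasure-coded copies, which is why the slide calls multi-site EC the key use case.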

Page 13

Hierarchical EC – The next generation

[Figure: an object stored on nine JBODs, shown two ways: FLAT EC, and Hierarchical EC with a Top Level EC and a 2nd Level EC]

© Copyright NetApp Inc

Page 14

Hierarchical EC – The next generation

[Figure: an object protected by a Top Level EC and a 2nd Level EC]

Create a logical separation between single site failure domains and multi-site issues

•  Top Level EC: Protects against rare and catastrophic failures
•  2nd Level EC: Sufficient redundancy to resolve majority of disk loss events at node level (node level erasure coding)

Page 15

Why Hierarchical EC

•  Repair where it is cheapest - Prevent local failures from becoming systemic problems
•  Benefits
    •  Reduction in overall rebuild traffic
    •  Dramatically lower repair traffic over the WAN
    •  Better support for small object sizes
    •  Better latencies
    •  Better storage efficiency for similar reliability
    •  Operational tasks are much simpler, unlike flat codes
    •  System reliability is mostly constant

Page 16

Repair where it is cheapest & quickest

[Figure: FLAT EC versus Hierarchical EC (Top Level EC and 2nd Level EC) for an object stored on nine JBODs]

•  Local repairs with Node level EC
•  Remote repairs for catastrophic failures
•  Massive reduction in repair traffic
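One way to picture "repair where it is cheapest" is a small decision routine that keeps a repair inside the node whenever the node level code can absorb the failure, and escalates to the top level code only when a node's fragment is lost outright. This is a hypothetical sketch, not the presenters' implementation: the `Layout` and `plan_repair` names are invented, and the 6+3 and 17+3 defaults are the spreads quoted elsewhere in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    top_data: int = 6      # top-level data fragments (one per node)
    top_parity: int = 3    # top-level parity fragments
    node_parity: int = 3   # drive failures a node absorbs locally (17+3)


def plan_repair(layout, failed_drives_per_node):
    """Classify each node's failures as a local or a remote repair.

    failed_drives_per_node[i] is the number of failed drives holding part
    of the same node-level stripe on node i.
    """
    local, remote = [], []
    for node, failed in enumerate(failed_drives_per_node):
        if failed == 0:
            continue
        if failed <= layout.node_parity:
            local.append(node)   # rebuilt in place by node-level EC, no WAN traffic
        else:
            remote.append(node)  # fragment lost; rebuilt from other nodes via top-level EC
    return {
        "local": local,
        "remote": remote,
        "data_loss": len(remote) > layout.top_parity,
    }


plan = plan_repair(Layout(), [1, 0, 0, 3, 0, 0, 4, 0, 0])
print(plan)
```

In this example only the node with four failed drives needs traffic from other nodes; the other two failures never leave their node, which is where the reduction in repair traffic comes from.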

Page 17

Support for smaller object sizes & a larger number of objects

[Figure: FLAT EC versus Hierarchical EC (Top Level EC and 2nd Level EC) for an object stored on nine JBODs]

•  FLAT EC: 1 MB object = 9 x 111 KB chunks
•  Hierarchical EC: 1 MB object = 3 x 333 KB chunks
•  1/3 smaller objects, 3x number of objects

Page 18

Better read latencies – due to WAN performance

[Figure: FLAT EC versus Hierarchical EC (Top Level EC and 2nd Level EC) for an object stored on nine JBODs, with node latencies of 100ms, 100ms, 25ms and 25ms marked]

•  Probability of slow nodes = 1/3
•  Slow node latency = 100ms

•  FLAT EC: Application Visible Latency = 100ms
•  Hierarchical EC: Application Visible Latency = 25ms

Page 19

Better reliability for similar storage efficiency

[Figure: FLAT EC versus Hierarchical EC (Top Level EC and 2nd Level EC) for an object stored on nine JBODs]

•  FLAT EC: Overhead = 1.5x
•  Hierarchical EC: Overhead = 1.5 x 1.17 = 1.76

Page 20

Flat codes do not reflect real world failure domains

[Figure: FLAT EC 6+3, one object across nine 240TB JBODs over SITE A, SITE B and SITE C. Overhead = 1.5x]

•  In the event of a site loss
    •  Loss of even ONE drive results in DATA LOSS
    •  Minimum usable spread is 7+5

[Figure: FLAT EC 7+5, one object across twelve JBODs over SITE A, SITE B and SITE C. Overhead = 1.71x]
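The site-loss argument can be checked with a few lines of arithmetic. The sketch assumes fragments are spread as evenly as possible across sites, so the worst-case site holds the ceiling of (data + parity) / sites fragments; the function names are hypothetical.

```python
import math


def failures_left_after_site_loss(data, parity, sites):
    """Further fragment losses a stripe survives once one whole site is gone.

    Assumes an even spread, so the lost site held ceil((data + parity) / sites)
    fragments. A negative result means the site loss alone already loses data.
    """
    fragments_per_site = math.ceil((data + parity) / sites)
    return parity - fragments_per_site


def overhead(data, parity):
    return (data + parity) / data


for data, parity in ((6, 3), (7, 5)):
    left = failures_left_after_site_loss(data, parity, sites=3)
    print(f"{data}+{parity}: overhead {overhead(data, parity):.2f}x, "
          f"tolerates {left} more failure(s) after a site loss")
```

With 6+3 over three sites a site loss consumes all three parity fragments, so one more drive failure loses data. With 7+5 a site holds four fragments and one fragment of slack remains, at the cost of 1.71x overhead.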

Page 21

Better reliability & storage efficiency tradeoffs

[Figure: FLAT EC 7+5, one object across twelve JBODs over SITE A, SITE B and SITE C. Overhead = 1.71x, Disk Reconstructed @ 8MB/s]

MTTDL = 21 Million billion years

•  The MTTDL above is for one stripe
•  Number of stripes = (failure domains / node * number of nodes) 60^12 ?
•  More stripes = Increased probability of data loss
•  Actual MTTDL is much lower (Refer: Usenix 2013 paper “Copysets: Reducing the Frequency of Data Loss in Cloud Storage”)

Page 22

Better reliability & storage efficiency tradeoffs

[Figure: FLAT EC versus Hierarchical EC for one object across JBODs of 240TB each over SITE A, SITE B and SITE C]

•  FLAT EC (7+5)
    •  Overhead = 1.71x
    •  Disk Reconstructed @ 8MB/s
    •  MTTDL = 21 Million billion years
•  Hierarchical EC
    •  Top Level EC: 6+3
    •  2nd Level EC (17+3)
    •  Overhead = 1.5 x 1.17 = 1.76
    •  Disk Reconstructed @ 80MB/s
    •  Disk Reconstructed @ 11MB/s
•  Remaining figure labels
    •  MTTDL = 4 Billion years
    •  MTTDL = 130,000x better than flat
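When two codes are nested, their overheads multiply. The sketch below works this out exactly with `fractions.Fraction` for the 6+3 top level code and the 17+3 second level code, and compares the result with the flat 7+5 code. The variable names are hypothetical.

```python
from fractions import Fraction


def overhead(data, parity):
    """Raw bytes stored per byte of user data, as an exact fraction."""
    return Fraction(data + parity, data)


top_level = overhead(6, 3)            # 3/2
second_level = overhead(17, 3)        # 20/17
hierarchical = top_level * second_level
flat = overhead(7, 5)                 # 12/7

print(f"top level:    {float(top_level):.2f}x")
print(f"second level: {float(second_level):.2f}x")
print(f"hierarchical: {float(hierarchical):.2f}x")
print(f"flat 7+5:     {float(flat):.2f}x")
```

The exact product is 30/17, about 1.76, so the hierarchical layout costs only a few percent more raw capacity than the flat 7+5 spread at 12/7, about 1.71.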

Page 23

Write Availability
Designing for Availability & Reliability at the same time is hard

•  Availability determined by stripe width
    •  If any one of the 9 nodes suffers a temporary outage, all writes fail
•  Example Workaround
    •  Write to 6+1 nodes and DEFER 6+3
    •  Dramatically affects MTBF

[Figure: FLAT EC, one object across nine 240TB JBODs over SITE A, SITE B and SITE C]

•  MTTDL – 138 Years, Disk Reconstruction @ 11MB/s
•  MTTDL – 540,000 Years, Disk Reconstruction @ 11MB/s
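To see why stripe width drives write availability, the sketch below compares the probability that all 9 nodes are reachable with the probability that at least 7 are (enough for the 6+1 workaround). The 99% per-node availability and the independence of node outages are assumptions chosen purely for illustration; they do not come from the slides.

```python
from math import comb


def all_up(node_availability, nodes):
    """Probability that every node in the stripe is reachable."""
    return node_availability ** nodes


def at_least_up(node_availability, nodes, needed):
    """Probability that at least `needed` of `nodes` are reachable (binomial)."""
    p = node_availability
    return sum(
        comb(nodes, k) * p**k * (1 - p) ** (nodes - k)
        for k in range(needed, nodes + 1)
    )


P_NODE = 0.99  # assumed per-node availability, independent outages

strict = all_up(P_NODE, 9)           # a full 6+3 write needs all 9 nodes
relaxed = at_least_up(P_NODE, 9, 7)  # a 6+1 write needs any 7 of the 9

print(f"write needs 9 of 9 nodes: {strict:.4f}")
print(f"write needs 7 of 9 nodes: {relaxed:.6f}")
```

Even with nodes that are each up 99% of the time, requiring all nine pulls write availability down to roughly 91%. Accepting 6+1 writes restores availability, but data written that way carries less protection until the deferred 6+3 encoding completes.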

Page 24

Write Availability - Hierarchical
Designing for Availability & Reliability at the same time is hard

•  Hierarchical separation enables ‘Site level’ availability domains

[Figure: Hierarchical EC with Top Level EC: 6+3 and 2nd Level EC (17+3); labels: Disk Reconstructed @ 80MB/s, Disk Reconstructed @ 11MB/s, Per site “Availability” domains]

Page 25

Growing storage pools is hard (1/3)

•  Consider
    •  Single node with 60 x 4TB Drives
    •  Node size = 240TB
    •  Number of sites = 3
    •  Infrastructure size = 2PB
•  The layout below is a perfect fit

[Figure: FLAT EC, one object across nine 240TB JBODs over SITE A, SITE B and SITE C]

Page 26

Growing storage pools is hard (2/3)

•  Consider
    •  What if the infrastructure has to grow by 750TB (3 nodes)?
•  EC spread has to change to accommodate the growth
    •  Wider spread is required: 6+3 becomes 8+4
    •  Wider spread is good for reliability – it is bad for everything else
    •  Wider spread is worse for object size, latency, repair traffic

[Figure: FLAT EC, one object across twelve 240TB JBODs (the original nine plus three added) over SITE A, SITE B and SITE C]
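The growth example is easy to reproduce. The sketch assumes a flat code with one fragment per node and an unchanged 1.5x overhead, so the spread is dictated by the node count; `spread_for` is a hypothetical helper.

```python
NODE_TB = 60 * 4  # 60 x 4TB drives per node = 240TB


def spread_for(nodes, overhead=1.5):
    """Flat EC spread with one fragment per node at a fixed overhead (assumption)."""
    data = round(nodes / overhead)
    return data, nodes - data


for nodes in (9, 12):
    data, parity = spread_for(nodes)
    raw_tb = nodes * NODE_TB
    print(f"{nodes} nodes: {raw_tb}TB raw, spread {data}+{parity}, "
          f"{raw_tb / 1.5:.0f}TB usable")

added_tb = 3 * NODE_TB
print(f"capacity added by 3 nodes: {added_tb}TB")
```

Nine nodes give 2160TB raw (the slide's roughly 2PB) with a 6+3 spread. Three more nodes add 720TB, which the slide rounds to 750TB, and force the flat code to widen to 8+4 to keep using every node.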

Page 27

Growing storage pools is hard (3/3)

•  Hierarchical
    •  Identical issues for EC only use cases, but
    •  Multi-use case scenarios become possible: replicated + EC

[Figure: Hierarchical EC with Top Level EC and 2nd Level EC; label: Can scale independent of EC spread]

Page 28

Summary

•  Erasure coding is the future of reliable storage
    •  In an ideal world, with 0ms latency, infinite bandwidth & CPU, wide spreading flat codes are fantastic
•  Multi-site use cases are key for erasure coding
    •  Majority of storage overhead is required for site failure protection; can’t get away from it (e.g. 2x for two sites, 1.5x for three sites)
•  Hierarchical erasure coding solves real world problems better than flat erasure codes in multiple dimensions
    •  Prevent local failures from becoming systemic problems
    •  Dramatically lower WAN repair traffic
    •  Support smaller object sizes
    •  Better latencies
    •  Better storage efficiency & reliability tradeoffs
    •  System reliability is mostly constant
    •  Enable support for multiple use cases

Page 29

After This Webcast

•  This webcast and a copy of the slides will be posted to the SNIA Cloud Storage Initiative (CSI) website and available on-demand
    •  http://www.snia.org/forum/csi/knowledge/webcasts
•  A full Q&A from this webcast, including answers to questions we couldn't get to today, will be posted to the SNIA-CSI blog
    •  http://www.sniacloud.com/
•  Please rate this presentation
•  Follow us!
    •  @SNIAcloud
    •  @vardhanv
    •  @bowdengl

Page 30

Conclusion

Thank You