Erasure Coding in Windows Azure Storage Cheng Huang Microsoft Corporation Joint work with Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin


Page 1: Erasure Coding in Windows Azure Storage - SNIA · Erasure Coding in Windows Azure Storage Cheng Huang Microsoft Corporation . Joint work with Huseyin Simitci, Yikang Xu, Aaron Ogus,

2012 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.

Erasure Coding in Windows Azure Storage

Cheng Huang Microsoft Corporation

Joint work with Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin


Outline

Introduction to Windows Azure Storage (WAS)

Conventional Erasure Coding in WAS

Local Reconstruction Coding in WAS


[Figure: map of Windows Azure locations in the North America, Europe, and Asia Pacific regions; legend: major datacenter, CDN PoPs]


Windows Azure Storage

Abstractions
• Blobs – File store in the cloud
• CDN – High performance file delivery through proximity caching
• Drives – Durable NTFS volumes for Windows Azure applications
• Tables – Massively scalable NoSQL storage
• Queues – Reliable storage and delivery of messages

Easy client access
• Easy to use REST APIs and Client Libraries
• Existing NTFS APIs for Windows Azure Drives


Massive Distributed Storage Systems in Cloud

Failures are the norm rather than the exception
• As the number of components increases, so does the probability of failure
• $\mathrm{MTTF}_{\mathrm{First}} = \mathrm{MTTF}_{\mathrm{One}} / n$

Redundancy is necessary to cope with failures
• Replication vs. Erasure Coding?
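The relation on this slide, that the mean time to the first failure among n components is the single-component MTTF divided by n, can be sketched as below. This is a minimal illustration, assuming identical and independent components; the function name and the sample numbers are hypothetical and do not come from the talk.

```python
def mttf_first(mttf_one: float, n: int) -> float:
    """Mean time to the first failure among n identical, independent components.

    Implements MTTF_First = MTTF_One / n from the slide.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return mttf_one / n


# Hypothetical numbers: drives rated at 1,000,000 hours, 10,000 drives in a cluster.
hours_to_first_failure = mttf_first(1_000_000, 10_000)
```

With these made-up inputs the cluster sees a first failure roughly every 100 hours, which is why failures have to be treated as routine at scale.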


Replication vs. Erasure Coding

[Figure: replication vs. erasure coding. Replication keeps extra full copies of a=2 and b=3; erasure coding keeps a=2 and b=3 plus the parity a+b=5]


Replication vs. Erasure Coding

[Figure: replication vs. erasure coding. Replication keeps extra full copies of a=2 and b=3; erasure coding keeps a=2 and b=3 plus two parities, a+b=5 and 2a+b=7]
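The toy example on these slides (store a, b, a+b and 2a+b) can be checked in a few lines. The sketch below is only an illustration of that arithmetic, not the coding scheme used in production; the function names are hypothetical. It solves the 2x2 linear system given by any two surviving values.

```python
from fractions import Fraction

# Each stored value is a linear combination (coef_a, coef_b) of the two data values.
COEFFS = [(1, 0), (0, 1), (1, 1), (2, 1)]  # a, b, a+b, 2a+b


def encode(a, b):
    """Return the four stored values for data (a, b)."""
    return [ca * a + cb * b for ca, cb in COEFFS]


def decode(survivors):
    """Recover (a, b) from any two surviving values given as {index: value}."""
    (i, vi), (j, vj) = list(survivors.items())[:2]
    (c1a, c1b), (c2a, c2b) = COEFFS[i], COEFFS[j]
    det = c1a * c2b - c1b * c2a  # non-zero for every pair of rows in COEFFS
    a = Fraction(vi * c2b - c1b * vj, det)  # Cramer's rule
    b = Fraction(c1a * vj - vi * c2a, det)
    return a, b


stored = encode(2, 3)  # the slide's values: 2, 3, 5, 7
both_originals_lost = decode({2: stored[2], 3: stored[3]})
```

Any two of the four values are enough to get a and b back, so this layout survives two losses while storing four values; surviving two losses with plain replication would take three copies of each, six values in total.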


WAS Stream Layer

Append-only distributed file system

Provides intra-stamp redundancy

Streams are very large files
• Has file system like namespace
• Ordered list of pointers to extents

Extents
• Unit of replication
• Sequence of blocks

[Figure: the Storage Location Service and two storage stamps; each stamp has a Partition Layer and a Stream Layer with intra-stamp replication, and inter-stamp (geo) replication runs between the stamps]
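The stream and extent abstractions listed on this slide can be pictured as a small data structure. The sketch below is an illustration only: the class names, the `sealed` flag, and the rule that appends go to the last unsealed extent are assumptions made for the example, not the actual stream layer interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Extent:
    """Unit of replication: a sequence of blocks, append-only until sealed."""
    blocks: List[bytes] = field(default_factory=list)
    sealed: bool = False

    def append(self, block: bytes) -> None:
        if self.sealed:
            raise ValueError("cannot append to a sealed extent")
        self.blocks.append(block)

    def size(self) -> int:
        return sum(len(b) for b in self.blocks)


@dataclass
class Stream:
    """A named, ordered list of pointers to extents."""
    name: str
    extents: List[Extent] = field(default_factory=list)

    def append(self, block: bytes) -> None:
        # Assumption: only the last extent accepts appends; open a new one if needed.
        if not self.extents or self.extents[-1].sealed:
            self.extents.append(Extent())
        self.extents[-1].append(block)


stream = Stream("/hypothetical/stream")
stream.append(b"abc")
stream.extents[-1].sealed = True  # sealing makes the extent read-only
stream.append(b"de")              # lands in a fresh extent
```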


Replication and Lazy Erasure Coding

Extents are triple-replicated when first created and while being appended


Replica Placement in a Storage Stamp


Replication and Lazy Erasure Coding

Extents are triple-replicated when first created and while being appended

Extents are lazily erasure coded in the background
• extents sealed at around 3 GB
• when erasure coding finishes, full replicas are deleted
• policies to choose between replication, erasure coding, or a mix
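The lifecycle described on this slide (replicate while appending, seal at around 3 GB, erasure code in the background, then delete the full replicas) can be sketched as a small state machine. The state names, the exact threshold constant, and the `next_state` function are hypothetical and serve only to make the ordering of the steps explicit.

```python
from enum import Enum

# "around 3 GB" per the slide; the exact byte count here is an assumption.
SEAL_THRESHOLD_BYTES = 3 * 1024**3


class ExtentState(Enum):
    REPLICATED_OPEN = "triple-replicated, accepting appends"
    REPLICATED_SEALED = "triple-replicated, sealed, waiting for erasure coding"
    ERASURE_CODED = "erasure coded, full replicas deleted"


def next_state(state: ExtentState, size_bytes: int, coding_finished: bool) -> ExtentState:
    """Advance an extent one step through the lazy erasure coding lifecycle."""
    if state is ExtentState.REPLICATED_OPEN and size_bytes >= SEAL_THRESHOLD_BYTES:
        return ExtentState.REPLICATED_SEALED
    if state is ExtentState.REPLICATED_SEALED and coding_finished:
        # Replicas are dropped only after the coded fragments exist.
        return ExtentState.ERASURE_CODED
    return state
```

The point of the ordering is that the data is never protected by less than one complete redundancy scheme: the replicas stay until encoding has finished.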


Conventional Erasure Coding – RS 6+3

[Figure: a sealed extent (3 GB) is split into six data fragments d0 to d5; Reed-Solomon 6+3 adds three parity fragments p1, p2, p3]
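To make the 6+3 layout concrete, here is a minimal Reed-Solomon-style sketch: the six data symbols are treated as values of a degree-5 polynomial, the three parities are further evaluations of it, and any six survivors determine the polynomial again. For readability it works over the prime field GF(257) with Lagrange interpolation; that field, the function names, and the symbol-level granularity are assumptions for the example (production codes typically use GF(2^8) and operate on large fragments).

```python
from itertools import combinations

P = 257  # small prime field chosen for readability; an assumption of this sketch


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # pow(.., -1, P): Python 3.8+
    return total


def rs_encode(data, n_parity=3):
    """Systematic encoding: data symbols sit at x = 0..k-1, parities at x = k..k+n_parity-1."""
    k = len(data)
    points = list(enumerate(data))
    return list(data) + [lagrange_eval(points, k + i) for i in range(n_parity)]


def rs_recover(survivors, k):
    """Rebuild the k data symbols from any k surviving symbols given as {position: value}."""
    points = list(survivors.items())[:k]
    return [lagrange_eval(points, x) for x in range(k)]


def tolerates_any_loss(data, n_parity=3):
    """Check that every pattern of n_parity lost symbols is recoverable."""
    coded = rs_encode(data, n_parity)
    n = len(coded)
    for lost in combinations(range(n), n_parity):
        survivors = {i: coded[i] for i in range(n) if i not in lost}
        if rs_recover(survivors, len(data)) != list(data):
            return False
    return True
```

Note that recovering even a single symbol needs six survivors, which is the reconstruction cost the later slides set out to reduce.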


Erasure Coding Flow

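[Figure: three stream managers (SM) forming a Paxos group and extent nodes (EN 1, EN 2, EN 3, ..., EN); the SM tells an EN to start erasure coding an extent, and the EN acks with metadata]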


Space Savings (RS 6+3 over 3-replication)

[Figure: space savings chart; legend: Savings, Replicated, Data Fragments, Code Fragments]


How to Further Reduce Storage Cost?

sealed extent (3 GB)

d0 d1 d2 d3 d4 d5 p0 p1 p2
overhead: (6+3)/6 = 1.5x


How to Further Reduce Storage Cost?

sealed extent (3 GB)

d0 d1 d2 d3 d4 d5 p0 p1 p2
overhead: (6+3)/6 = 1.5x

d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 p0 p1 p2 p3
overhead: (12+4)/12 = 1.33x
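The overhead figures on this slide follow from a single ratio. A minimal sketch, with hypothetical helper names, that also compares each code against 3-replication (overhead 3x):

```python
def storage_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Bytes stored per byte of user data: (data + parity) / data."""
    return (data_fragments + parity_fragments) / data_fragments


def savings_vs(baseline_overhead: float, new_overhead: float) -> float:
    """Fraction of raw capacity saved by moving from the baseline to the new scheme."""
    return 1 - new_overhead / baseline_overhead


rs_6_3 = storage_overhead(6, 3)     # 1.5
rs_12_4 = storage_overhead(12, 4)   # about 1.33
saved_over_replication = savings_vs(3.0, rs_6_3)  # half the raw capacity of 3 replicas
```

Doubling the number of data fragments lowers the overhead, but as the following slides show, it also doubles the number of fragments a reconstruction read has to touch.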


Challenge

[Figure: 12+4 layout with data fragments d0 to d11 and parities p0 to p3, annotated "reconstruction read"]


Reconstruction Read

[Figure: the Partition Layer, three stream managers (SM) forming a Paxos group, and extent nodes (EN 1, EN 2, EN 3, ..., EN); arrows labeled "data read" and "reconstruction read"]


Challenge

Reconstruction read is expensive
• reading d0 while it is unavailable requires 12 fragments (12 disk I/Os, 12 network transfers)

[Figure: 12+4 layout with data fragments d0 to d11 and parities p0 to p3]


Reconstruction Read – When?

Load balancing
• avoid hot storage nodes
• serve reads via reconstruction

Rolling upgrade

Transient unavailability and permanent failures

Can we achieve 1.33x overhead while requiring only 6 fragments for reconstruction?


Opportunity

Conventional EC
• all failures are equal
• same reconstruction cost, regardless of the number of failures

Cloud storage
• Prob(1 failure) >> Prob(2 or more failures)

Optimize erasure coding for cloud storage
• make single-failure reconstruction most efficient
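The claim Prob(1 failure) >> Prob(2 or more failures) can be illustrated with a simple binomial model. The sketch below assumes each of the 16 fragments of one coded extent is unavailable independently with the same probability; that model and the value p = 0.01 are hypothetical and chosen only for illustration, not taken from the talk.

```python
from math import comb


def prob_exactly(n: int, k: int, p: float) -> float:
    """Binomial probability that exactly k of n fragments are unavailable."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 16, 0.01  # hypothetical: 16 fragments, 1% independent unavailability each
p_one = prob_exactly(n, 1, p)
p_two_or_more = 1 - prob_exactly(n, 0, p) - p_one
ratio = p_one / p_two_or_more
```

Under these assumptions single failures are more than ten times as likely as multiple failures, which is the asymmetry the slide proposes to exploit.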


Local Reconstruction Code

[Figure: a sealed extent (3 GB) split into two groups of data fragments, x0 to x5 and y0 to y5]

• LRC12+2+2: 12 data fragments, 2 local parities and 2 global parities
• storage overhead: (12 + 2 + 2) / 12 = 1.33x
• Local parity: reconstruction requires only 6 fragments
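The effect of a local parity can be sketched with plain XOR: one lost fragment in a group of six is rebuilt from the five surviving group members plus the group's local parity, so six reads instead of twelve. Using XOR for the local parity and the helper names below are assumptions of this sketch; the global parities are left out entirely.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def local_parity(group):
    """Local parity of one group, assumed here to be the XOR of its data fragments."""
    return xor_blocks(group)


def reconstruct(group, parity, lost_index):
    """Rebuild one lost fragment; returns (fragment, number of fragments read)."""
    reads = [frag for i, frag in enumerate(group) if i != lost_index] + [parity]
    return xor_blocks(reads), len(reads)


# Hypothetical 4-byte fragments x0..x5.
x_group = [bytes([i] * 4) for i in range(1, 7)]
px = local_parity(x_group)
rebuilt, fragments_read = reconstruct(x_group, px, lost_index=2)
```

The read stays inside one group, so the other group and both global parities are not touched for the common single-failure case.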


One More Thing – Ensuring Reliability in LRC

LRC12+2+2 needs to recover
• arbitrary 3 failures
• as many 4 failures as possible


Recover 3 Failures

1st example

[Figure: LRC layout with data fragments x0 to x5 and y0 to y5, local parities px and py, and global parities q0 and q1]

• recover y1 from py (group y)


Recover 3 Failures

1st example

[Figure: LRC layout with data fragments x0 to x5 and y0 to y5, local parities px and py, and global parities q0 and q1]

• recover y1 from py (group y)
• recover x0 and x2 from q0 and q1


Recover 3 Failures

2nd example

[Figure: LRC layout with data fragments x0 to x5 and y0 to y5, local parities px and py, and global parities q0 and q1]


Recover 3 Failures

2nd example

[Figure: LRC layout with data fragments x0 to x5 and y0 to y5, local parities px and py, and reduced global parities q'0 and q'1]

• cancel all the yi's from the global parities, giving q'0 and q'1


Recover 4 Failures

1st example

[Figure: LRC layout with data fragments x0 to x5 and y0 to y5, local parities px and py, and global parities q0 and q1]

• 3 parities cannot recover 4 failures
• py is redundant


Recover 4 Failures

2nd example

[Figure: LRC layout with data fragments x0 to x5 and y0 to y5, local parities px and py, and global parities q0 and q1]

Maximal Recoverability
• swap failure with local parity
• count remaining failures and global parities
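The counting rule above can be turned into a short enumeration. The sketch below is one reading of that rule, assuming a maximally recoverable code: each group that contains a failure absorbs one of them with its local parity, and the pattern is recoverable when the remaining failures do not exceed the number of global parities. The fragment ordering and function names are hypothetical.

```python
from itertools import combinations

# LRC 12+2+2: group x = x0..x5 + px, group y = y0..y5 + py, then q0 and q1.
FRAGMENT_GROUP = ["x"] * 7 + ["y"] * 7 + ["global"] * 2
GLOBAL_PARITIES = 2


def recoverable(failed) -> bool:
    """Apply the counting rule to a tuple of failed fragment indices."""
    groups_hit = {FRAGMENT_GROUP[i] for i in failed} - {"global"}
    remaining = len(failed) - len(groups_hit)  # one failure per hit group is absorbed
    return remaining <= GLOBAL_PARITIES


def recoverable_fraction(n_failures: int) -> float:
    """Fraction of all n-failure patterns that the rule declares recoverable."""
    patterns = list(combinations(range(len(FRAGMENT_GROUP)), n_failures))
    return sum(recoverable(p) for p in patterns) / len(patterns)


three = recoverable_fraction(3)  # every 3-failure pattern
four = recoverable_fraction(4)   # 1568 of 1820 patterns, about 86%
```

Under this rule the 4-failure patterns that cannot be recovered are exactly those that leave one of the two local groups untouched, so that its local parity contributes nothing.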


Properties of LRC

Achieving maximal recoverability
• LRC12+2+2: arbitrary 3 failures and 86% of 4 failures
• reliability: RS12+4 > LRC12+2+2 > RS6+3
• technical paper published at USENIX ATC 2012

Requiring minimum storage overhead, given reconstruction cost and fault tolerance
• technical paper to appear in IEEE Trans. on Information Theory


Cost & Performance Analysis

Vary LRC parameters
• trade-off points in 3D space: storage overhead, reconstruction cost, reliability (MTTDL)

Reliability is a hard requirement
• set the MTTDL of 3-replication as the target
• reduce trade-off space to 2D


LRC vs. Reed-Solomon

[Figure: reconstruction read cost (y-axis, 0 to 12) versus storage overhead (x-axis, 1.2 to 2) for Reed-Solomon and LRC. Labeled points: RS6+3, RS10+4, RS12+4, LRC (12+2+2), LRC (12+4+2). Annotations: same overhead, half cost (6 → 3); same cost, overhead 1.5x → 1.33x]

• RS10+4: HDFS-RAID at Facebook
• RS6+3: GFS II (Colossus) at Google
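The labeled points in this comparison can be recomputed from the code parameters. In the sketch below, the reconstruction cost of a single failure is taken as k for Reed-Solomon k+m and as k divided by the number of local groups for LRC; the function names are hypothetical and the model ignores multi-failure reads.

```python
def rs_point(k: int, m: int):
    """(storage overhead, single-failure reconstruction cost) for Reed-Solomon k+m."""
    return (k + m) / k, k


def lrc_point(k: int, local: int, global_: int):
    """Same pair for LRC k+local+global, with k data fragments split into `local` groups."""
    return (k + local + global_) / k, k // local


points = {
    "RS6+3": rs_point(6, 3),
    "RS10+4": rs_point(10, 4),
    "RS12+4": rs_point(12, 4),
    "LRC (12+2+2)": lrc_point(12, 2, 2),
    "LRC (12+4+2)": lrc_point(12, 4, 2),
}
```

Compared with RS6+3, LRC (12+4+2) keeps the 1.5x overhead and halves the cost to 3, while LRC (12+2+2) keeps the cost of 6 and lowers the overhead to about 1.33x, matching the two annotations on the plot.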


Choice of Windows Azure Storage

RS (6 + 3): reconstruction cost = 6
LRC (14 + 2 + 2): reconstruction cost = 7
RS (14 + 4): reconstruction cost = 14

14% savings
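One reading of the 14% figure that matches the numbers on this slide is the storage overhead of LRC (14 + 2 + 2) relative to RS (6 + 3). The slide does not spell out the derivation, so the following is a worked check rather than the authors' own calculation:

$$
\frac{14+2+2}{14} = \frac{18}{14} \approx 1.29\text{x}, \qquad
\frac{6+3}{6} = 1.5\text{x}, \qquad
1 - \frac{18/14}{9/6} = 1 - \frac{6}{7} = \frac{1}{7} \approx 14\%
$$

The reconstruction cost rises only from 6 to 7 fragments, whereas RS (14 + 4) at the same overhead would need 14.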


Encoding/Decoding Complexity

Optimized erasure coding implementation
• homomorphic transformation: convert complex finite field matrix operations to XOR'ing large data blocks
• XOR scheduling further improves throughput

Encoding/decoding overhead negligible, compared with network/disk overheads
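The general idea behind turning finite field arithmetic into XORs can be sketched as follows. Multiplying by a constant in GF(2^8) is linear over GF(2), so it can be written as a fixed pattern of XORs between bit planes of the data, and each XOR then covers a whole block at once. This is an illustration of that principle only; the field polynomial 0x11D, the bit-plane layout, and all function names are assumptions, and the talk does not describe the actual implementation at this level.

```python
def gf_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Reference multiply in GF(2^8): shift-and-add with reduction by poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def xor_plan(c: int):
    """For constant c, list for each output bit i the input bits j to XOR together."""
    cols = [gf_mul(c, 1 << j) for j in range(8)]  # image of each single input bit
    return [[j for j in range(8) if (cols[j] >> i) & 1] for i in range(8)]


def to_planes(symbols):
    """Bit-slice: plane j is an integer whose bit t is bit j of symbols[t]."""
    planes = [0] * 8
    for t, s in enumerate(symbols):
        for j in range(8):
            if (s >> j) & 1:
                planes[j] |= 1 << t
    return planes


def from_planes(planes, n):
    """Inverse of to_planes for n symbols."""
    return [sum(((planes[j] >> t) & 1) << j for j in range(8)) for t in range(n)]


def mul_const_by_xor(c: int, symbols):
    """Multiply every symbol by c using only XORs of whole bit planes."""
    planes = to_planes(symbols)
    out = []
    for inputs in xor_plan(c):
        acc = 0
        for j in inputs:
            acc ^= planes[j]  # one XOR covers the entire block
        out.append(acc)
    return from_planes(out, len(symbols))
```

The XOR pattern for a given constant is fixed, so it can be computed once and reused; which XORs can be shared between outputs is the kind of question that XOR scheduling addresses.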


Additional Lessons

I/O scheduling
• reconstruction/recovery/on-demand traffic prioritized and throttled carefully
• erasure encoding rate kept up with data injection rate

Data consistency
• checksum handoff and verification between all levels
• scrub periodically


Summary

Erasure coding enables significant storage cost savings in Windows Azure Storage with higher reliability than 3-replication

LRC achieves additional 14% savings without compromising performance

Windows Azure Storage Team Blog http://blogs.msdn.com/b/windowsazurestorage/
