
Cooperative regenerating codes for distributed storage systems

Kenneth Shum (joint work with Yuchong Hu)

22nd July 2011

Multiple node failures

• Large-scale storage system
  – Google data center, example from Kannan's talk.
  – 800,000 servers, failure rate = 4% per year
  – Repair in 2 days
  – Mean number of failed servers in 2 days ≈ 175.

• The lazy-repair policy in TotalRecall
  – A repair process is triggered only after the number of failed nodes has reached a certain threshold.
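The figure of about 175 failed servers can be checked with a short calculation. This is a minimal sketch that assumes failures are spread uniformly over a 365-day year; the function name and parameters are illustrative only.

```python
def mean_failures(servers, annual_failure_rate, repair_days, days_per_year=365):
    """Expected number of servers that fail during one repair window.

    Assumes failures are spread uniformly over the year (an assumption,
    not something stated on the slide).
    """
    return servers * annual_failure_rate * repair_days / days_per_year


failed = mean_failures(800_000, 0.04, 2)
print(round(failed))  # expected failures during a 2-day repair window
```

With several failures expected per repair window, repairing them one at a time is not the only option, which is what motivates the joint repair considered next.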

Jul, 2011 2kshum

Jointly repair multiple failures

Hu et al. (JSAC, Feb 2010)

[Figure: the surviving storage nodes send data to the newcomers, and the newcomers exchange data among themselves.]

Can we further reduce the repair-bandwidth?

Distributed storage (erasure coding)

[Figure: four storage nodes and a data collector (Wu, Dimakis, ISIT09).]

Node 1: A1, A2
Node 2: B1, B2
Node 3: A1+B1, 2A2+B2
Node 4: 2A1+B1, A2+B2

Data collector: recovers A1, A2, B1, B2.

Naive Repair

[Figure: the four storage nodes (A1, A2), (B1, B2), (A1+B1, 2A2+B2), (2A1+B1, A2+B2). The newcomer downloads A1, A2 and B1, B2 from two surviving nodes and re-encodes them to rebuild a lost node.]

4 packets required.

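Repair with "code alignment"

[Figure: the four storage nodes (A1, A2), (B1, B2), (A1+B1, 2A2+B2), (2A1+B1, A2+B2). The node storing A1, A2 is repaired. Each surviving node sends one packet, the sum of its two stored packets.]

Packets sent to the newcomer:
B1+B2
A1+2A2+B1+B2
2A1+A2+B1+B2

3 packets required.

After subtracting B1+B2 from the other two packets, solve for A1 and A2:
P1 = A1+2A2
P2 = 2A1+A2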
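The alignment step can be sketched in a few lines. The slides do not state the underlying field, so the sketch assumes arithmetic modulo the prime 5 (over GF(3) the system P1 = A1+2A2, P2 = 2A1+A2 would be singular); the function names are illustrative.

```python
P = 5  # assumed prime field size; the slides do not specify the field


def encode(a1, a2, b1, b2):
    """Contents of the four storage nodes in the example."""
    return [
        (a1, a2),
        (b1, b2),
        ((a1 + b1) % P, (2 * a2 + b2) % P),
        ((2 * a1 + b1) % P, (a2 + b2) % P),
    ]


def repair_first_node(node2, node3, node4):
    """Rebuild (A1, A2) from one packet per surviving node."""
    h2 = (node2[0] + node2[1]) % P  # B1 + B2
    h3 = (node3[0] + node3[1]) % P  # A1 + 2*A2 + B1 + B2
    h4 = (node4[0] + node4[1]) % P  # 2*A1 + A2 + B1 + B2
    p1 = (h3 - h2) % P              # A1 + 2*A2 (B terms cancel)
    p2 = (h4 - h2) % P              # 2*A1 + A2
    # Solve [[1, 2], [2, 1]] * (A1, A2) = (p1, p2) by Cramer's rule.
    det_inv = pow((1 * 1 - 2 * 2) % P, P - 2, P)
    a1 = det_inv * (p1 - 2 * p2) % P
    a2 = det_inv * (p2 - 2 * p1) % P
    return a1, a2


nodes = encode(3, 4, 1, 2)
recovered = repair_first_node(nodes[1], nodes[2], nodes[3])
print(recovered == nodes[0])
```

The B1 and B2 terms are aligned into the single combination B1+B2, which is why one packet per helper is enough.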

Multiple failures, separate repair

[Figure: the nodes storing (B1, B2) and (2A1+B1, A2+B2) have failed. Each of the two newcomers downloads 2 packets from each of the two surviving nodes.]

8 packets in total, 4 packets per newcomer

Multiple failures, cooperative repair (I)

[Figure: the two newcomers rebuild (B1, B2) and (2A1+B1, A2+B2) from the surviving nodes (A1, A2) and (A1+B1, 2A2+B2). Packets shown on the links include A1, A2; 2A2+B2; 2A1+B1; and B1, B2.]

Multiple failures, cooperative repair (II)

[Figure: the two newcomers rebuild (B1, B2) and (2A1+B1, A2+B2). One newcomer downloads A1 and A1+B1 from the surviving nodes, the other downloads A2 and 2A2+B2. The newcomers then exchange B2 and 2A1+B1.]
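One reading of the cooperative repair in this example is sketched below: each newcomer downloads one packet from each survivor, then the newcomers exchange one packet each. The assignment of packets to newcomers and the field (integers modulo 5) are assumptions made for illustration, not details confirmed by the slides.

```python
P = 5  # assumed prime field size; not stated on the slides


def encode(a1, a2, b1, b2):
    """Contents of the four storage nodes in the example."""
    return [
        (a1, a2),
        (b1, b2),
        ((a1 + b1) % P, (2 * a2 + b2) % P),
        ((2 * a1 + b1) % P, (a2 + b2) % P),
    ]


def cooperative_repair(node1, node3):
    """Rebuild nodes 2 and 4 from survivors 1 and 3; returns (node2, node4, packets)."""
    a1, a2 = node1
    s1, s2 = node3                  # A1+B1 and 2*A2+B2
    # Phase 1: newcomer X gets A1 and A1+B1, newcomer Y gets A2 and 2*A2+B2.
    b1 = (s1 - a1) % P              # computed by X
    b2 = (s2 - 2 * a2) % P          # computed by Y
    packets = 4
    # Phase 2: X sends 2*A1+B1 to Y, Y sends B2 to X.
    x_to_y = (2 * a1 + b1) % P
    y_to_x = b2
    packets += 2
    new_node2 = (b1, y_to_x)
    new_node4 = (x_to_y, (a2 + b2) % P)
    return new_node2, new_node4, packets


nodes = encode(3, 4, 1, 2)
node2, node4, packets = cooperative_repair(nodes[0], nodes[2])
print(node2 == nodes[1], node4 == nodes[3], packets)
```

Under these assumptions the repair moves 6 packets in total, that is 3 per newcomer, compared with 4 per newcomer for separate repair.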

Outline of the talk

• Is it optimal in terms of repair-bandwidth?
• What is the tradeoff between storage and repair-bandwidth for cooperative repair?
• Can we achieve the Pareto-optimal operating points on the tradeoff curve by linear network coding?
  – Exact repair
  – Functional repair

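Information flow graph

[Figure: information flow graph. The source S feeds the storage nodes, each drawn as an In/Out vertex pair (nodes 1 to 5). The two newcomers (nodes 6 and 7) are drawn as In/Mid/Out vertices, and a data collector connects to Out vertices. Edge labels shown: 1 and 2.]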

Is this regenerating code optimal?

[Figure: the cooperative repair example of the previous slides, repeated.]

First cut

[Figure: information flow graph with the data collector connected to the two newcomers (Out6, Out7). The cut crosses four edges of capacity $\beta_1$ each.]

$B \le 4\beta_1$

Second cut

[Figure: information flow graph with two rounds of repair (In/Mid/Out vertices for nodes 1, 2 and for nodes 3, 4) and the data collector. The cut has capacity $2 + \beta_1 + \beta_2$.]

$B \le 2 + \beta_1 + \beta_2$

A linear programming problem

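• Minimize $2\beta_1 + \beta_2$ (repair bandwidth), where $\beta_1$ is the number of packets downloaded from each surviving node and $\beta_2$ is the number of packets exchanged between the newcomers

• Subject to
  $4 \le 4\beta_1$
  $4 \le 2 + \beta_1 + \beta_2$
  $\beta_1, \beta_2 \ge 0$

The constraints give $\beta_1 \ge 1$ and $\beta_1 + \beta_2 \ge 2$, so $2\beta_1 + \beta_2 \ge 3$.

At least 3 packets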
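The bound of 3 packets can be checked numerically. The sketch below is a brute-force search over a grid of candidate values rather than a real LP solver; it assumes B = 4 and 2 packets stored per node as in the example, and the grid step and search limit are arbitrary choices.

```python
from fractions import Fraction


def min_repair_bandwidth(B=4, alpha=2, step=Fraction(1, 4), limit=4):
    """Grid search for the minimum of 2*b1 + b2 under the two cut constraints.

    b1: packets downloaded from each surviving node.
    b2: packets exchanged between the two newcomers.
    Returns (cost, b1, b2) for the best grid point found.
    """
    best = None
    n = int(limit / step)
    for i in range(n + 1):
        for j in range(n + 1):
            b1, b2 = i * step, j * step
            feasible = B <= 4 * b1 and B <= alpha + b1 + b2
            if feasible:
                cost = 2 * b1 + b2
                if best is None or cost < best[0]:
                    best = (cost, b1, b2)
    return best


print(min_repair_bandwidth())
```

On this grid the minimum is reached at b1 = b2 = 1, which agrees with the lower bound of 3 packets per newcomer.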

Non-homogeneous download traffic

[Figure: information flow graph with the data collector connected to the two newcomers (Out6, Out7). The edges crossing the cut are labelled a, b, c, d.]

$B \le a + b + c + d$

Non-homogeneous traffic

[Figure: information flow graph with two rounds of repair (In/Mid/Out vertices for nodes 1, 2 and for nodes 3, 4) and the data collector. The edges carry non-homogeneous capacities labelled e, f, g, h, i, j. Four different cuts are shown in turn.]

$B \le 2 + f + j$
$B \le 2 + h + i$
$B \le 2 + e + j$
$B \le 2 + g + i$

The same LP problem

• Minimize
• Subject to

At least 3 packets

TRADEOFF BETWEEN STORAGE AND REPAIR-BANDWIDTH

Storage vs Repair-bandwidth

[Figure: storage per node versus repair bandwidth per failed node, comparing one-by-one repair with repairing 3 newcomers jointly. File size = 420, d = 8, k = 4.]

(S., ICC 2011; Kermarrec, Le Scouarnec and Straub, Netcod 2011.)

Fair comparison?

[Figure: newcomers connecting to the surviving nodes under the two repair schemes.]

One-by-one repair: repair degree = 8, number of connections per newcomer = 8
Cooperative repair: number of connections per newcomer = 8 + 2

MBCR and MSCR

[Figure: storage per node versus repair bandwidth per failed node, for one-by-one repair and cooperative repair, with the two extreme points of the cooperative curve marked.]

Minimum bandwidth cooperative repair (MBCR)
Minimum storage cooperative repair (MSCR)
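The two extreme points can be computed from closed-form expressions. The formulas below are not shown on the slides; they are the expressions commonly given in the cited literature for the minimum-storage and minimum-bandwidth points, with r = 1 reducing to one-by-one repair. The function names are illustrative.

```python
from fractions import Fraction


def msr(B, k, d):
    """Minimum-storage point for one-by-one repair: (storage, repair bandwidth)."""
    return Fraction(B, k), Fraction(B * d, k * (d - k + 1))


def mbr(B, k, d):
    """Minimum-bandwidth point for one-by-one repair."""
    value = Fraction(2 * B * d, k * (2 * d - k + 1))
    return value, value


def mscr(B, k, d, r):
    """Minimum-storage cooperative repair point with r newcomers."""
    return Fraction(B, k), Fraction(B * (d + r - 1), k * (d - k + r))


def mbcr(B, k, d, r):
    """Minimum-bandwidth cooperative repair point with r newcomers."""
    value = Fraction(B * (2 * d + r - 1), k * (2 * d - k + r))
    return value, value


# Parameters quoted on the slides: file size 420, d = 8, k = 4, 3 newcomers.
print(msr(420, 4, 8), mscr(420, 4, 8, 3))
print(mbr(420, 4, 8), mbcr(420, 4, 8, 3))
```

For the explicit constructions later in the talk these expressions give 5 packets of storage and bandwidth per node (B = 8, k = d = 2, r = 2) and storage 2 with bandwidth 4 per node (B = 6, k = d = 3, r = 2), matching the numbers quoted there.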

How much can we improve?

[Figure: storage per node versus repair bandwidth per failed node, one-by-one repair compared with joint repair. File size = 2275, d = 30, k = 5.]

When d is large, joint repair does not have significant advantage over one-by-one repair.

How much can we improve?

[Figure: storage per node versus repair bandwidth per failed node, one-by-one repair compared with joint repair. File size = 616, d = 8, k = 4.]

Repair-bandwidth reduction is more prominent when d is not so large.

AN EXPLICIT CONSTRUCTION FOR MINIMUM-BANDWIDTH COOPERATIVE REPAIR

An explicit construction for MBCR

• Minimum repair-bandwidth
• Storage per node
• B = 8 information packets
• n = 4 nodes
• Each node stores 5 packets.
• Repair r = 2 failures simultaneously
• No. of connections for each DC = k = 2
• No. of helpers for each failed node = d = 2

(S., Hu, ISIT 2011.) Require d = k, r = n - d

Min-Bandwidth point

[Figure: storage per node versus repair bandwidth, comparing one-by-one repair with repairing 2 new nodes cooperatively.]

Data Distribution

8 data packets: A, B, C, D, E, F, G, H

A, B, C, D, F+G

C, D, E, F, H+A

E, F, G, H, B+C

G, H, A, B, D+E

Each node stores 5 packets: 4 systematic and 1 parity-check, where + denotes XOR.

Data collection

A, B, C, D, F+G

C, D, E, F, H+A

E, F, G, H, B+C

G, H, A, B, D+E

[Figure: the data collector downloads A, B, C, D from the first node and E, F, G, H from the third node.]

Data collector: A, B, C, D, E, F, G, H

Data collection

A, B, C, D, F+G

C, D, E, F, H+A

E, F, G, H, B+C

G, H, A, B, D+E

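[Figure: the data collector downloads A, B, C, F+G from the first node and D, E, F, H+A from the second node.]

From A, B, C, D, E, F together with F+G and H+A, the data collector solves for G and H. The system of equations is triangular and full-rank, so it recovers A, B, C, D, E, F, G, H.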
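The data-collection property can be checked exhaustively for this small code. The sketch below models packets as integers combined with XOR and, for simplicity, lets the collector read all five packets of each node, whereas the slides show it downloading four packets per node. The names `LAYOUT`, `encode` and `collect` are illustrative.

```python
from itertools import combinations

# Packet labels stored by the four nodes; "X+Y" is the XOR of packets X and Y.
LAYOUT = [
    ["A", "B", "C", "D", "F+G"],
    ["C", "D", "E", "F", "H+A"],
    ["E", "F", "G", "H", "B+C"],
    ["G", "H", "A", "B", "D+E"],
]


def encode(data):
    """Return the four nodes; each maps a packet label to its value."""
    nodes = []
    for row in LAYOUT:
        node = {}
        for label in row:
            value = 0
            for symbol in label.split("+"):
                value ^= data[symbol]
            node[label] = value
        nodes.append(node)
    return nodes


def collect(node_x, node_y):
    """Recover the data packets from the contents of two nodes."""
    known, parities = {}, []
    for node in (node_x, node_y):
        for label, value in node.items():
            if "+" in label:
                parities.append((label.split("+"), value))
            else:
                known[label] = value
    solved = True
    while solved:  # peel parities that have exactly one unknown packet
        solved = False
        for (s, t), value in parities:
            if (s in known) != (t in known):
                missing, present = (t, s) if s in known else (s, t)
                known[missing] = value ^ known[present]
                solved = True
    return known


data = {symbol: index + 1 for index, symbol in enumerate("ABCDEFGH")}
nodes = encode(data)
print(all(collect(x, y) == data for x, y in combinations(nodes, 2)))
```

Every pair of nodes yields at least six systematic packets, and the remaining ones follow from the parity-check packets by back-substitution.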

Exact Repair

A, B, C, D, F+G

C, D, E, F, H+A

E, F, G, H, B+C

G, H, A, B, D+E

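How to repair?

[Figure: the first and third nodes are repaired. The newcomer rebuilding the first node downloads C, D from the second node and A, B from the fourth node. The newcomer rebuilding the third node downloads E, F from the second node and G, H from the fourth node. The newcomers then exchange F+G and B+C.]

Total repair-bandwidth = 10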
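The repair of the first and third nodes can be written out as a short sketch. It follows one reading of the figure (8 packets downloaded from the two survivors, then 2 packets exchanged); the assignment of packets to links and the function names are assumptions, and packets are modelled as integers combined with XOR.

```python
def encode(d):
    """The four nodes of the example; "+" in a label denotes XOR."""
    return [
        {"A": d["A"], "B": d["B"], "C": d["C"], "D": d["D"], "F+G": d["F"] ^ d["G"]},
        {"C": d["C"], "D": d["D"], "E": d["E"], "F": d["F"], "H+A": d["H"] ^ d["A"]},
        {"E": d["E"], "F": d["F"], "G": d["G"], "H": d["H"], "B+C": d["B"] ^ d["C"]},
        {"G": d["G"], "H": d["H"], "A": d["A"], "B": d["B"], "D+E": d["D"] ^ d["E"]},
    ]


def repair_first_and_third(node2, node4):
    """Rebuild nodes 1 and 3 from survivors 2 and 4; returns (node1, node3, packets)."""
    # Phase 1: each newcomer downloads 2 packets from each survivor.
    new1 = {"C": node2["C"], "D": node2["D"], "A": node4["A"], "B": node4["B"]}
    new3 = {"E": node2["E"], "F": node2["F"], "G": node4["G"], "H": node4["H"]}
    packets = 8
    # Phase 2: each newcomer forms the parity packet the other one needs.
    f_plus_g = new3["F"] ^ new3["G"]  # sent from newcomer 3 to newcomer 1
    b_plus_c = new1["B"] ^ new1["C"]  # sent from newcomer 1 to newcomer 3
    new1["F+G"] = f_plus_g
    new3["B+C"] = b_plus_c
    packets += 2
    return new1, new3, packets


data = {symbol: index + 1 for index, symbol in enumerate("ABCDEFGH")}
nodes = encode(data)
new1, new3, packets = repair_first_and_third(nodes[1], nodes[3])
print(new1 == nodes[0], new3 == nodes[2], packets)
```

The repaired nodes hold exactly the packets that were lost, so this is an exact repair, and the packet count matches the total repair-bandwidth of 10.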

Exact Repair

A, B, C, D, F+G

C, D, E, F, H+A

E, F, G, H, B+C

G, H, A, B, D+E

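How to repair?

[Figure: the second and third nodes are repaired. The newcomer rebuilding the second node downloads C, D from the first node and D+E, H+A from the fourth node, and computes E. The newcomer rebuilding the third node downloads B+C, F+G from the first node and G, H from the fourth node, and computes F. The newcomers then exchange E and F.]

Total repair-bandwidth = 10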

Min-Bandwidth point

[Figure: storage per node versus repair bandwidth, comparing one-by-one repair with repairing 2 new nodes cooperatively.]

AN EXPLICIT CONSTRUCTION FOR MINIMUM-STORAGE COOPERATIVE REPAIR


An explicit construction for MSCR

• Minimum repair-bandwidth
• Storage per node
• B = 6 information packets
• n nodes
• Each node stores 2 packets.
• Repair r = 2 failures simultaneously
• No. of connections for each DC = k = 3
• No. of helpers for each failed node = d = 3

(S., ICC 2011.) Require d = k

The min-storage point

[Figure: storage per node versus repair bandwidth per failed node for k = 3, d = 3, r = 2, B = 6, comparing non-cooperative and cooperative repair.]

Cooperative: storage cost per node = 2, repair bandwidth per node = 4

Data retrieval

[Figure: the source data is encoded into two codewords of an MDS code with dimension k = 3. The codewords are spread over the storage nodes, 2 packets per node. The data collector downloads from storage nodes and decodes.]

Repair : phase 1

[Figure: two storage nodes are lost. Each of the two newcomers downloads packets from the surviving storage nodes and decodes.]

Repair: phase 2

[Figure: each newcomer re-encodes what it has decoded, and the two newcomers exchange packets to complete the lost nodes.]

Repair bandwidth per node = 8/2 = 4
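The two repair phases can be sketched with a concrete MDS code. The slides do not name the code, so this example assumes a Reed-Solomon code over the integers modulo 7 with n = 5 nodes, and assumes that each newcomer decodes one of the two codewords; all names and parameters are illustrative.

```python
P = 7                     # assumed prime field size
POINTS = [1, 2, 3, 4, 5]  # assumed evaluation point of each of n = 5 nodes


def poly_eval(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x^2 modulo P."""
    value = 0
    for c in reversed(coeffs):
        value = (value * x + c) % P
    return value


def interpolate_at(xs, ys, x):
    """Lagrange interpolation modulo P: value at x of the polynomial through (xs, ys)."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total


def encode(rows):
    """rows: two messages of k = 3 symbols. Node j stores one symbol of each codeword."""
    return [tuple(poly_eval(row, x) for row in rows) for x in POINTS]


def cooperative_repair(nodes, lost, helpers):
    """Repair the two nodes in `lost` using three `helpers`; returns (repaired, packets)."""
    xs = [POINTS[h] for h in helpers]
    packets = 0
    shares = []
    for row in range(2):
        # Phase 1: the newcomer handling this codeword downloads 3 symbols and decodes.
        ys = [nodes[h][row] for h in helpers]
        packets += len(helpers)
        # Phase 2: it re-encodes the lost symbols and sends one to the other newcomer.
        shares.append({t: interpolate_at(xs, ys, POINTS[t]) for t in lost})
        packets += 1
    repaired = {t: (shares[0][t], shares[1][t]) for t in lost}
    return repaired, packets


nodes = encode([[1, 2, 3], [4, 5, 6]])
repaired, packets = cooperative_repair(nodes, lost=[0, 3], helpers=[1, 2, 4])
print(repaired[0] == nodes[0], repaired[3] == nodes[3], packets)
```

Under these assumptions the repair moves 8 packets for 2 newcomers, which agrees with the repair bandwidth of 4 per node quoted for this construction.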

The construction is optimal

[Figure: storage per node versus repair bandwidth per failed node for k = 3, d = 3, r = 2, B = 6, comparing non-cooperative and cooperative repair.]

Cooperative: storage cost per node = 2, repair bandwidth per node = 4

EXISTENCE OF COOPERATIVE REGENERATING CODES UNDER FUNCTIONAL REPAIR


Existence of optimal linear regenerating codes in general

• Sustainable storage system
  – Will it work after arbitrarily many repairs?

• Technical difficulty: The information flow graph is unbounded.

• Can we work over a fixed finite field, for an unlimited number of regenerations?
  – Yes, if we can construct an exact regenerating code.
  – The answer is also "yes" for cooperative functional repair in general.

(S., Hu, Netcod 2011.)

Trellis structure

[Figure: trellis with stages 0, 1, 2, ...]

m: message vector (row vector)
$T_0$ is the "transfer matrix" in stage 0.
The vectors after stages 0, 1 and 2 are $mT_0$, $mT_0T_1$ and $mT_0T_1T_2$.

Flow in information flow graph

[Figure: information flow graph with source S, storage nodes Out1 to Out4, two rounds of repair (In/Mid/Out vertices for nodes 1, 2 and then for nodes 3, 4) and a data collector DC. Edges are labelled with capacities and flow values.]

The cut-set bound says that the cut capacity is at least 8.

Can we construct a flow with value 8?

Cross-sectional flow pattern

[Figure: information flow graph with source S, storage nodes Out1 to Out4, repair stages drawn with In/Mid/Out vertices, and a data collector DC. The flow values crossing each cross-section between stages are marked.]

A recursive construction of flow

[Figure: one segment of the information flow graph between stage s and stage s+1, with In/Mid/Out vertices for the repaired nodes. The flows entering the segment are labelled g1, g2, g3, g4 and the flows leaving it are labelled h1, h2, h3, h4.]

1. Identify a set of cross-section flow patterns, say H.

2. For any cross-section flow pattern (h1, h2, h3, h4) in H at stage s+1, we can find a flow in this segment of the graph such that (g1, g2, g3, g4) is also in H.

3. Each pattern corresponds to a submatrix of the transfer matrix.

4. By the Schwartz-Zippel lemma, we can find local encoding vectors such that the determinants of all these submatrices are non-zero, provided the finite field is sufficiently large.

Summary

• Multiple node failures in medium-scale to large-scale storage systems
• Formulation as a linear program
• Functional repair: a linear regenerating code over a fixed finite field that matches the cut-set bound on repair-bandwidth exists.
• Exact repair: two families of explicit code constructions
  – Minimum-bandwidth point: d = k, r = n - d
  – Minimum-storage point: d = k, r arbitrary

References

• Y. Wu and A. G. Dimakis, Reducing repair traffic for erasure coding-based storage via interference alignment, ISIT, Jul, 2009.

• Y. Hu, Y. Xu, X. Wang, C. Zhan and P. Li, Cooperative recovery of distributed storage systems from multiple losses with network coding, J. Sel. Area Comm., vol. 28, no. 2, pp. 268-275, Feb, 2010.

• K. W. Shum, Cooperative Regenerating Codes for Distributed Storage Systems, ICC, Jun, 2011.

• A.-M. Kermarrec, N. Le Scouarnec and G. Straub, Repairing Multiple Failures with Coordinated and Adaptive Regenerating Codes, Netcod, Jul, 2011.

• K. W. Shum and Y. Hu, Existence of Minimum-Repair-Bandwidth Cooperative Regenerating Codes, Netcod, Jul, 2011.

• K. W. Shum and Y. Hu, Exact Minimum-Repair-Bandwidth Cooperative Regenerating Codes for Distributed Storage Systems, ISIT, Aug, 2011.
