
Page 1: Rim Moussa Rim.Moussa@dauphine.fr  ceria.dauphine.fr/rim/rim.html

Contribution to the Design & Implementation

of the Highly Available Scalable and

Distributed Data Structure: LH*RS

Rim Moussa

Rim.Moussa@dauphine.fr

http://ceria.dauphine.fr/rim/rim.html

Thesis Presentation in Computer Science *Distributed Databases

Thesis Supervisor: Pr. Witold Litwin

Examiners: Pr. Thomas J.E. Schwarz

Pr. Toré Risch

Jury President: Pr. Gérard Lévy

Paris Dauphine University

CERIA Lab. * 4 October 2004


04 Oct. 04 * Présentation de Thèse R. Moussa, U. Paris Dauphine 2

Outline

1. Issue

2. State of the Art

3. LH*RS Scheme

4. LH*RS Manager

5. Experimentation

6. LH*RS File Creation

7. Bucket Recovery

8. Parity Bucket Creation

9. Conclusion & Future Work


Facts …

Volume of information grows by 30% per year

Technology

Network infrastructure

>> Gilder's Law: bandwidth triples every year.

Evolution of PC storage & computing capacities

>> Moore's Law: these capacities double every 18 months.

Bottleneck of disk accesses & CPUs

Need for distributed data storage systems: SDDSs (LH*, RP*, …) for high throughput


Facts …

Network

Frequent & costly failures

>> Statistic published by Contingency Planning Research in 1996: for a brokerage application, one hour of service interruption costs $6.45 million.

Need for distributed & highly available data storage systems

Multicomputers

>> Modular architecture

>> Good price/performance tradeoff


State of the Art

Data Replication

(+) Good response time; mirrors are functional

(-) High storage overhead (n if n replicas)

Parity Calculus

Criteria to evaluate erasure-resilient codes:

Encoding rate (parity volume / data volume)

Update penalty (parity volumes)

Group size used for data reconstruction

Encoding & decoding complexity

Recovery capabilities


Parity Schemes

1-Available Schemes

XOR parity calculus: RAID technology (levels 3, 4, 5, …) [PGK88], SDDS LH*g [L96], …

k-Available Schemes

Binary linear codes: [H94]. Tolerate max. 3 failures

Array codes: EVENODD [B94], X-code [XB99], RDP [C+04]. Tolerate max. 2 failures

Reed-Solomon codes: IDA [R89], RAID X [W91], FEC [B95], Tutorial [P97], LH*RS [LS00, ML02, MS04, LMS04]. Tolerate k failures (k > 3)


Outline…

1. Issue

2. State of the Art

3. LH*RS Scheme

LH*RS?

SDDSs?

Reed Solomon Codes?

Encoding/ Decoding Optimizations

4. LH*RS Manager

5. Experimentation


LH*RS?

LH*: Scalable & Distributed Data Structure

Distribution using linear hashing (LH*LH [KLR96]), LH*LH manager [B00]: scalability & high throughput

Parity calculus using Reed-Solomon codes [RS60]: high availability

LH*RS [LS00]


SDDSs Principles

(1) Dynamic File Growth

[Figure: clients, network, data buckets, and coordinator. Labels: "Insertions", "Overloaded", "You Split", "Record Transfer".]


SDDSs Principles (2)

(2) No Centralized Directory Access

[Figure: a client holding a file image, the network, and the data buckets. Labels: "Query", "Query Forward", "Image Adjustment Message".]


Reed-Solomon Codes

Encoding

From m data symbols, calculus of n − m parity symbols (codeword of n symbols)

Data representation: Galois Field

Fields with finite size: q

Closure property: addition, subtraction, multiplication, division.

In GF(2^w):

(1) Addition (XOR)

(2) Multiplication (tables: gflog and antigflog)

e1 * e2 = antigflog[ gflog[e1] + gflog[e2] ]
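The table-based multiplication above can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes GF(2^8), the primitive polynomial 0x11D, and the generator element 2, none of which are stated on the slide. The names `build_gf_tables`, `gf_add` and `gf_mul` are hypothetical.

```python
def build_gf_tables(w=8, prim_poly=0x11D):
    """Build gflog / antigflog for GF(2^w) (assumed polynomial, generator 2)."""
    order = (1 << w) - 1              # number of non-zero field elements
    gflog = [0] * (order + 1)         # gflog[0] is unused: log(0) is undefined
    antigflog = [0] * (2 * order)     # doubled so a sum of two logs needs no modulo
    x = 1
    for i in range(order):
        antigflog[i] = antigflog[i + order] = x
        gflog[x] = i
        x <<= 1                       # multiply by the generator 2
        if x > order:                 # overflow past w bits: reduce by the polynomial
            x ^= prim_poly
    return gflog, antigflog


GFLOG, ANTIGFLOG = build_gf_tables()


def gf_add(e1, e2):
    """Addition (and subtraction) in GF(2^w) is a bitwise XOR."""
    return e1 ^ e2


def gf_mul(e1, e2):
    """e1 * e2 = antigflog[gflog[e1] + gflog[e2]], with zero handled apart."""
    if e1 == 0 or e2 == 0:
        return 0
    return ANTIGFLOG[GFLOG[e1] + GFLOG[e2]]
```

Zero has no logarithm, so it is tested explicitly; the doubled antilog table is one common way to avoid a modulo on the summed logarithms.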


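RS Encoding

\[
(S_1, S_2, S_3, \dots, S_i, \dots, S_m) \cdot
\left(\begin{array}{ccccc|ccccc}
1 & 0 & 0 & \cdots & 0 & C_{1,1} & \cdots & C_{1,j} & \cdots & C_{1,n-m} \\
0 & 1 & 0 & \cdots & 0 & C_{2,1} & \cdots & C_{2,j} & \cdots & C_{2,n-m} \\
0 & 0 & 1 & \cdots & 0 & C_{3,1} & \cdots & C_{3,j} & \cdots & C_{3,n-m} \\
\vdots & & & \ddots & \vdots & \vdots & & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 1 & C_{m,1} & \cdots & C_{m,j} & \cdots & C_{m,n-m}
\end{array}\right)
= (S_1, \dots, S_m, P_1, P_2, \dots, P_j, \dots, P_{n-m})
\]

The generator matrix is (I_m | P): I_m is the m × m identity and P is the m × (n−m) parity matrix.

(1) Systematic encoding: matrix (I_m | P)

(2) Any m columns are linearly independent

\[
P_j = (S_1 \cdot C_{1,j}) \oplus (S_2 \cdot C_{2,j}) \oplus \dots \oplus (S_m \cdot C_{m,j})
\]

Each parity symbol costs m GF multiplications and m−1 GF XORs.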
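As a hedged sketch of the parity computation on this slide, the function below computes each P_j as the XOR of the products S_i * C_i,j. It assumes GF(2^8) with polynomial 0x11D (an assumption, the slide does not fix the field), and the small parity matrix in the usage note is an illustrative placeholder, not the matrix used in LH*RS. All names are hypothetical.

```python
def build_gf_tables(w=8, prim_poly=0x11D):
    """gflog / antigflog tables for GF(2^w); polynomial and generator 2 are assumed."""
    order = (1 << w) - 1
    gflog, antigflog = [0] * (order + 1), [0] * (2 * order)
    x = 1
    for i in range(order):
        antigflog[i] = antigflog[i + order] = x
        gflog[x] = i
        x <<= 1
        if x > order:
            x ^= prim_poly
    return gflog, antigflog


GFLOG, ANTIGFLOG = build_gf_tables()


def gf_mul(e1, e2):
    if e1 == 0 or e2 == 0:
        return 0
    return ANTIGFLOG[GFLOG[e1] + GFLOG[e2]]


def rs_encode(data, parity_matrix):
    """data: m symbols; parity_matrix: m rows of n-m coefficients.

    Returns the n-m parity symbols, each built with m multiplications
    and m-1 XORs (the first XOR is against the initial zero).
    """
    n_parity = len(parity_matrix[0])
    parity = [0] * n_parity
    for j in range(n_parity):
        for s_i, row in zip(data, parity_matrix):
            parity[j] ^= gf_mul(s_i, row[j])
    return parity
```

For example, with the placeholder matrix `[[1, 1], [1, 2], [1, 3], [1, 4]]` and data `[5, 6, 7, 8]`, the first parity symbol is the plain XOR of the data because its column holds only 1s.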


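RS Decoding: Optimized Decoding

Codeword: (S_1, S_2, S_3, S_4, …, S_m, P_1, P_2, P_3, …, P_{n−m}), produced by the generator matrix

\[
\left(\begin{array}{ccccc|ccccc}
1 & 0 & 0 & \cdots & 0 & C_{1,1} & C_{1,2} & C_{1,3} & \cdots & C_{1,n-m} \\
0 & 1 & 0 & \cdots & 0 & C_{2,1} & C_{2,2} & C_{2,3} & \cdots & C_{2,n-m} \\
0 & 0 & 1 & \cdots & 0 & C_{3,1} & C_{3,2} & C_{3,3} & \cdots & C_{3,n-m} \\
\vdots & & & \ddots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 1 & C_{m,1} & C_{m,2} & C_{m,3} & \cdots & C_{m,n-m}
\end{array}\right)
\]

H_m: the m columns of the generator matrix corresponding to the m OK symbols

Gauss transformation: H^{-1} = [ S_1 S_2 S_3 S_4 … S_m ] (column i of H^{-1} yields S_i)

Multiply the "m OK symbols" by the columns of H^{-1} corresponding to the lost symbols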
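The decoding steps on this slide can be sketched end to end as below. This is an illustration under stated assumptions, not the LH*RS implementation: it uses GF(2^8) with polynomial 0x11D, and it swaps in a Cauchy parity matrix (which guarantees that any m columns of (I_m | P) are independent) in place of the thesis's own matrix. All function names are hypothetical.

```python
def build_gf_tables(w=8, prim_poly=0x11D):
    order = (1 << w) - 1
    gflog, antigflog = [0] * (order + 1), [0] * (2 * order)
    x = 1
    for i in range(order):
        antigflog[i] = antigflog[i + order] = x
        gflog[x] = i
        x <<= 1
        if x > order:
            x ^= prim_poly
    return gflog, antigflog


GFLOG, ANTIGFLOG = build_gf_tables()


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else ANTIGFLOG[GFLOG[a] + GFLOG[b]]


def gf_inv(a):
    return ANTIGFLOG[255 - GFLOG[a]]          # a must be non-zero


def cauchy_parity_matrix(m, k):
    # Assumed stand-in matrix: P[i][j] = 1 / (x_i + y_j), x_i = i, y_j = m + j
    return [[gf_inv(i ^ (m + j)) for j in range(k)] for i in range(m)]


def rs_encode_codeword(data, p):
    parity = [0] * len(p[0])
    for j in range(len(parity)):
        for s, row in zip(data, p):
            parity[j] ^= gf_mul(s, row[j])
    return list(data) + parity


def gf_matrix_inverse(h):
    """Gauss-Jordan elimination over GF(2^8) on the augmented matrix [H | I]."""
    n = len(h)
    a = [row[:] + [int(r == c) for c in range(n)] for r, row in enumerate(h)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        inv_p = gf_inv(a[col][col])
        a[col] = [gf_mul(v, inv_p) for v in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                f = a[r][col]
                a[r] = [v ^ gf_mul(f, p) for v, p in zip(a[r], a[col])]
    return [row[n:] for row in a]


def rs_recover_data(codeword, lost, m, p):
    """Recover the lost data symbols from m surviving (OK) symbols."""
    n = m + len(p[0])
    ok = [pos for pos in range(n) if pos not in lost][:m]

    def gen_col(pos):                          # column `pos` of (I_m | P)
        if pos < m:
            return [int(i == pos) for i in range(m)]
        return [p[i][pos - m] for i in range(m)]

    h = [[gen_col(pos)[i] for pos in ok] for i in range(m)]
    h_inv = gf_matrix_inverse(h)
    recovered = {}
    for l in sorted(pos for pos in lost if pos < m):
        s = 0
        for c, pos in enumerate(ok):           # OK symbols times column l of H^-1
            s ^= gf_mul(codeword[pos], h_inv[c][l])
        recovered[l] = s
    return recovered
```

Only the columns of H^{-1} that correspond to lost data symbols are used in the final products, which is the saving the slide points at.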


Optimizations: Galois Field

GF(2^8): 1 symbol = 1 byte

GF(2^16): 1 symbol = 2 bytes

(+) GF(2^16) vs. GF(2^8) halves the number of symbols, and hence the number of operations in the GF.

(-) Multiplication table size

GF(2^8): 0.768 KB

GF(2^16): 393.216 KB (512 × 0.768)


Optimizations (2): Parity Matrix

1st column of '1's

Encoding of the 1st parity bucket (PB) uses XOR calculus only

Gain in encoding & decoding

1st row of '1's

Any update from the 1st data bucket (DB) is processed with XOR calculus

Performance gain of 4% (case of PB creation, m = 4)

Parity matrix (hexadecimal symbols):

0001 0001 0001 …
0001 eb9b 2284 …
0001 2284 9e74 …
0001 9e44 d7f1 …
…    …    …    …


Optimizations (3): GF Multiplication

Goal: reduce GF multiplication complexity

e1 * e2 = antigflog[ gflog[e1] + gflog[e2] ]

Encoding

Log pre-calculus of the coefficients of the P matrix

Improvement of 3.5%

Parity matrix after log pre-calculus (hexadecimal):

0000 0000 0000 …
0000 5ab5 e267 …
0000 e267 0dce …
0000 784d 2b66 …
…    …    …    …

Decoding

Log pre-calculus of the coefficients of the H^{-1} matrix and of the OK symbols vector

Improvement of 4% to 8%, depending on the number of buckets to recover
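A minimal sketch of the log pre-calculus idea: the logarithms of the matrix coefficients are looked up once, so each multiplication in the encoding loop needs only the logarithm of the data symbol and one antilog lookup. It assumes GF(2^8) with polynomial 0x11D for brevity (the slides report GF(2^16) figures), and the `None` marker for zero coefficients is a choice of this sketch, not something the slide specifies.

```python
def build_gf_tables(w=8, prim_poly=0x11D):
    order = (1 << w) - 1
    gflog, antigflog = [0] * (order + 1), [0] * (2 * order)
    x = 1
    for i in range(order):
        antigflog[i] = antigflog[i + order] = x
        gflog[x] = i
        x <<= 1
        if x > order:
            x ^= prim_poly
    return gflog, antigflog


GFLOG, ANTIGFLOG = build_gf_tables()


def precompute_logs(parity_matrix):
    """Replace each coefficient by its logarithm (done once, before encoding)."""
    return [[GFLOG[c] if c else None for c in row] for row in parity_matrix]


def encode_with_log_matrix(data, log_matrix):
    """Encode using pre-computed coefficient logs."""
    parity = [0] * len(log_matrix[0])
    for s, log_row in zip(data, log_matrix):
        if s == 0:
            continue                      # zero symbol contributes nothing
        log_s = GFLOG[s]                  # one log lookup per data symbol
        for j, log_c in enumerate(log_row):
            if log_c is not None:
                parity[j] ^= ANTIGFLOG[log_s + log_c]
    return parity
```

A coefficient equal to 1 has logarithm 0, which matches the 0001 entries of the parity matrix turning into 0000 entries after pre-calculus.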


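LH*RS Parity Groups

Grouping concept: m = number of data buckets, k = number of parity buckets

Data buckets hold records (Key; Data); each record gets an insert rank r

Parity buckets hold records (Rank r; [Key-list]; Parity)

[Figure: data buckets and parity buckets of one group, with records at ranks 0, 1, 2.]

A k-available group survives the failure of k buckets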
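The record layout of a parity group can be sketched as a small data structure. This is an illustration only: it models a single parity bucket whose parity field is a plain XOR of the data fields (the case of a coefficient column of 1s), and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ParityRecord:
    rank: int            # rank r shared by the data records it protects
    keys: list           # key list: one slot per data bucket of the group
    parity: bytearray    # parity of the data fields at this rank


class XorParityBucket:
    """Sketch of one parity bucket for a group of m data buckets (XOR only)."""

    def __init__(self, m, record_size):
        self.m = m
        self.record_size = record_size
        self.records = {}

    def apply_delta(self, bucket_pos, rank, key, delta):
        """Fold a change from data bucket `bucket_pos` into parity record `rank`.

        For an insert, delta is the new data field; for an update, it is
        old XOR new; for a delete, it is the old data field with key=None.
        """
        rec = self.records.setdefault(
            rank,
            ParityRecord(rank, [None] * self.m, bytearray(self.record_size)),
        )
        rec.keys[bucket_pos] = key
        for i, b in enumerate(delta):
            rec.parity[i] ^= b
        return rec
```

With two records inserted at rank 0, XOR-ing the parity field with one of them gives back the other, which is how a single lost data record at that rank would be rebuilt in this simplified setting.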


Outline…

1. Issue

2. State of the Art

3. LH*RS Scheme

4. LH*RS Manager

Communication

Gross Architecture

5. Experimentation

6. File Creation

7. Bucket Recovery …


Communication: UDP

Individual operations (Insert, Update, Delete, Search)

Record recovery

Control messages

Performance


Communication: TCP/IP

Large buffer transfers

New parity buckets

Transfer of parity updates & records (bucket split)

Bucket recovery

Performance & reliability


Communication: Multicast

Looking for new data/parity buckets

Multipoint communication


Architecture

Enhancements to the SDDS2000 architecture:

(1) TCP/IP connection handler

TCP/IP connections are passive OPEN, RFC 793 [ISI81], TCP/IP under Win2K Server OS [MB00]

1 bucket recovery (3.125 MB): before (SDDS2000): 6.7 s; SDDS2000-TCP: 2.6 s

Improvement of 60%

(Hardware config.: 733 MHz CPU machines, 100 Mbps network)

(2) Flow Control & Message Acknowledgement (FCMA)

Principle of "sending credit & message conservation until delivery" [J88, GRS97, D01]
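One way to read the "sending credit & message conservation until delivery" principle is as a bounded window of unacknowledged messages that the sender keeps until each is acknowledged. The sketch below is an in-memory simulation under that reading; the class name, the numeric message ids, and the retransmit-everything-on-timeout policy are assumptions of this example, not details taken from the slides.

```python
from collections import OrderedDict, deque


class CreditSender:
    """Sender that never has more than `credit` unacknowledged messages."""

    def __init__(self, credit):
        self.credit = credit
        self.next_id = 0
        self.backlog = deque()          # messages waiting for sending credit
        self.unacked = OrderedDict()    # id -> payload, conserved until ACK

    def submit(self, payload):
        """Queue a message; returns the (id, payload) pairs sent right now."""
        self.backlog.append(payload)
        return self._drain()

    def _drain(self):
        sent = []
        while self.backlog and len(self.unacked) < self.credit:
            msg_id = self.next_id
            self.next_id += 1
            self.unacked[msg_id] = self.backlog.popleft()
            sent.append((msg_id, self.unacked[msg_id]))
        return sent

    def on_ack(self, msg_id):
        """An ACK frees one credit and may release backlog messages."""
        self.unacked.pop(msg_id, None)
        return self._drain()

    def on_timeout(self):
        """Messages still unacknowledged are returned for retransmission."""
        return list(self.unacked.items())
```

With a credit of 2, a third submitted message stays in the backlog until one of the first two is acknowledged.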


Architecture (2)

(3) Dynamic IP addressing structure

Before: pre-defined and static table of IP addresses

To tag new servers (data or parity) using multicast:

[Figure: the coordinator, the created buckets, the multicast group of blank data buckets, and the multicast group of blank parity buckets.]


Architecture (3)

[Figure: bucket architecture on top of the network. Ports: multicast listening port, UDP listening port, UDP sending port, TCP/IP port. Threads: multicast listening thread, multicast working thread, UDP listening thread, TCP listening thread, pool of working threads, ACK management threads. Structures: message queues; ACK structure with free zones and messages waiting for ACK (not yet acknowledged messages).]


Experimentation

Performance Evaluation

* CPU Time

* Communication Time

Experimental Environment

* 5 Machines (Pentium IV: 1.8 GHz, RAM: 512 MB)

* Ethernet Network 1 Gbps

* O.S.: Win2K Server

* Tested Configuration: 1 Client,

A group of 4 Data Buckets,

k Parity Buckets (k = 0,1,2,3).


Outline…

1. Issue

2. State of the Art

3. LH*RS Scheme

4. LH*RS Manager

5. Experimentation

6. File Creation

Parity Update

Performance

7. Bucket Recovery

8. Parity Bucket Creation


File Creation

Client operations

Propagation of data record inserts/updates/deletes to parity buckets.

Updates: send only the Δ-record.

Deletes: management of free ranks within data buckets.

Data bucket split

N1: number of remaining records

N2: number of leaving records

Parity group of the splitting data bucket: N1+N2 deletes + N1 inserts

Parity group of the new data bucket: N2 inserts
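The split cost above is simple arithmetic; the small helper below tallies the parity operations for given N1 and N2. The function name and the returned layout are hypothetical, and the reading that remaining records are deleted and re-inserted (presumably at new ranks) is an interpretation of the counts, not something the slide states.

```python
def split_parity_operations(n1, n2):
    """Parity operations caused by one data bucket split.

    n1: records remaining in the splitting bucket
    n2: records leaving for the new bucket
    """
    splitting_group = {"deletes": n1 + n2, "inserts": n1}
    new_group = {"deletes": 0, "inserts": n2}
    total = sum(splitting_group.values()) + sum(new_group.values())
    return {
        "splitting_group": splitting_group,
        "new_group": new_group,
        "total": total,           # equals 2 * (n1 + n2)
    }
```

For an even split of a full bucket of 10 000 records (N1 = N2 = 5000), this gives 15 000 operations on the parity group of the splitting bucket and 5000 on the parity group of the new bucket.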


Performances: Config.

Max bucket size = 10 000 records

File of 25 000 records

1 record = 104 bytes

No difference between GF(2^8) and GF(2^16) (we don't wait for ACKs between DBs and PBs)


Performances: Client Window = 1

[Figure: file creation time (sec) vs. number of inserted keys (0 to 25 000), for k = 0, k = 1, k = 2. Final times: 7.896 s, 9.990 s, 10.963 s.]

k = 0 to k = 1: performance degradation of 20%

k = 1 to k = 2: performance degradation of 8%


Performances: Client Window = 5

[Figure: file creation time (sec) vs. number of inserted keys (0 to 25 000), for k = 0, k = 1, k = 2. Final times: 4.349 s, 6.940 s, 7.720 s.]

k = 0 to k = 1: performance degradation of 37%

k = 1 to k = 2: performance degradation of 10%
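The degradation percentages on the two file-creation charts can be reproduced from the final times, assuming the degradation is measured relative to the slower time, (t_after − t_before) / t_after. That baseline is inferred from the numbers, not stated in the slides; the function and variable names are hypothetical.

```python
def degradation(t_before, t_after):
    """Relative slowdown, measured against the slower (after) time."""
    return (t_after - t_before) / t_after


# Final file creation times in seconds for k = 0, 1, 2, as read from the charts
slower_series = [7.896, 9.990, 10.963]
faster_series = [4.349, 6.940, 7.720]

for series in (slower_series, faster_series):
    t0, t1, t2 = series
    print(f"k=0 -> k=1: {degradation(t0, t1):.1%}   "
          f"k=1 -> k=2: {degradation(t1, t2):.1%}")
```

Truncated to whole percentages, the results agree with the figures quoted on the slides (20% and 8%, then 37% and 10%).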


Outline…

1. Issue

2. State of the Art

3. LH*RS Scheme

4. LH*RS Manager

5. Experimentation

6. File Creation

7. Bucket Recovery

Scenario

Performances

8. Parity Bucket Creation


Scenario

Failure Detection

[Figure: the coordinator sends "Are you Alive?" to the data buckets and parity buckets.]


Scenario (2)

Waiting for Responses …

[Figure: the available data buckets and parity buckets answer "OK" to the coordinator.]


Scenario (3)

Searching Spare Buckets …

[Figure: the coordinator sends "Wanna be Spare?" to the multicast group of blank data buckets.]


Scenario (4)

Waiting for Replies …

[Figure: buckets of the multicast group of blank data buckets answer "I would" to the coordinator.]

Each volunteering bucket launches UDP listening, TCP listening, and its working threads

* Waiting for confirmation

* If the time-out elapses, cancel everything


Scenario (5)

Spare Selection

[Figure: the coordinator sends "You are Hired" to the selected buckets of the multicast group of blank data buckets, which answer "Confirmed", and "Cancellation" to the others.]
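The spare-selection exchange of the last few scenario slides ("Wanna be Spare?", "I would", "You are Hired", "Confirmed", "Cancellation") can be sketched as a small in-memory state machine. Network transport and the real time-out handling are left out, and hiring the first volunteers in reply order is an assumption of this sketch; the class and function names are hypothetical.

```python
class BlankBucket:
    """A blank bucket of the multicast group, as seen during spare selection."""

    def __init__(self, name):
        self.name = name
        self.state = "blank"

    def on_spare_request(self):
        # Volunteers, then waits for the coordinator's confirmation
        self.state = "waiting_confirmation"
        return "I would"

    def on_coordinator_message(self, message):
        if self.state != "waiting_confirmation":
            return None
        if message == "You are Hired":
            self.state = "hired"
            return "Confirmed"
        self.state = "blank"            # "Cancellation": cancel everything
        return None

    def on_timeout(self):
        # No confirmation within the time-out: cancel everything
        if self.state == "waiting_confirmation":
            self.state = "blank"


def select_spares(group, needed):
    """Coordinator side: ask the group, hire `needed` volunteers, cancel the rest."""
    volunteers = [b for b in group if b.on_spare_request() == "I would"]
    if len(volunteers) < needed:
        for b in volunteers:
            b.on_coordinator_message("Cancellation")
        raise RuntimeError("not enough blank buckets volunteered")
    hired = volunteers[:needed]
    for b in hired:
        b.on_coordinator_message("You are Hired")
    for b in volunteers[needed:]:
        b.on_coordinator_message("Cancellation")
    return [b.name for b in hired]
```

With three blank buckets and two spares needed, two buckets end up hired and the third returns to the blank state, matching the two "Confirmed" and one "Cancellation" labels of the spare-selection slide.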


Scenario (6)

Recovery Manager Selection

[Figure: the coordinator sends "Recover Failed Buckets" to one of the parity buckets.]


Scenario (7)

Query Phase

[Figure: data buckets, parity buckets, recovery manager, spare buckets. The recovery manager asks the buckets participating in the recovery: "Send me records of rank in [r, r+slice-1]".]


Scenario (8)

Reconstruction Phase

Decoding phase, in parallel with the query phase

[Figure: the buckets participating in the recovery send the requested buffers to the recovery manager, which sends the recovered slices to the spare buckets.]
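The slice-by-slice recovery loop (query a range of ranks, decode it, ship the recovered slice) can be sketched as below. To stay short, the sketch recovers a single lost bucket with plain XOR decoding and models each bucket as a rank-indexed list of integers; it runs the phases sequentially, whereas the slides describe decoding in parallel with the query phase. The function name and data layout are assumptions.

```python
def recover_bucket_by_slices(surviving, bucket_size, slice_size):
    """Rebuild one lost bucket from the surviving buckets of an XOR group.

    surviving: rank-indexed lists, one per surviving bucket
               (remaining data buckets plus the XOR parity bucket)
    """
    spare = []
    for r in range(0, bucket_size, slice_size):
        hi = min(r + slice_size, bucket_size)
        # Query phase: "send me records of rank in [r, r + slice - 1]"
        buffers = [bucket[r:hi] for bucket in surviving]
        # Decoding phase: XOR the symbols of equal rank
        recovered_slice = []
        for symbols in zip(*buffers):
            value = 0
            for s in symbols:
                value ^= s
            recovered_slice.append(value)
        # Recovered slice is handed to the spare bucket
        spare.extend(recovered_slice)
    return spare
```

The slice size only changes how the work is chunked, not the result, which is consistent with the slice size being tunable to the machines' memory and CPU capacities.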


Performances: Config.

File info

File of 125 000 records

Record size = 100 bytes

Bucket size = 31 250 records, i.e. 3.125 MB

Group of 4 data buckets (m = 4), k-available with k = 1, 2, 3

Decoding

* GF(2^16)

* RS+ decoding (RS + log pre-calculus of H^{-1} and of the OK symbols vector)

Recovery per slice (adaptive to the PCs' storage & computing capacities)


Performances: 1 DB, XOR

Slice (records)   Total Time (sec)   CPU Time (sec)   Com. Time (sec)
1250              0.625              0.266            0.348
3125              0.588              0.255            0.323
6250              0.552              0.240            0.312
15625             0.562              0.255            0.302
31250             0.578              0.250            0.328

Slice (from 4% to 100% of a bucket content)

Total time is almost constant: about 0.58 sec


Performances: 1 DB, RS

Slice (records)   Total Time (sec)   CPU Time (sec)   Com. Time (sec)
1250              0.734              0.349            0.365
3125              0.688              0.359            0.323
6250              0.656              0.354            0.297
15625             0.667              0.360            0.297
31250             0.688              0.360            0.328

Slice (from 4% to 100% of a bucket content)

Total time is almost constant: about 0.67 sec


Performances: XOR vs. RS

Time to recover 1 DB with XOR: 0.58 sec

Time to recover 1 DB with RS: 0.67 sec

XOR in GF(2^16) realizes a gain of 13% in total time (and 30% in CPU time)


Performances: 2 DBs

Slice (records)   Total Time (sec)   CPU Time (sec)   Com. Time (sec)
1250              0.976              0.577            0.375
3125              0.932              0.589            0.338
6250              0.883              0.562            0.321
15625             0.875              0.562            0.281
31250             0.875              0.562            0.313

Slice (from 4% to 100% of a bucket content)

Total time is almost constant: about 0.9 sec


Performances: 3 DBs

Slice (records)   Total Time (sec)   CPU Time (sec)   Com. Time (sec)
1250              1.281              0.828            0.406
3125              1.250              0.828            0.390
6250              1.211              0.852            0.352
15625             1.188              0.823            0.361
31250             1.203              0.828            0.375

Slice (from 4% to 100% of a bucket content)

Total time is almost constant: about 1.23 sec


Performances: Summary

f         Bucket Size (MB)   Total Time (sec)   Recovery Speed (MB/sec)
1 (XOR)   3.125              0.58               5.38
1 (RS)    3.125              0.67               4.66
2         6.250              0.9                6.94
3         9.375              1.23               7.62

Time to recover f buckets < f × time to recover 1 bucket

The query phase is factorized; the extra cost (the +) is the decoding time and the time to send the recovered buffers
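The recovery speeds in the summary follow from dividing the recovered volume by the total time. The check below recomputes them from the slide's own figures, taking the bucket size as 3.125 MB; the function and variable names are hypothetical.

```python
BUCKET_MB = 3.125   # one bucket: 31 250 records of 100 bytes

# (label, number of recovered buckets f, total time in seconds)
measurements = [
    ("1 (XOR)", 1, 0.58),
    ("1 (RS)", 1, 0.67),
    ("2", 2, 0.9),
    ("3", 3, 1.23),
]


def recovery_speed(f, total_time):
    """Recovered volume (MB) divided by total recovery time (sec)."""
    return f * BUCKET_MB / total_time


for label, f, t in measurements:
    print(f"f = {label}: {recovery_speed(f, t):.2f} MB/sec")

# Recovering f buckets takes less than f single-bucket (RS) recoveries
single_rs = 0.67
sublinear = all(t < f * single_rs for _, f, t in measurements if f > 1)
```

The speed rises with f because the query phase is shared: the same surviving buffers are fetched once, whatever the number of buckets being rebuilt.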


Performances: GF(2^8)

XOR in GF(2^8) improves decoding performance by 60% compared to RS in GF(2^8).

RS/RS+ decoding in GF(2^16) realizes a gain of 50% compared to decoding in GF(2^8).


Outline…

1. Issue

2. State of the Art

3. LH*RS Scheme

4. LH*RS Manager

5. Experimentation

6. File Creation

7. Bucket Recovery

8. Parity Bucket Creation

Scenario

Performances


Scenario

Searching for a new Parity Bucket

[Figure: the coordinator sends "Wanna Join Group g?" to the multicast group of blank parity buckets.]


Scenario (2)

Waiting for Replies …

[Figure: buckets of the multicast group of blank parity buckets answer "I Would" to the coordinator.]

Each volunteering bucket launches UDP listening, TCP listening, and its working threads

* Waiting for confirmation

* If the time-out elapses, cancel everything


Scenario (3)

New Parity Bucket Selection

[Figure: the coordinator sends "You are Hired" to one bucket of the multicast group of blank parity buckets, which answers "Confirmed", and "Cancellation" to the others.]


Scenario (4)

Auto-creation: Query Phase

[Figure: the new parity bucket asks the group of data buckets: "Send me your contents!"]


Scenario (5)

Auto-creation: Encoding Phase

[Figure: the group of data buckets sends the requested buffers to the new parity bucket, which processes them.]


Performances: Config.

Max bucket size: 5000 to 50 000 records

Bucket load factor: 62.5%

Record size: 100 bytes

Group of 4 data buckets

Encoding

GF(2^16)

RS++ (log pre-calculus & row of '1's: XOR encoding to process the 1st DB buffer)


Performances: XOR

Bucket Size (records)   Total Time (sec)   CPU Time (sec)   Com. Time (sec)
5000                    0.190              0.140            0.029
10000                   0.429              0.304            0.066
25000                   1.007              0.738            0.144
50000                   2.062              1.484            0.322

Same encoding rate for every bucket size

CPU time ≈ 74% of total time


Performances: RS

Bucket Size (records)   Total Time (sec)   CPU Time (sec)   Com. Time (sec)
5000                    0.193              0.149            0.035
10000                   0.446              0.328            0.059
25000                   1.053              0.766            0.153
50000                   2.103              1.531            0.322

Same encoding rate for every bucket size

CPU time ≈ 74% of total time


Performances: XOR vs. RS

For bucket size = 50 000 records:

XOR encoding time: 2.062 sec

RS encoding time: 2.103 sec

XOR realizes a performance gain of 5% in CPU time (only 2% in total time)


Performances: GF(2^8)

As in GF(2^16), CPU time ≈ 3/4 of total time

XOR in GF(2^8) improves CPU time by 22%


Performance

Wintel P4, 1.8 GHz, 1 Gbps

File creation rate: 0.33 MB/s for k = 0; 0.25 MB/s for k = 1; 0.23 MB/s for k = 2

Record insert time: 0.29 ms for k = 0; 0.33 ms for k = 1; 0.36 ms for k = 2

Bucket recovery rate: 4.66 MB/s from 1-unavailability; 6.94 MB/s from 2-unavailability; 7.62 MB/s from 3-unavailability

Record recovery time: about 1.3 ms

Key search time: individual: 0.24 ms; bulk: 0.056 ms


Conclusion

Experiments prove:

The impact on performance of the optimizations (encoding/decoding, architecture)

Good recovery performances


Future Work

Update propagation to parity buckets: reliability & performance

Reduce coordinator tasks

"Parity declustering"

Investigation of new erasure-resilient codes


References

[PGK88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of ACM SIGMOD Conf., pp. 109-116, June 1988.

[ISI81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) – Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html

[MB00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html

[J88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329, 1988.

[XB99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.

[C+04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.

[R89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.

[W91] P.E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf

[GRS97] J. C. Gomez, V. Rego, V. S. Sunderam, Efficient Multithreaded User-Space Transport for Network Computing: Design & Test of the TRAP Protocol, Journal of Parallel & Distributed Computing, 40(1), 1997.


References (2)

[B95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.

[LS00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, Proceedings of the ACM SIGMOD 2000, pp. 237-248.

[KLR96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.

[RS60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.

[P97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software: Practice & Experience, 27(9), Sept. 1997, pp. 995-1012.

[D01] A.W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.

[B00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.

+ References: http://ceria.dauphine.fr/rim/theserim.pdf


Publications

[ML02] R. Moussa, W. Litwin, Experimental Performance Analysis of LH*RS Parity Management, Carleton Scientific Records of the 4th International Workshop on Distributed Data & Structure: WDAS 2002, pp. 87-97.

[MS04] R. Moussa, T. Schwarz, Design and Implementation of LH*RS – A Highly-Available Scalable Distributed Data Structure, Carleton Scientific Records of the 6th International Workshop on Distributed Data & Structure: WDAS 2004.

[LMS04] W. Litwin, R. Moussa, T. Schwarz, Prototype Demonstration of LH*RS: A Highly Available Distributed Storage System, Proc. of VLDB 2004 (Demo Session), pp. 1289-1292.

[LMS04-a] W. Litwin, R. Moussa, T. Schwarz, LH*RS: A Highly Available Distributed Storage System, journal version submitted, under revision.


Thank You For Your Attention

Questions?