DStore: An Easy-to-Manage Persistent State Store

Andy Huang and Armando Fox

Stanford University

ROC Retreat – June 2004 © 2004 Andy Huang

Outline

Project overview

Consistency guarantees

Failure detection

Benchmarks

Next steps and bigger picture


Background: Scalable CHTs

[Diagram labels: Frontends, App Servers, LAN, DBs, LAN, Cluster hash tables (CHTs)]

Single-key-lookup data

• Yahoo! user profiles

• Amazon catalog metadata

Underlying storage layer

• Inktomi: wordID → docID list; docID → document metadata

• DDS/Ninja: atomic compare-and-swap


DStore: An easy-to-manage CHT

Challenges

• Capacity planning: high scaling costs necessitate accurate load prediction

• Fast detection is at odds with accurate detection

Cheap recovery

Predictably fast, with a predictably small impact on availability/performance

Benefits

• Our online repartitioning algorithm lowers scaling cost

• Reactive scaling adjusts capacity to match current load

• Lowers the cost of acting on false positives

• Effective failure detection is not contingent on accuracy

Manage like stateless frontends


Cheap recovery: Principles and costs

Techniques

Single-phase writes

• No locking and transactional logging

Quorums

• No recovery code to freeze writes & copy missed updates

• Write: send to all, wait for majority

• Read: read from majority

[Diagram label: dlib]

Costs

• Sacrifice some consistency: well-defined guarantees that provide consistent ordering

• Higher replication factor: 2N+1 bricks to tolerate N failures (vs. N+1 in ROWA)

Trade storage and consistency for cheap recovery
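The quorum rule on this slide (write: send to all, wait for majority; read: read from majority) can be illustrated with a small in-memory sketch. This is not DStore's implementation: the `Brick` class, the `quorum_write` and `quorum_read` functions, and the use of integer timestamps as versions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Brick:
    """Hypothetical in-memory storage node; up=False simulates a failed brick."""
    up: bool = True
    data: Dict[str, Tuple[int, str]] = field(default_factory=dict)  # key -> (timestamp, value)

    def put(self, key: str, ts: int, value: str) -> bool:
        if not self.up:
            return False  # no acknowledgement from a failed brick
        # Keep only the newest timestamped value.
        if key not in self.data or self.data[key][0] < ts:
            self.data[key] = (ts, value)
        return True

    def get(self, key: str) -> Optional[Tuple[int, str]]:
        # None means "no reply"; (0, "") stands for an absent key.
        if not self.up:
            return None
        return self.data.get(key, (0, ""))


def majority(bricks: List[Brick]) -> int:
    return len(bricks) // 2 + 1


def quorum_write(bricks: List[Brick], key: str, ts: int, value: str) -> bool:
    """Send the write to all bricks; succeed if a majority acknowledged it."""
    acks = sum(1 for b in bricks if b.put(key, ts, value))
    return acks >= majority(bricks)


def quorum_read(bricks: List[Brick], key: str) -> Optional[str]:
    """Collect replies from a majority and return the newest value seen."""
    replies = []
    for b in bricks:
        reply = b.get(key)
        if reply is not None:
            replies.append(reply)
        if len(replies) == majority(bricks):
            break
    if len(replies) < majority(bricks):
        return None  # not enough live bricks to form a quorum
    return max(replies, key=lambda r: r[0])[1]
```

With 2N+1 = 3 bricks, one brick (N = 1) can be down during a write, and a later majority read still overlaps the write's majority in at least one brick, so the brick that missed the update needs no special catch-up step before serving reads again.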


Nothing new under the sun, but…

Technique: CHT

• Prior work: Scalable performance

• DStore: Ease of management

Technique: Quorums

• Prior work: Availability during network partitions and Byzantine faults

• DStore: Availability during failures and recovery

Technique: Relaxed consistency

• Availability and performance while nodes are unavailable

Result

• Prior work: High availability and performance (end goal)

• DStore: Cheap recovery (but that’s just the start…)


Cheap recovery simplifies state management

Challenge: Failure detection

• Prior work: Difficult to make fast and accurate

• DStore: Effective even if it is not highly accurate

Challenge: Online repartitioning

• Prior work: Relatively new area [Aqueduct]

• DStore: Duration and impact are predictably small

Challenge: Capacity planning

• Prior work: Predict future load

• DStore: Scale reactively based on current load

Challenge: Data reconstruction

• Prior work: [RAID]

• DStore: [Future work]

Result

• Prior work: State management is costly (administration- and availability-wise)

• DStore: Manage state with techniques used for stateless frontends


Outline

Project overview

Benchmarks

Next steps and bigger picture



Usage model:

1. A client issues a request

2. Request forwarded to a random Dlib

3. Dlib issues quorum r/w on bricks

• Assumption: Clients share data, but otherwise act independently

[Diagram labels: c, dlib]

Guarantee: For a key k, DStore enforces a global order of operations that is consistent with the order seen by individual clients.

C1 issues w1(k, vnew) to replace current hash table entry (k, vold)

• w1 returns SUCCESS: subsequent reads return vnew

• w1 returns FAIL: subsequent reads return vold

• w1 returns UNKNOWN (due to Dlib failure): two cases


Case 1: Another user U2 performs a read

[Timeline diagram: users U1 and U2; bricks B1, B2, B3; initial entry (k1, vold); operations w1(k1, vnew), r1(k1), w2(k1, vnew), r2(k1); returned values vold and vnew]

• Dlib failure can cause a partial write, violating the quorum property

• If timestamps differ, read-repair restores the majority invariant (delayed commit)

U2's r(k1) returns:

• vold: no user has read vnew

• vnew: no user will later read vold
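Read-repair as described in Case 1 can be sketched as follows. The sketch assumes each brick is a plain dictionary holding a (timestamp, value) pair per key, and that the caller names which majority of bricks the read reaches; `read_with_repair` and its `contact` parameter are hypothetical names, not DStore's API.

```python
from typing import Dict, List, Tuple

# Hypothetical brick: maps key -> (timestamp, value).
Brick = Dict[str, Tuple[int, str]]


def read_with_repair(bricks: List[Brick], key: str, contact: List[int]) -> str:
    """Read `key` from the bricks at indices `contact` (must be a majority).

    If the timestamps returned differ, the newest value is written back to
    the stale bricks that were read, which acts as the delayed commit of a
    partial write.
    """
    if len(contact) < len(bricks) // 2 + 1:
        raise ValueError("a read must reach a majority of bricks")
    replies = [bricks[i][key] for i in contact]
    newest = max(replies, key=lambda r: r[0])
    if any(r[0] != newest[0] for r in replies):
        for i in contact:
            if bricks[i][key][0] < newest[0]:
                bricks[i][key] = newest  # repair the stale replica
    return newest[1]
```

For example, if a partial write left vnew on B1 only, a read that reaches B2 and B3 returns vold. The first read that reaches B1 returns vnew and copies it to the other brick it contacted, after which a majority holds vnew and every later majority read returns vnew.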


Case 2: U1 performs a read

[Timeline diagram: users U1 and U2; bricks B1, B2, B3; initial entry (k1, vold); operations w1(k1, vnew), r1(k1), w2(k1, vnew); returned value vnew]

• A write-in-progress cookie can be used to detect partial writes and commit/abort on the next read

U1's r(k1): the write is immediately committed or aborted, so all future readers see either vold or vnew
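The slide does not say how the write-in-progress cookie is stored or checked, so the following is only one possible reading, under loud assumptions: the cookie is a (key, timestamp, value) triple kept by the writer until its write resolves, all bricks are reachable at read time, and a write is committed if any brick holds its timestamp and aborted otherwise. `resolve_on_read` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple

Brick = Dict[str, Tuple[int, str]]  # hypothetical brick: key -> (timestamp, value)
Cookie = Tuple[str, int, str]       # (key, timestamp, value) of an unresolved write


def resolve_on_read(bricks: List[Brick], cookie: Optional[Cookie], key: str) -> str:
    """Read `key`, first committing or aborting the caller's unresolved write.

    Assumption: every brick is reachable, so "no brick has the timestamp"
    really means the write reached nobody.
    """
    if cookie is not None and cookie[0] == key:
        _, ts, value = cookie
        if any(b[key][0] == ts for b in bricks):
            # A partial write exists: commit it by copying it to stale bricks.
            for b in bricks:
                if b[key][0] < ts:
                    b[key] = (ts, value)
        # Otherwise the write never landed; it is aborted by dropping the cookie.
    quorum = len(bricks) // 2 + 1
    newest = max((b[key] for b in bricks[:quorum]), key=lambda r: r[0])
    return newest[1]
```

After the call the caller would discard the cookie. Either outcome leaves the bricks agreeing on a single value, which matches the slide's claim that all future readers see either vold or vnew.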



C1 issues w1(k, vnew) to replace current hash table entry (k, vold)

• w1 returns SUCCESS: subsequent reads return vnew

• w1 returns FAIL: subsequent reads return vold

• w1 returns UNKNOWN (due to Dlib failure):

  • U1 reads: w1 is immediately committed or aborted

  • U2 reads: if vold is returned, no user has read vnew; if vnew is returned, no user will later read vold


Two-phase commit vs. single-phase writes

Property: Consistency

• 2-phase commit: Sequential consistency

• Single-phase writes: Consistent ordering

Property: Availability

• 2-phase commit: Locking may cause request to block during failures

• Single-phase writes: No locking

Property: Performance

• 2-phase commit: 2 synchronous log writes, 2 roundtrips

• Single-phase writes: 1 synchronous update, 1 roundtrip

Property: Other costs

• 2-phase commit: None

• Single-phase writes: Read-repair (spreads out the cost of 2-PC to make common case faster); write-in-progress cookie (spreads out the responsibility of 2-PC)

Property: Recovery

• 2-phase commit: Read log to complete in-progress transactions

• Single-phase writes: No special-case recovery


Recovery behavior

[Figure: GET req/sec, PUT req/sec, and repairs/sec vs. time (0 to 30 minutes), with the recovery period marked]

• Run at 100% capacity

• Typically, run at 60-70% max utilization

Predictably fast and small impact


Application-generic failure detection

[Diagram labels: operating statistics (CPU load, requests processed, etc.); beacon listener; failure detection techniques (median absolute deviation, Tarzan algorithm); anomalies > threshold; reboot]

Simple detection techniques “work” because the resolution mechanism is cheap
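As an illustration of the median-absolute-deviation technique named on this slide, the sketch below flags bricks whose operating statistic lies far from the median of their peers. The function name `mad_anomalies`, the default threshold of 3.0, and the sample statistics are assumptions for illustration; the slide gives no threshold value, and the Tarzan algorithm is not shown.

```python
from statistics import median
from typing import Dict, List


def mad_anomalies(stats: Dict[str, float], threshold: float = 3.0) -> List[str]:
    """Return the bricks whose statistic lies more than `threshold` median
    absolute deviations (MADs) from the median across all bricks."""
    values = list(stats.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the bricks report exactly the median;
        # flag any brick that differs from it.
        return [name for name, v in stats.items() if v != med]
    return [name for name, v in stats.items() if abs(v - med) / mad > threshold]


# Hypothetical requests-processed counts reported by five bricks.
requests_processed = {"b1": 100, "b2": 104, "b3": 98, "b4": 101, "b5": 20}
suspects = mad_anomalies(requests_processed)  # bricks to reboot
```

Here the median is 100 and the MAD is 2, so only `b5`, at 40 MADs from the median, exceeds the threshold. Because the response is a cheap reboot, an occasional false positive from a detector this simple costs little.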


Failure detection and repartitioning behavior

[Figure, "Aggressive failure detection": GET req/sec, PUT req/sec, and repairs/sec vs. time (0 to 15 minutes)]

[Figure, "Online repartitioning": GET req/sec, PUT req/sec, and repairs/sec vs. time (0 to 40 minutes), with a # bricks axis (3, 4, 5, 6)]

[Figure annotation: Fail-stutter]

• Low cost of acting on false positives

• Low scaling cost


Bigger picture: What is “self-managing”?

Indicator: a sign of system health

• Brick performance

• System load

• Disk failures

Monitoring: tests for potential problems

• Simple detection mechanisms & policies

Treatment: low-impact resolution mechanism

• reboot

• repartition

• reconstruction

Key: low-cost mechanisms

Constant “recovery”
