Object Storage on CRAQ: High-Throughput Chain Replication for Read-Mostly Workloads
Jeff Terrace, Michael J. Freedman

Posted on 21-Jan-2016

DESCRIPTION

Object Storage on CRAQ: high-throughput chain replication for read-mostly workloads. Jeff Terrace, Michael J. Freedman. Covers the shift from relational databases to object storage (put/get) systems such as Dynamo, PNUTS, CouchDB, MemcacheDB, and Cassandra, and the trade-offs among speed, scalability, availability, and throughput.

TRANSCRIPT

Object Storage on CRAQ

High-throughput chain replication for read-mostly workloads

Jeff Terrace, Michael J. Freedman

Data Storage Revolution

• Relational Databases

• Object Storage (put/get)
  – Dynamo
  – PNUTS
  – CouchDB
  – MemcacheDB
  – Cassandra

Speed, Scalability, Availability, Throughput

No Complexity

Eventual Consistency

[Figure: eventual consistency. A manager directs a write request to one replica while read requests hit other replicas, which may return either the old value (A) or the new value (B).]

Eventual Consistency

• Writes ordered after commit

• Reads can be out-of-order or stale

• Easy to scale, high throughput

• Difficult application programming model

Traditional Solution to Consistency

[Figure: a manager coordinates a write request across all replicas using two-phase commit.]

Two-Phase Commit:

1. Prepare
2. Vote: Yes
3. Commit
4. Ack
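The four-step flow above can be sketched in a few lines of Python. The `Manager` and `Replica` classes are hypothetical illustrations of the message pattern, not CRAQ's (or any real system's) API:

```python
# Minimal two-phase commit sketch (hypothetical classes for illustration).
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None
        self.staged = None

    def prepare(self, value):          # Phase 1: stage the write...
        self.staged = value
        return "yes"                   # ...and vote yes

    def commit(self):                  # Phase 2: apply the staged write
        self.value = self.staged
        self.staged = None
        return "ack"                   # acknowledge the commit

class Manager:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, value):
        # Steps 1-2: prepare everywhere and collect votes
        if all(r.prepare(value) == "yes" for r in self.replicas):
            # Steps 3-4: commit everywhere and collect acks
            return all(r.commit() == "ack" for r in self.replicas)
        return False

replicas = [Replica(f"r{i}") for i in range(3)]
mgr = Manager(replicas)
assert mgr.write("v1")
assert all(r.value == "v1" for r in replicas)
```

The cost the next slide complains about is visible here: every write requires two round trips to every replica before it can be acknowledged.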

Strong Consistency

• Reads and Writes strictly ordered

• Easy programming

• Expensive implementation

• Doesn’t scale well

Our Goal

• Easy programming

• Easy to scale, high throughput

Chain Replication

[Figure: chain replication. The manager arranges replicas in a chain; write requests (W1, W2) enter at the HEAD and propagate down, while read requests (R1, R2, R3) are served only by the TAIL.]

van Renesse & Schneider (OSDI 2004)

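The chain-replication flow (writes enter at the head and propagate to the tail; reads are served only by the tail) can be sketched as follows. The `Node` class and `make_chain` helper are hypothetical, in-process stand-ins for the networked protocol:

```python
# Chain replication sketch: a write flows head -> tail; reads hit the tail.
class Node:
    def __init__(self):
        self.value = None
        self.next = None               # successor in the chain (None at tail)

    def write(self, value):
        self.value = value
        if self.next:                  # forward the write toward the tail
            self.next.write(value)

def make_chain(n):
    nodes = [Node() for _ in range(n)]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    return nodes[0], nodes[-1]         # (head, tail)

head, tail = make_chain(3)
head.write("v1")                       # client sends the write to the HEAD
assert tail.value == "v1"              # client reads from the TAIL
```

A write is safe to acknowledge once it reaches the tail, since every node ahead of the tail has already applied it; that is why reading only at the tail gives strong consistency, and also why read throughput is limited to one node.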

Chain Replication

• Strong consistency

• Simple replication

• Increases write throughput

• Low read throughput

• Can we increase throughput?

• Insight:
  – Most applications are read-heavy (100:1)

CRAQ

• Two states per object – clean and dirty

[Figure: a CRAQ chain (HEAD, replicas, TAIL) with every node holding clean version V1; read requests can be served by any node.]

CRAQ

• Two states per object – clean and dirty

• If latest version is clean, return value

• If dirty, contact tail for latest version number

[Figure: a write of V2 propagates down the chain, leaving nodes dirty with both V1 and V2. A read at a dirty node queries the TAIL for the latest committed version number and returns the matching value; once V2 reaches the TAIL it commits, and subsequent reads return V2.]
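The clean/dirty read rule above can be modeled in a short sketch. `CraqNode` is a hypothetical in-process model (the paper's implementation uses RPC between nodes); it keeps a small version set per object, serves clean reads locally, and resolves dirty reads by asking the tail for the committed version number:

```python
# CRAQ read sketch: clean objects are served locally; dirty objects
# require one version query to the tail. Hypothetical model, not the
# paper's RPC implementation.
class CraqNode:
    def __init__(self, tail=None):
        self.tail = tail or self           # the tail answers version queries
        self.versions = {}                 # version number -> value
        self.dirty = False

    def apply_write(self, ver, value):     # a write propagating down the chain
        self.versions[ver] = value
        self.dirty = True                  # newest version not yet committed

    def ack(self, ver):                    # the tail's ack propagating back up
        self.versions = {ver: self.versions[ver]}
        self.dirty = False                 # older versions can be discarded

    def read(self):
        if not self.dirty:                 # clean: answer from local state
            return next(iter(self.versions.values()))
        ver = max(self.tail.versions)      # dirty: ask tail which version committed
        return self.versions[ver]

tail = CraqNode()
mid = CraqNode(tail)
for node in (tail, mid):
    node.apply_write(1, "a"); node.ack(1)  # object starts clean at V1

mid.apply_write(2, "b")       # write in flight: mid is dirty, tail still at V1
assert mid.read() == "a"      # dirty read resolves to the tail's committed V1

tail.apply_write(2, "b"); tail.ack(2)      # V2 reaches the tail and commits
assert mid.read() == "b"      # dirty read now resolves to V2
mid.ack(2)                    # ack propagates; mid becomes clean again
assert mid.read() == "b"
```

Note the key property: even a dirty read only ships a version number over the network, not the object data, so most read bandwidth stays local to whichever replica the client contacted.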

Multicast Optimizations

• Each chain forms group

• Tail multicasts ACKs

[Figure: the TAIL multicasts its ACK for V2 to the whole chain instead of passing it back node by node.]

Multicast Optimizations

• Each chain forms group

• Tail multicasts ACKs

• Head multicasts write data

[Figure: the HEAD multicasts the V3 write data to the group, then only a small metadata message propagates down the chain.]

CRAQ Benefits

• From Chain Replication
  – Strong consistency
  – Simple replication
  – Increased write throughput

• Additional contributions
  – Read throughput scales: Chain Replication with Apportioned Queries
  – Supports eventual consistency

High Diversity

• Many data storage systems assume locality
  – Well connected, low latency

• Real large applications are geo-replicated
  – To provide low latency
  – For fault tolerance


Multi-Datacenter CRAQ

[Figure: a single CRAQ chain spanning three datacenters (DC1, DC2, DC3), with the HEAD in one datacenter, replicas in each, and the TAIL in another.]

Multi-Datacenter CRAQ

[Figure: the same multi-datacenter chain, with clients in each datacenter served by nearby replicas.]

Motivation

1. Popular vs. scarce objects

2. Subset relevance

3. Datacenter diversity

4. Write locality

Solution

1. Specify chain size

2. List datacenters
   – dc1, dc2, … dcN

3. Separate sizes
   – dc1, chain_size1, …

4. Specify master
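The four placement options above amount to a small per-object chain configuration. A sketch of what such a configuration might look like (the field names here are illustrative assumptions, not CRAQ's exact schema):

```python
# Hypothetical per-object chain configuration mirroring the four
# options on the slide (field names are illustrative, not CRAQ's schema).
config = {
    "chain_size": 3,                          # 1. overall chain size
    "datacenters": ["dc1", "dc2", "dc3"],     # 2. ordered datacenter list
    "dc_chain_size": {"dc1": 2, "dc2": 1},    # 3. per-datacenter sizes
    "master": "dc1",                          # 4. datacenter hosting the head
}

# Basic sanity checks a configuration loader might perform.
assert config["master"] in config["datacenters"]
assert sum(config["dc_chain_size"].values()) <= config["chain_size"]
```

Separating per-datacenter sizes from the datacenter list lets popular objects keep many local replicas while scarce ones stay small, matching the "popular vs. scarce" motivation above.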

Chain Configuration

Master Datacenter

[Figure: a chain rooted in a master datacenter (DC1). Writers in DC2 and DC3 forward writes to the HEAD in DC1, where the TAIL commits them; replicas in DC2 and DC3 serve local reads.]

Implementation

• Approximately 3,000 lines of C++

• Uses Tame extensions to SFS asynchronous I/O and RPC libraries

• Network operations use Sun RPC interfaces

• Uses Yahoo’s ZooKeeper for coordination

Coordination Using ZooKeeper

• Stores chain metadata

• Monitors/notifies about node membership

[Figure: CRAQ nodes in three datacenters (DC1, DC2, DC3) coordinating through a replicated ZooKeeper ensemble.]

Evaluation

• Does CRAQ scale vs. CR?

• How does write rate impact performance?

• Can CRAQ recover from failures?

• How does the WAN affect CRAQ?

• Tests use Emulab network emulation testbed

Read Throughput as Writes Increase

[Graph: read throughput as write rate increases, with annotations marking the 1x, 3x, and 7x levels.]

Failure Recovery (Read Throughput)

Failure Recovery (Latency)


Geo-replicated Read Latency

If Single Object Put/Get Insufficient

• Test-and-Set, Append, Increment
  – Trivial to implement
  – Head alone can evaluate

• Multiple-object transactions in the same chain
  – Can still be performed easily
  – Head alone can evaluate

• Multiple chains
  – An agreement protocol (2PC) can be used
  – Only the heads of the chains need to participate
  – Degrades performance, so use carefully!
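The reason single-chain atomic operations are trivial is that all writes for an object already serialize through the chain's head, so the head can evaluate a condition locally before injecting the write. A hypothetical sketch of head-evaluated test-and-set (class and field names are illustrative):

```python
# Sketch of head-evaluated test-and-set: the head checks the expected
# version before accepting the write. Hypothetical model of the idea.
class Head:
    def __init__(self, value, version=1):
        self.value = value
        self.version = version

    def test_and_set(self, expected_version, new_value):
        if self.version != expected_version:
            return False                  # lost the race; caller retries
        self.value = new_value
        self.version += 1                 # would then propagate down the chain
        return True

h = Head("a")
assert h.test_and_set(1, "b")       # succeeds: version matched
assert not h.test_and_set(1, "c")   # stale expected version: rejected
assert h.value == "b"
```

Append and increment work the same way: the head computes the new value from its latest (possibly dirty) version, so no cross-node coordination is needed until the write propagates as usual.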

Summary

• CRAQ contributions
  – Challenges the trade-off of consistency vs. throughput

• Provides strong consistency

• Throughput scales linearly for read-mostly

• Support for wide-area deployments of chains

• Provides atomic operations and transactions

Thank You

Questions?
