
Distributed Hash Tables: Chord and Dynamo

Costin Raiciu, Advanced Topics in Distributed Systems

18/12/2012

Motivation: file sharing

• Many users want to share files online
• If a file's location is known, downloading is easy
  – The challenge is to find who stores the file we want
• Early attempts
  – Napster (centralized), Kazaa
• Gnutella (March 2000)
  – Completely decentralized

How should we fix Gnutella’s problems?

• Decouple storage from lookup
  – Gnutella: a node only answers queries for the files it has locally
• Requirements
  – Extreme scalability: millions of nodes
  – Load balance: spread load across nodes evenly
  – Availability: must cope with node churn (nodes joining/leaving/failing)

Chord [Stoica et al., SIGCOMM 2001]

• Opens a new body of research on "Distributed Hash Tables"
  – Together with Content Addressable Networks (also SIGCOMM 2001)
• Most popular application: a Distributed Hash Table (DHT)

Chord basics

• A single fundamental operation: lookup(key)
  – Given a key, find the node responsible for that key

How do we do this?

Consistent hashing

• Assign unique m-bit identifiers to both nodes and objects (e.g. files)
  – E.g. m = 160, using SHA-1
  – Node identifier: hash of the IP address
  – Object identifier: hash of the name (sketch below)
• Split the key space across all servers
  – Not necessary to store the keys for the files you have!
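A minimal sketch of how these identifiers might be computed, assuming SHA-1 and m = 160 as on the slide; the chord_id name and the example inputs are illustrative, not part of the original material.

    import hashlib

    M = 160  # identifier length in bits; SHA-1 already produces a 160-bit digest

    def chord_id(value: str) -> int:
        """Map a string (IP address or object name) to an m-bit identifier."""
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % (2 ** M)

    node_id = chord_id("192.0.2.17")        # node identifier: hash of IP address
    key_id  = chord_id("ubuntu-12.04.iso")  # object identifier: hash of name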

Who is responsible for storing metadata relating to a given key?

Key assignment

• Identifiers are ordered on an identifier circle modulo 2^m
• Key k is assigned to the first node whose identifier is equal to or follows (the identifier of) k in the identifier space
  – This node is called the successor node of k, successor(k)
  – If identifiers are represented as a circle of numbers from 0 to 2^m − 1, then successor(k) is the first node clockwise from k (sketch below)
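A toy sketch of successor(k), under the simplifying assumption that the full node membership is known (a real Chord node only knows a small number of other nodes, as the following slides explain); the node IDs below are made up for the example.

    from bisect import bisect_left

    def successor(key_id: int, node_ids: list) -> int:
        """Return the first node ID equal to or following key_id, clockwise."""
        ring = sorted(node_ids)
        i = bisect_left(ring, key_id)
        return ring[i % len(ring)]   # wrap past 2^m - 1 back to the smallest ID

    nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
    print(successor(26, nodes))   # -> 32
    print(successor(54, nodes))   # -> 56
    print(successor(60, nodes))   # -> 1 (wraps around the circle)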

Consistent hashing example

Lookup

• Each node n maintains a routing table with (at most) m entries, called the finger table
• The i-th entry in the table at node n contains the identity of the first node that succeeds n by at least 2^(i−1) on the circle
  – n.finger[i] = successor(n + 2^(i−1)), for 1 ≤ i ≤ m (sketch below)
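A small sketch of this finger-table rule on a 6-bit ring. Computing the fingers from a global node list is a simplification (real nodes learn them through the join and stabilization protocols), and the example IDs are illustrative.

    from bisect import bisect_left

    def successor(key_id: int, node_ids: list) -> int:
        """First node ID equal to or following key_id, clockwise on the circle."""
        ring = sorted(node_ids)
        return ring[bisect_left(ring, key_id) % len(ring)]

    def finger_table(n: int, node_ids: list, m: int = 6) -> list:
        """finger[i] = successor((n + 2^(i-1)) mod 2^m) for i = 1..m."""
        return [successor((n + 2 ** (i - 1)) % (2 ** m), node_ids)
                for i in range(1, m + 1)]

    # Example: node 8 on a 6-bit ring with ten nodes.
    print(finger_table(8, [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]))
    # -> [14, 14, 14, 21, 32, 42]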

Lookup (2)

• Each node stores information about only a small number of other nodes (O(log N))

• Nodes know more about nodes closely following them on the circle than about nodes farther away

Is there enough information in the finger table to find the successor of an arbitrary key?

How should we use finger pointers to guide the lookup?

Lookup algorithm
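The pseudocode itself is not reproduced in this transcript; below is a rough Python sketch of the Chord-style lookup, in which a node either answers from its successor pointer or forwards the query to the closest preceding finger. The Node class and in_interval helper are illustrative, and failure handling is omitted.

    def in_interval(x, a, b, inclusive_right=False):
        """True if x lies in the circular interval (a, b) (or (a, b]) on the ring."""
        if a < b:
            return a < x < b or (inclusive_right and x == b)
        return x > a or x < b or (inclusive_right and x == b)

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.successor = None
            self.finger = []   # finger[i-1] points at successor(id + 2^(i-1))

        def find_successor(self, key_id):
            # If key_id falls between this node and its successor, the successor owns it.
            if in_interval(key_id, self.id, self.successor.id, inclusive_right=True):
                return self.successor
            # Otherwise forward the query to the closest finger preceding key_id.
            return self.closest_preceding_node(key_id).find_successor(key_id)

        def closest_preceding_node(self, key_id):
            for f in reversed(self.finger):
                if in_interval(f.id, self.id, key_id):
                    return f
            return self

With correct fingers, each forwarding step at least halves the remaining clockwise distance to the key, so a lookup completes in O(log N) hops with high probability.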

How many hops are required to find a key?

Node joins

• To maintain correctness, Chord maintains two invariants:
  – Each node's successor is correctly maintained
  – For every key k, successor(k) is responsible for k

Node joins: detail

• Chord uses a predecessor pointer to walk counterclockwise
  – Maintains the Chord ID and IP address of the previous node
  – Why?
• When a node n joins the network, Chord:
  – Initializes the predecessor and fingers of node n
  – Updates the fingers and predecessors of existing nodes to reflect the addition of n
  – Notifies the higher-layer software so that it can transfer state associated with keys that n is now responsible for

Stabilization: Dealing with Concurrent Joins and Failures

• In practice Chord needs to deal with nodes joining the system concurrently and with nodes that fail or leave voluntarily

• Solution: every node runs a stabilize procedure periodically
  – When n runs stabilize, it asks n's successor for the successor's predecessor p, and decides whether p should be n's successor instead
  – stabilize also notifies n's successor of n's existence, giving the successor the chance to change its predecessor to n (sketch below)
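A rough sketch of these two steps, using a minimal node representation; the stabilize/notify names mirror the paper, but the details here are illustrative (no failure handling, and in practice the procedure runs on a periodic timer).

    def in_interval(x, a, b):
        """True if x lies strictly between a and b going clockwise on the circle."""
        return a < x < b if a < b else (x > a or x < b)

    class Node:
        def __init__(self, node_id):
            self.id, self.successor, self.predecessor = node_id, self, None

    def stabilize(n):
        # Ask n's successor for its predecessor p; if p sits between n and the
        # successor, then p is a closer (better) successor for n.
        p = n.successor.predecessor
        if p is not None and in_interval(p.id, n.id, n.successor.id):
            n.successor = p
        notify(n.successor, n)  # tell the successor that n exists

    def notify(successor, n):
        # Successor side: adopt n as predecessor if n is closer than the current one.
        if successor.predecessor is None or in_interval(n.id, successor.predecessor.id, successor.id):
            successor.predecessor = n

    # Two nodes pointing at each other; one round of stabilize fills in predecessors.
    a, b = Node(8), Node(32)
    a.successor, b.successor = b, a
    stabilize(a); stabilize(b)
    print(a.predecessor.id, b.predecessor.id)   # -> 32 8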

Implementing a Distributed Hash Table over Chord

• put(k, v): look up n, the node responsible for k, and store v on n (sketch below)
• get(k): look up the node responsible for k and return the value
• How long does it take to join/leave Chord?
  – Fix: store on n and a few of its successors
  – Locally broadcast the query
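A minimal sketch of this put/get layer, assuming a lookup(id) function that returns the responsible node (for instance, the find_successor sketch earlier) and a hypothetical node interface with store(), fetch() and successors(); replicating to a few successors is the fix mentioned above.

    import hashlib

    def key_id(key: str, m: int = 160) -> int:
        """Hash an application key onto the identifier circle."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % (2 ** m)

    def put(lookup, key, value, replicas=3):
        node = lookup(key_id(key))                          # node responsible for the key
        for n in [node] + node.successors(replicas - 1):    # hypothetical successors() helper
            n.store(key, value)                             # hypothetical store()

    def get(lookup, key):
        return lookup(key_id(key)).fetch(key)               # hypothetical fetch()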

Other aspects of Distributed Hash Tables

• How do we deal with security?
  – Nodes that return wrong answers
  – Nodes that do not forward messages
  – …

Applications of Distributed Hash Tables?

• A whole body of research
  – Distributed filesystems (PAST, OceanStore)
  – Distributed search
  – None deployed. Why?
• Today:
  – Kademlia is used for "tracker-less" torrents

Amazon Dynamo [DeCandia et al., SOSP 2007]

(slides adapted from DeCandia et al.)

Context

• Want a distributed storage system to support some of Amazon's tasks:
  – best-seller lists
  – shopping carts
  – customer preferences
  – session management
  – sales rank
  – product catalog
• Traditional databases scale poorly and have poor availability

Amazon Dynamo

• Requirements
  – Scale
  – Simple: key-value
  – Highly available
  – Guarantee Service Level Agreements (SLAs)

Uses key-value store as abstraction

System Assumptions and Requirements

• Query Model
  – Read and write operations to a data item that is uniquely identified by a key
  – No schema needed
  – Small objects (< 1 MB) stored as blobs
• ACID Properties?
  – Atomicity and durability, but weaker consistency
• Efficiency
  – Commodity hardware
  – Mind the SLA!
• Other Assumptions
  – Environment is friendly (no security issues)

Amazon Request Handling: 99.9% SLAs

Design Considerations

• Sacrifice strong consistency for availability
  – Why are consistency and availability at odds?
• Optimistic replication increases availability
  – Allow disconnected operations
  – This may lead to concurrent updates to the same object: conflict
  – When to perform conflict resolution?
• Delaying writes is unacceptable (e.g. shopping cart updates)
• Solve conflicts during reads instead of writes, i.e. "always writeable"
• Who resolves conflicts?
  – App: e.g. merge shopping cart contents
  – Datastore: last write wins

Other design considerations

• Incremental scalability
• Symmetry
• Decentralization
• Heterogeneity

Partitioning Algorithm

• Dynamo uses consistent hashing

• Consistent hashing issues:
  – Load imbalance
  – Dealing with heterogeneity

• ”Virtual Nodes”: Each node can be responsible for more than one virtual node.
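A toy sketch of consistent hashing with virtual nodes: each physical node claims several positions ("tokens") on the ring, and a bigger machine simply gets more tokens. The token counts, the MD5-based positions and the helper names below are illustrative; Dynamo's actual partitioning scheme differs in its details.

    import hashlib
    from bisect import bisect_right

    def position(s: str) -> int:
        """Place a string on the ring (Dynamo hashes keys with MD5)."""
        return int.from_bytes(hashlib.md5(s.encode("utf-8")).digest(), "big")

    class Ring:
        def __init__(self, tokens_per_node):
            # tokens_per_node: {"A": 4, "C": 8, ...}; more tokens for bigger machines.
            self.ring = sorted(
                (position(f"{node}#{i}"), node)
                for node, tokens in tokens_per_node.items()
                for i in range(tokens)
            )

        def owner(self, key: str) -> str:
            positions = [pos for pos, _ in self.ring]
            i = bisect_right(positions, position(key)) % len(self.ring)
            return self.ring[i][1]

    ring = Ring({"A": 4, "B": 4, "C": 8})   # C gets roughly twice the load of A or B
    print(ring.owner("shopping-cart:42"))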

Advantages of using virtual nodes

• If a node becomes unavailable the load handled by this node is evenly dispersed across the remaining available nodes.

• When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.

• The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.

Replication

• Each data item is replicated at N hosts
  – N is specified per instance

• "Preference list": the list of nodes that store a key, i.e. its successor plus the N−1 nodes that follow it (sketch below)
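A minimal sketch of building a preference list by walking the ring clockwise from the key's position; skipping repeated physical nodes keeps the N copies on distinct machines when virtual nodes are in use. The ring layout and helper names are illustrative.

    from bisect import bisect_left

    def preference_list(key_pos, ring, n):
        """ring: sorted (position, physical_node) pairs; return N distinct physical nodes."""
        positions = [pos for pos, _ in ring]
        start = bisect_left(positions, key_pos)
        chosen = []
        for step in range(len(ring)):
            node = ring[(start + step) % len(ring)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == n:
                break
        return chosen

    # Positions 5..60 belong to three physical nodes; A and B own two tokens each.
    ring = [(5, "A"), (20, "B"), (33, "A"), (47, "C"), (60, "B")]
    print(preference_list(25, ring, n=2))   # key at position 25 -> ['A', 'C']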

Data Versioning

• A put() call may return to its caller before the update has been applied at all the replicas

• A get() call may return many versions of the same object

• Challenge: an object having distinct version sub-histories, which the system will need to reconcile in the future.

• Solution: use vector clocks to capture causality between different versions of the same object.

Vector Clock

• A vector clock is a list of (node, counter) pairs
• Every version of every object is associated with one vector clock
• If the counters in the first object's clock are less than or equal to all of the counters in the second clock, then the first is an ancestor of the second and can be forgotten (sketch below)
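A small sketch of that ancestor test, using plain dicts as an illustrative clock representation; the Sx/Sy/Sz version history mirrors the style of example used in the Dynamo paper.

    def descends(a: dict, b: dict) -> bool:
        """True if clock b descends from clock a, i.e. a is an ancestor of b."""
        return all(b.get(node, 0) >= counter for node, counter in a.items())

    v1 = {"Sx": 1}               # first write, handled by node Sx
    v2 = {"Sx": 2}               # overwritten at Sx: v1 is an ancestor of v2
    v3 = {"Sx": 2, "Sy": 1}      # descends from v2, written at Sy
    v4 = {"Sx": 2, "Sz": 1}      # descends from v2, written at Sz, concurrent with v3

    print(descends(v1, v2))                       # True  -> v1 can be forgotten
    print(descends(v3, v4) or descends(v4, v3))   # False -> conflict, reconcile on read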

Vector clock example

Execution of get() and put() operations

1. Route its request through a generic load balancer that will select a node based on load information.

2. Use a partition-aware client library that routes requests directly to the appropriate coordinator nodes.

Quorum systems

• We are balancing writes and reads over N nodes

• How do we make sure a read sees the latest write?
  – Write to all nodes and wait for replies from all; read from any node
  – Or write to one node and read from all

• Quorum systems: write to W, read from R such that W+R>N
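A toy check of why W + R > N is enough: any write quorum and any read quorum must overlap in at least one replica, so every read contacts at least one replica that saw the latest write. The values of N, W, R and the replica names are made up for the illustration.

    from itertools import combinations

    N, W, R = 3, 2, 2
    replicas = {"n1", "n2", "n3"}

    overlap_always = all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, W)
        for read_set in combinations(replicas, R)
    )
    print(overlap_always)   # True here; with W + R <= N some pairs would not overlap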

Dynamo uses Sloppy Quorum

• Send the write to all N nodes
  – Return when W reply
• Send the read to all N nodes
  – Return result(s) when R reply
• What did we lose?

Hinted handoff

• Assume N = 3. When B is temporarily down or unreachable during a write, send replica to E.

• E's metadata hints that the replica belongs to B, and E will deliver it to B when B recovers.

• Writes will succeed as long as there are W nodes (any W) available in the system (sketch below)
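A rough sketch of hinted handoff as described above. The node interface (alive(), store(), hints(), drop_hint()) is hypothetical; the point is only that a stand-in node keeps the extra replica together with a hint naming its intended owner and returns it later.

    def write_with_hints(key, value, preference_list, standbys, w):
        """Write to the healthy members of the preference list; hand the rest to stand-ins."""
        acks, unavailable = 0, []
        for node in preference_list:
            if node.alive():
                node.store(key, value, hint=None)
                acks += 1
            else:
                unavailable.append(node)
        # Each missed replica goes to a stand-in node, tagged with its intended owner.
        for owner, stand_in in zip(unavailable, standbys):
            stand_in.store(key, value, hint=owner)
            acks += 1
        return acks >= w   # sloppy quorum: any W acknowledgements are enough

    def return_hinted_replicas(stand_in):
        """Run periodically: once a hinted owner is back, deliver the replica and drop the hint."""
        for key, value, owner in list(stand_in.hints()):
            if owner.alive():
                owner.store(key, value, hint=None)
                stand_in.drop_hint(key, owner)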

Dynamo membership

• Membership changes are manually configured
  – A gossip-based protocol propagates membership information
  – Every node knows about every other node's range
• Failures are detected by each node via timeouts
  – Enables hinted handoff, etc.

Implementation

• Java
• Local persistence component allows different storage engines to be plugged in:
  – Berkeley Database (BDB) Transactional Data Store: objects of tens of kilobytes
  – MySQL: objects of more than tens of kilobytes
  – BDB Java Edition, etc.

Evaluation
