
A Scalable Content-Addressable Network

Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker

Proceedings of ACM SIGCOMM 2001

CSE 679 Seminar Presentation


“P2P is the future, dude”

• Witness the wildly popular “file-sharing” peer-to-peer overlays such as Napster and Gnutella (a.k.a. copyright infringement rackets)
  – Introduced in mid-1999, Napster had approximately 50 million users by December 2000
• Peer-to-peer networks provide a means of harnessing the enormous pool of resources in contemporary end-systems
  – Napster was observed to exceed 7 TB of storage on a single day
  – Legitimate applications exist, such as SETI@home
• The paper introduces formal concepts to make peer-to-peer overlay networks scalable, fault-tolerant, and self-organizing


Content-Addressable Networks (CANs)

• The primary scalability issue in peer-to-peer systems is the indexing scheme used to locate the peer containing the desired content
  – A Content-Addressable Network (CAN) is a scalable indexing mechanism
  – Indexing is also a central issue in large-scale storage management systems
• Current popular systems suffer from design choices that inhibit their scalability
  – Napster uses a centralized directory structure; only the actual file transfer is peer-to-peer
  – Gnutella decentralizes file location using a self-organizing application-level mesh; requests are flooded within a certain scope


CAN Design Overview

• Supports basic hash table operations on key-value pairs (K,V): insert, search, delete
• A CAN is composed of individual nodes
• Each node stores a chunk (zone) of the hash table
  – A subset of the (K,V) pairs in the table
• Each node stores state information about neighbor zones (see the node-state sketch after this list)
• Requests (insert, lookup, or delete) for a key are routed by intermediate nodes using a greedy routing algorithm
• Requires no centralized control (completely distributed)
• Small per-node state, independent of the number of nodes in the system (scalable)
• Nodes can route around failures (fault-tolerant)
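To ground the per-node state listed above, here is a minimal Python sketch; the class and field names are illustrative assumptions, not identifiers from the paper.

    class CanNode:
        """Per-node state from the design overview: a zone of the key space,
        the (K,V) pairs that map into it, and a small neighbor table."""
        def __init__(self, zone):
            self.zone = zone        # per-dimension (lo, hi) intervals
            self.store = {}         # local slice of the distributed hash table
            self.neighbors = {}     # node id -> zone, used for greedy forwarding

        def owns(self, point):
            """True if a hashed point falls inside this node's zone."""
            return all(lo <= x < hi for x, (lo, hi) in zip(point, self.zone))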


“Are you in the zone?”

• Virtual d-dimensional Cartesian coordinate system on a d-torus
  – Example: 2-d space [0,1] x [0,1]
• Dynamically partitioned among all nodes
• A pair (K,V) is stored by mapping key K to a point P in the space using a uniform hash function and storing (K,V) at the node whose zone contains P (see the hashing sketch below)
• Retrieve entry (K,V) by applying the same hash function to map K to P and retrieving the entry from the node whose zone contains P
  – If P is not contained in the zone of the requesting node or its neighboring zones, the request is routed to the neighbor node in the zone nearest P
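A minimal sketch of the key-to-point mapping, assuming SHA-1 as the uniform hash (the paper does not prescribe a particular hash function):

    import hashlib

    def key_to_point(key, d=2):
        """Hash a key to a point in the unit d-cube [0,1)^d, one salted
        digest per coordinate."""
        coords = []
        for i in range(d):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            coords.append(int.from_bytes(digest[:8], "big") / 2**64)
        return tuple(coords)

    print(key_to_point("song.mp3"))   # a deterministic point in [0,1)^2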


Routing in a CAN

• Follow a straight-line path through the Cartesian space from source to destination coordinates
• Each node maintains a table of the IP address and virtual coordinate zone of each of its neighbors
• Use greedy routing: forward to the neighbor closest to the destination (sketched below)
• For a d-dimensional space partitioned into n equal zones, each node maintains 2d neighbors
  – Average routing path length:

(d/4)(n^(1/d)) hops
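A sketch of the greedy forwarding decision, assuming each neighbor is summarized by its zone-center coordinate (real CAN nodes compare against whole zones):

    def torus_distance(a, b):
        """Euclidean distance on the unit d-torus; coordinates wrap at 1.0."""
        return sum(min(abs(x - y), 1 - abs(x - y)) ** 2
                   for x, y in zip(a, b)) ** 0.5

    def next_hop(neighbors, dest):
        """Greedy step: the neighbor whose zone center is closest to the
        destination. `neighbors` maps an IP address to a zone-center
        coordinate (an illustrative simplification of the neighbor table)."""
        return min(neighbors, key=lambda ip: torus_distance(neighbors[ip], dest))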


CAN Construction

• A joining node locates a bootstrap node using the CAN DNS entry
  – The bootstrap node provides the IP addresses of random member nodes
• The joining node sends a JOIN request to a random point P in the Cartesian space
• The node in the zone containing P splits the zone and allocates “half” to the joining node (see the split sketch below)
• (K,V) pairs in the allocated “half” are transferred to the joining node
• The joining node learns its neighbor set from the previous zone occupant
  – The previous zone occupant updates its own neighbor set


Departure, Recovery and Maintenance

• Graceful departure: a node hands over its zone and the associated (K,V) pairs to a neighbor
• Network failure: unreachable nodes trigger an immediate takeover algorithm that allocates the failed node’s zone to a neighbor
  – Failures are detected via the absence of periodic refresh messages
  – Each neighbor starts a takeover timer initialized in proportion to its own zone volume (see the timer sketch below)
  – When the timer fires, the node sends a TAKEOVER message containing its zone volume to all of the failed node’s neighbors
  – On receiving a TAKEOVER with a smaller volume, a node kills its timer; otherwise it replies with its own TAKEOVER message
  – The neighbors thereby agree on the live neighbor with the smallest zone volume
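A sketch of the timer initialization; seeding timers in proportion to zone volume makes the smallest-volume live neighbor fire, and claim the zone, first (`base_seconds` is an assumed scale factor, not a value from the paper):

    def zone_volume(zone):
        """Volume of a zone given as per-dimension (lo, hi) intervals."""
        volume = 1.0
        for lo, hi in zone:
            volume *= hi - lo
        return volume

    def takeover_timer(my_zone, base_seconds=1.0):
        """Takeover timer proportional to the node's own zone volume."""
        return base_seconds * zone_volume(my_zone)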


More maintenance details…

• In the case of multiple concentrated failures, neighbors may need to perform an expanding ring search to build sufficient neighbor state before initiating the takeover algorithm
• The authors suggest a background zone-reassignment algorithm to prevent fragmentation of the space
  – Over time, nodes may take over zones that cannot be merged with their own zones


CAN Improvements

• CAN offers a tradeoff between per-node state, O(d), and path length, O(d·n^(1/d)) (see the numeric sketch after this list)
  – Path length is measured in application-level hops
  – Neighbor nodes may be geographically distant
• Goal: a lookup latency comparable to the underlying IP path latency
  – Several optimizations that reduce lookup latency also improve robustness in terms of routing and data availability
• Approach: reduce the path length, reduce the per-hop latency, and add load balancing
• The CAN design was simulated on Transit-Stub (TS) topologies produced by the GT-ITM topology generator (Zegura et al.)
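To make the state/path tradeoff concrete, a quick numeric look at a hypothetical million-node CAN using the 2d-neighbor and (d/4)(n^(1/d)) average-hop figures from the earlier slides:

    n = 1_000_000
    for d in (2, 3, 4, 6, 10):
        hops = d / 4 * n ** (1 / d)
        print(f"d={d:2d}: neighbors={2 * d:2d}, avg hops ~ {hops:6.1f}")
    # Higher d buys much shorter paths for modestly more per-node state.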


Adding Dimensions

• Increasing the number of dimensions of the coordinate space reduces the routing path length (and latency)
  – Small increase in the size of the routing table at each node
• The increased number of neighbors improves routing fault-tolerance
  – More potential next-hop nodes
• Simulated path lengths follow O(d·n^(1/d))


Beaucoup de Réalités

• Nodes can maintain multiple independent coordinate spaces (realities)
• For a CAN with r realities, a single node is assigned r zones and holds r independent neighbor sets
  – The contents of the hash table are replicated for each reality (see the sketch below)
• Example: with three realities, a (K,V) pair mapping to P:(x,y,z) may be stored at three different nodes
  – (K,V) is unavailable only when all three copies are unavailable
  – Route using the neighbor on the reality whose zone is closest to (x,y,z)
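A thin sketch of replica lookup across realities; because each reality partitions the space independently, the same point P belongs to a different node in each one (`reality_lookups` is a hypothetical stand-in for the r partitions):

    def owners_across_realities(point, reality_lookups):
        """Return the r nodes storing (K,V): one zone->node lookup callable
        per reality, all applied to the same hashed point P."""
        return [lookup(point) for lookup in reality_lookups]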


Dimensions vs. Realities

• Increasing the number of dimensions and/or realities decreases path length and increases per-node state
• Adding dimensions has the greater effect on path length
• Adding realities provides stronger fault-tolerance and increased data availability
• The authors do not quantify the differing storage requirements
  – More realities require replicating (K,V) pairs


RTT Ratio & Zone Overloading• Incorporate RTT in routing metric

– Each node measures RTT to each neighbor

– Forward messages to neighbor with maximum ratio of progress to RTT

• Overload coordinate zones– Allow multiple nodes to share the same zone, bounded by a threshold

MAXPEERS– Nodes maintain peer state, but not additional neighbor state– Periodically poll neighbor for its list of peers, measure RTT to each peer,

retain lowest RTT node as neighbor– (K,V) pairs may be divided among peer nodes or replicated
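A sketch of the RTT-weighted forwarding choice, assuming each neighbor entry carries a zone-center coordinate and a measured RTT (illustrative names; torus wraparound omitted for brevity):

    def dist(a, b):
        """Plain Euclidean distance (wraparound omitted for brevity)."""
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def best_neighbor(here, dest, neighbors):
        """Neighbor maximizing progress toward the destination per unit of
        RTT. `neighbors` maps node id -> (zone_center, rtt_ms)."""
        def score(node_id):
            center, rtt = neighbors[node_id]
            return (dist(here, dest) - dist(center, dest)) / rtt
        return max(neighbors, key=score)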


Multiple Hash Functions

• Improve data availability by using k hash functions to map a single key to k points in the coordinate space (see the sketch below)
• Replicate (K,V) and store it at k distinct nodes
• (K,V) is unavailable only when all k replicas are simultaneously unavailable
• The authors suggest querying all k nodes in parallel to reduce the average lookup latency
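A sketch of the k-fold mapping, with salted SHA-1 digests standing in for k independent uniform hash functions (an assumption; the paper does not fix the hash family):

    import hashlib

    def hash_to_point(data, d=2):
        """Uniformly hash a string to a point in the unit d-cube."""
        return tuple(
            int.from_bytes(hashlib.sha1(f"{i}:{data}".encode()).digest()[:8],
                           "big") / 2**64
            for i in range(d))

    def replica_points(key, k, d=2):
        """The k points whose owning nodes each store a replica of (K,V)."""
        return [hash_to_point(f"h{j}|{key}", d) for j in range(k)]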


“You’re from Jersey! What exit?”

• Use landmarks for topologically-sensitive construction
• Assume the existence of well-known machines such as DNS servers
• Each node measures its RTT to each landmark
  – Landmarks are ordered by increasing RTT (see the sketch below)
  – For m landmarks there are m! possible orderings
• Partition the coordinate space into m! equal-size partitions
• Nodes join the CAN at a random point in the partition corresponding to their landmark ordering
• Latency stretch is the ratio of CAN latency to IP network latency


Other optimizations

• Run a background load-balancing technique to offload from densely populated bins to sparsely populated bins (partitions of the space)
• Volume balancing for more uniform partitioning (see the sketch below)
  – When a JOIN is received, examine the zone’s volume and the neighbors’ zone volumes
  – Split the zone with the largest volume
  – This results in 90% of nodes having zones of equal volume
• Caching and replication for “hot spot” management
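A sketch of the volume-balancing JOIN decision: instead of always splitting the zone containing the join point P, split the largest zone among it and its neighbors (names are illustrative):

    def zone_volume(zone):
        """Volume of a zone given as per-dimension (lo, hi) intervals."""
        volume = 1.0
        for lo, hi in zone:
            volume *= hi - lo
        return volume

    def zone_to_split(candidates):
        """`candidates` maps node id -> zone for the zone containing P and
        its neighbors; return the occupant of the largest zone."""
        return max(candidates, key=lambda n: zone_volume(candidates[n]))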


Simulation Results

• NS simulation of a Transit-Stub topology with n = 2^17 (~130K) nodes


Related Work

• Plaxton algorithm
  – Originally developed for web caching
  – Assumptions: administratively configured, stable hosts; scales to thousands of hosts
  – Uses multi-level labels for each node
  – Routes by incrementally resolving the destination label
• Domain Name System (DNS)
• OceanStore
• Publius
• Peer-to-peer file sharing systems


General Comments

• Extremely well-written paper
• Logical flow
• Thorough exploration of the design space
• Intuitive presentation of formal concepts
  – Introduces many metrics and parameters in a clear, concise way
• Presents workable tradeoffs between neighbor state and path length
• The primary omission is the effect of replication
  – Many of the optimizations involve replicating (K,V) pairs
  – Under reasonable assumptions, what is the replication limit?
  – How does this affect the choice of design parameters?