
A Scalable Content-Addressable Network

Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker

Proceedings of ACM SIGCOMM 2001

CSE 679 Seminar Presentation


“P2P is the future, dude”

• Witness the wildly popular “file-sharing” peer-to-peer overlays such as Napster and Gnutella (a.k.a. copyright infringement rackets)
  – Introduced in mid-1999, Napster had approximately 50 million users by December 2000
• Peer-to-peer networks provide a means of harnessing the enormous pool of resources in contemporary end-systems
  – Napster was observed to exceed 7 TB of storage on a single day
  – Legitimate applications exist, such as SETI@home
• The paper introduces formal concepts to make peer-to-peer overlay networks scalable, fault-tolerant, and self-organizing


Content-Addressable Networks (CANs)

• The primary scalability issue in peer-to-peer systems is the indexing scheme used to locate the peer containing the desired content
  – A Content-Addressable Network (CAN) is a scalable indexing mechanism
  – Indexing is also a central issue in large-scale storage management systems
• Current popular systems suffer from design choices that inhibit their scalability
  – Napster uses a centralized directory structure; only the actual file transfer is peer-to-peer
  – Gnutella decentralizes file location using a self-organizing application-level mesh; requests are flooded within a certain scope


CAN Design Overview

• Supports basic hash table operations on key-value pairs (K,V): insert, search, delete
• A CAN is composed of individual nodes
• Each node stores a chunk (zone) of the hash table
  – A subset of the (K,V) pairs in the table
• Each node stores state information about neighbor zones (see the node-state sketch after this list)
• Requests (insert, lookup, or delete) for a key are routed by intermediate nodes using a greedy routing algorithm
• Requires no centralized control (completely distributed)
• Small per-node state, independent of the number of nodes in the system (scalable)
• Nodes can route around failures (fault-tolerant)
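To ground the per-node state listed above, here is a minimal Python sketch; the class and field names are illustrative assumptions, not identifiers from the paper.

    class CanNode:
        """Per-node state from the design overview: a zone of the key space,
        the (K,V) pairs that map into it, and a small neighbor table."""
        def __init__(self, zone):
            self.zone = zone        # per-dimension (lo, hi) intervals
            self.store = {}         # local slice of the distributed hash table
            self.neighbors = {}     # node id -> zone, used for greedy forwarding

        def owns(self, point):
            """True if a hashed point falls inside this node's zone."""
            return all(lo <= x < hi for x, (lo, hi) in zip(point, self.zone))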


“Are you in the zone?”

• Virtual d-dimensional Cartesian coordinate system on a d-torus
  – Example: 2-d space [0,1] x [0,1]
• Dynamically partitioned among all nodes
• A pair (K,V) is stored by mapping key K to a point P in the space using a uniform hash function and storing (K,V) at the node whose zone contains P (see the hashing sketch below)
• Retrieve entry (K,V) by applying the same hash function to map K to P and retrieving the entry from the node whose zone contains P
  – If P is not contained in the zone of the requesting node or its neighboring zones, the request is routed to the neighbor node in the zone nearest P
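A minimal sketch of the key-to-point mapping, assuming SHA-1 as the uniform hash (the paper does not prescribe a particular hash function):

    import hashlib

    def key_to_point(key, d=2):
        """Hash a key to a point in the unit d-cube [0,1)^d, one salted
        digest per coordinate."""
        coords = []
        for i in range(d):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            coords.append(int.from_bytes(digest[:8], "big") / 2**64)
        return tuple(coords)

    print(key_to_point("song.mp3"))   # a deterministic point in [0,1)^2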


Routing in a CAN

• Follow a straight-line path through the Cartesian space from source to destination coordinates
• Each node maintains a table of the IP address and virtual coordinate zone of each of its neighbors
• Use greedy routing: forward to the neighbor closest to the destination (sketched below)
• For a d-dimensional space partitioned into n equal zones, each node maintains 2d neighbors
  – Average routing path length:

(d/4)(n^(1/d)) hops
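A sketch of the greedy forwarding decision, assuming each neighbor is summarized by its zone-center coordinate (real CAN nodes compare against whole zones):

    def torus_distance(a, b):
        """Euclidean distance on the unit d-torus; coordinates wrap at 1.0."""
        return sum(min(abs(x - y), 1 - abs(x - y)) ** 2
                   for x, y in zip(a, b)) ** 0.5

    def next_hop(neighbors, dest):
        """Greedy step: the neighbor whose zone center is closest to the
        destination. `neighbors` maps an IP address to a zone-center
        coordinate (an illustrative simplification of the neighbor table)."""
        return min(neighbors, key=lambda ip: torus_distance(neighbors[ip], dest))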


CAN Construction

• A joining node locates a bootstrap node using the CAN DNS entry
  – The bootstrap node provides the IP addresses of random member nodes
• The joining node sends a JOIN request to a random point P in the Cartesian space
• The node in the zone containing P splits the zone and allocates “half” to the joining node (see the split sketch below)
• (K,V) pairs in the allocated “half” are transferred to the joining node
• The joining node learns its neighbor set from the previous zone occupant
  – The previous zone occupant updates its own neighbor set


Departure, Recovery and Maintenance

• Graceful departure: a node hands over its zone and the associated (K,V) pairs to a neighbor
• Network failure: unreachable nodes trigger an immediate takeover algorithm that allocates the failed node’s zone to a neighbor
  – Failures are detected via the absence of periodic refresh messages
  – Each neighbor starts a takeover timer initialized in proportion to its own zone volume (see the timer sketch below)
  – When the timer fires, the node sends a TAKEOVER message containing its zone volume to all of the failed node’s neighbors
  – On receiving a TAKEOVER with a smaller volume, a node kills its timer; otherwise it replies with its own TAKEOVER message
  – The neighbors thereby agree on the live neighbor with the smallest zone volume
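A sketch of the timer initialization; seeding timers in proportion to zone volume makes the smallest-volume live neighbor fire, and claim the zone, first (`base_seconds` is an assumed scale factor, not a value from the paper):

    def zone_volume(zone):
        """Volume of a zone given as per-dimension (lo, hi) intervals."""
        volume = 1.0
        for lo, hi in zone:
            volume *= hi - lo
        return volume

    def takeover_timer(my_zone, base_seconds=1.0):
        """Takeover timer proportional to the node's own zone volume."""
        return base_seconds * zone_volume(my_zone)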


More maintenance details…

• In the case of multiple concentrated failures, neighbors may need to perform an expanding ring search to build sufficient neighbor state before initiating the takeover algorithm
• The authors suggest a background zone-reassignment algorithm to prevent fragmentation of the space
  – Over time, nodes may take over zones that cannot be merged with their own zones


CAN Improvements

• CAN offers a tradeoff between per-node state, O(d), and path length, O(d·n^(1/d)) (see the numeric sketch after this list)
  – Path length is measured in application-level hops
  – Neighbor nodes may be geographically distant
• Goal: a lookup latency comparable to the underlying IP path latency
  – Several optimizations that reduce lookup latency also improve robustness in terms of routing and data availability
• Approach: reduce the path length, reduce the per-hop latency, and add load balancing
• The CAN design was simulated on Transit-Stub (TS) topologies produced by the GT-ITM topology generator (Zegura et al.)
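To make the state/path tradeoff concrete, a quick numeric look at a hypothetical million-node CAN using the 2d-neighbor and (d/4)(n^(1/d)) average-hop figures from the earlier slides:

    n = 1_000_000
    for d in (2, 3, 4, 6, 10):
        hops = d / 4 * n ** (1 / d)
        print(f"d={d:2d}: neighbors={2 * d:2d}, avg hops ~ {hops:6.1f}")
    # Higher d buys much shorter paths for modestly more per-node state.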


Adding Dimensions

• Increasing the number of dimensions of the coordinate space reduces the routing path length (and latency)
  – Small increase in the size of the routing table at each node
• The increased number of neighbors improves routing fault-tolerance
  – More potential next-hop nodes
• Simulated path lengths follow O(d·n^(1/d))


Beaucoup de Réalités

• Nodes can maintain multiple independent coordinate spaces (realities)
• For a CAN with r realities, a single node is assigned r zones and holds r independent neighbor sets
  – The contents of the hash table are replicated for each reality (see the sketch below)
• Example: with three realities, a (K,V) pair mapping to P:(x,y,z) may be stored at three different nodes
  – (K,V) is unavailable only when all three copies are unavailable
  – Route using the neighbor on the reality whose zone is closest to (x,y,z)
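A thin sketch of replica lookup across realities; because each reality partitions the space independently, the same point P belongs to a different node in each one (`reality_lookups` is a hypothetical stand-in for the r partitions):

    def owners_across_realities(point, reality_lookups):
        """Return the r nodes storing (K,V): one zone->node lookup callable
        per reality, all applied to the same hashed point P."""
        return [lookup(point) for lookup in reality_lookups]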


Dimensions vs. Realities

• Increasing the number of dimensions and/or realities decreases path length and increases per-node state
• Adding dimensions has the greater effect on path length
• Adding realities provides stronger fault-tolerance and increased data availability
• The authors do not quantify the differing storage requirements
  – More realities require replicating (K,V) pairs


RTT Ratio & Zone Overloading• Incorporate RTT in routing metric

– Each node measures RTT to each neighbor

– Forward messages to neighbor with maximum ratio of progress to RTT

• Overload coordinate zones– Allow multiple nodes to share the same zone, bounded by a threshold

MAXPEERS– Nodes maintain peer state, but not additional neighbor state– Periodically poll neighbor for its list of peers, measure RTT to each peer,

retain lowest RTT node as neighbor– (K,V) pairs may be divided among peer nodes or replicated
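A sketch of the RTT-weighted forwarding choice, assuming each neighbor entry carries a zone-center coordinate and a measured RTT (illustrative names; torus wraparound omitted for brevity):

    def dist(a, b):
        """Plain Euclidean distance (wraparound omitted for brevity)."""
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def best_neighbor(here, dest, neighbors):
        """Neighbor maximizing progress toward the destination per unit of
        RTT. `neighbors` maps node id -> (zone_center, rtt_ms)."""
        def score(node_id):
            center, rtt = neighbors[node_id]
            return (dist(here, dest) - dist(center, dest)) / rtt
        return max(neighbors, key=score)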


Multiple Hash Functions

• Improve data availability by using k hash functions to map a single key to k points in the coordinate space (see the sketch below)
• Replicate (K,V) and store it at k distinct nodes
• (K,V) is unavailable only when all k replicas are simultaneously unavailable
• The authors suggest querying all k nodes in parallel to reduce the average lookup latency
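A sketch of the k-fold mapping, with salted SHA-1 digests standing in for k independent uniform hash functions (an assumption; the paper does not fix the hash family):

    import hashlib

    def hash_to_point(data, d=2):
        """Uniformly hash a string to a point in the unit d-cube."""
        return tuple(
            int.from_bytes(hashlib.sha1(f"{i}:{data}".encode()).digest()[:8],
                           "big") / 2**64
            for i in range(d))

    def replica_points(key, k, d=2):
        """The k points whose owning nodes each store a replica of (K,V)."""
        return [hash_to_point(f"h{j}|{key}", d) for j in range(k)]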


“You’re from Jersey! What exit?”

• Use landmarks for topologically-sensitive construction
• Assume the existence of well-known machines such as DNS servers
• Each node measures its RTT to each landmark
  – Landmarks are ordered by increasing RTT (see the sketch below)
  – For m landmarks there are m! possible orderings
• Partition the coordinate space into m! equal-size partitions
• Nodes join the CAN at a random point in the partition corresponding to their landmark ordering
• Latency stretch is the ratio of CAN latency to IP network latency


Other optimizations

• Run a background load-balancing technique to offload from densely populated bins to sparsely populated bins (partitions of the space)
• Volume balancing for more uniform partitioning (see the sketch below)
  – When a JOIN is received, examine the zone’s volume and the neighbors’ zone volumes
  – Split the zone with the largest volume
  – This results in 90% of nodes having zones of equal volume
• Caching and replication for “hot spot” management
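A sketch of the volume-balancing JOIN decision: instead of always splitting the zone containing the join point P, split the largest zone among it and its neighbors (names are illustrative):

    def zone_volume(zone):
        """Volume of a zone given as per-dimension (lo, hi) intervals."""
        volume = 1.0
        for lo, hi in zone:
            volume *= hi - lo
        return volume

    def zone_to_split(candidates):
        """`candidates` maps node id -> zone for the zone containing P and
        its neighbors; return the occupant of the largest zone."""
        return max(candidates, key=lambda n: zone_volume(candidates[n]))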


Simulation Results

• NS simulation of a Transit-Stub topology with n = 2^17 (~130K) nodes


Related Work

• Plaxton algorithm
  – Originally developed for web caching
  – Assumptions: administratively configured, stable hosts; scales to thousands of hosts
  – Uses multi-level labels for each node
  – Routes by incrementally resolving the destination label
• Domain Name System (DNS)
• OceanStore
• Publius
• Peer-to-peer file sharing systems


General Comments

• Extremely well-written paper
• Logical flow
• Thorough exploration of the design space
• Intuitive presentation of formal concepts
  – Introduces many metrics and parameters in a clear, concise way
• Presents workable tradeoffs between neighbor state and path length
• The primary omission is the effect of replication
  – Many of the optimizations involve replicating (K,V) pairs
  – Under reasonable assumptions, what is the replication limit?
  – How does this affect the choice of design parameters?