Peer-to-Peer Networks
Distributed Algorithms for P2P
Distributed Hash Tables
P. Felber, [email protected]
http://www.eurecom.fr/~felber/
Agenda
What are DHTs? Why are they useful?
What makes a “good” DHT design?
Case studies: Chord, Pastry (locality), TOPLUS (topology-awareness)
What are the open problems?
What is P2P?
A distributed system architecture
No centralized control
Typically many nodes, but unreliable and heterogeneous
Nodes are symmetric in function
Takes advantage of distributed, shared resources (bandwidth, CPU, storage) on peer nodes
Fault-tolerant, self-organizing
Operates in a dynamic environment; frequent joins and leaves are the norm
P2P Challenge: Locating Content
Simple strategy: expanding ring search until the content is found
If r of N nodes have a copy, the expected search cost is at least N/r, i.e., O(N)
Need many copies to keep the overhead small
Directed Searches
Idea:
Assign particular nodes to hold particular content (or know where it is)
When a node wants this content, it goes to the node that is supposed to hold it (or know where it is)
Challenges:
Avoid bottlenecks: distribute the responsibilities “evenly” among the existing nodes
Adapt to nodes joining or leaving (or failing):
Give responsibilities to joining nodes
Redistribute responsibilities from leaving nodes
Idea: Hash Tables
A hash table associates data with keys
The key is hashed to find its bucket in the hash table
Each bucket is expected to hold #items/#buckets items
In a Distributed Hash Table (DHT), nodes are the hash buckets
The key is hashed to find the responsible peer node
Data and load are balanced across nodes
[Figure: a classic hash table and a DHT side by side. In both, the interface is insert(key, data) and lookup(key) → data, and a key (e.g., “Beattles”) is hashed with h(key) % N to position 2. In the hash table, position 2 is a bucket holding items; in the DHT, position 2 is a peer node.]
DHTs: Problems
Problem 1 (dynamicity): adding or removing nodes
With hash mod N, virtually every key will change its location:
h(k) mod m ≠ h(k) mod (m+1) ≠ h(k) mod (m-1)
Solution: use consistent hashing
Define a fixed hash space; all hash values fall within that space and do not depend on the number of peers (hash buckets)
Each key goes to the peer closest to its ID in the hash space (according to some proximity metric)
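To make the contrast with hash-mod-N concrete, here is a minimal sketch of consistent hashing in Python. It is illustrative only: the SHA-1 hash, the 32-bit ID space, and the node names are assumptions, not something specified on the slides.

```python
import hashlib
from bisect import bisect_left

M = 2**32  # fixed hash space, independent of the number of peers

def h(value: str) -> int:
    # Hash a string into the fixed ID space [0, 2^32).
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % M

class ConsistentHashRing:
    def __init__(self, node_names):
        # Each node gets a fixed position in the ID space.
        self.ring = sorted((h(name), name) for name in node_names)

    def lookup(self, key: str) -> str:
        # The key goes to the first node at or after h(key), wrapping around.
        k = h(key)
        ids = [node_id for node_id, _ in self.ring]
        i = bisect_left(ids, k) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.lookup("Beattles"))
```

With this scheme, adding or removing a node only moves the keys between that node and its predecessor, instead of relocating virtually every key as hash mod N does.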
DHTs: Problems (cont’d)
Problem 2 (size): all nodes must be known to insert or look up data
This only works with small and static server populations
Solution: each peer knows of only a few “neighbors”
Messages are routed through neighbors via multiple hops (overlay routing)
What Makes a Good DHT Design
For each object, the node(s) responsible for that object should be reachable via a “short” path (small diameter); the different DHTs differ fundamentally only in the routing approach
The number of neighbors of each node should remain “reasonable” (small degree)
DHT routing mechanisms should be decentralized (no single point of failure or bottleneck)
Should gracefully handle nodes joining and leaving:
Repartition the affected keys over the existing nodes
Reorganize the neighbor sets
Bootstrap mechanisms to connect new nodes into the DHT
To achieve good performance, the DHT must provide low stretch:
Minimize the ratio of DHT-routing latency to unicast (IP) latency
DHT Interface
Minimal interface (data-centric):
Lookup(key) → IP address
Supports a wide range of applications, because it imposes few restrictions:
Keys have no semantic meaning
Values are application-dependent
DHTs do not store the data; data storage can be built on top of DHTs:
Lookup(key) → data
Insert(key, data)
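The split between the two interfaces can be sketched as follows. This is only an illustration of the layering: the dht and transport objects and their lookup/send methods are assumptions, not an API defined on the slides.

```python
class DHTStore:
    """Storage layered on top of a DHT that only resolves keys to nodes."""

    def __init__(self, dht, transport):
        self.dht = dht              # assumed to expose lookup(key) -> IP address
        self.transport = transport  # assumed to expose send(ip, message) -> reply

    def insert(self, key, data):
        ip = self.dht.lookup(key)                        # DHT: key -> responsible node
        self.transport.send(ip, ("insert", key, data))   # the data lives on that node

    def lookup(self, key):
        ip = self.dht.lookup(key)
        return self.transport.send(ip, ("lookup", key))  # -> data
```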
DHTs in Context
A typical layering (the CFS storage system), from top to bottom:
User Application
File System (CFS): store_file / load_file; retrieves and stores files, maps files to blocks
Reliable Block Storage (DHash): store_block / load_block; storage, replication, caching
DHT (Chord): lookup; key-based routing
Transport (TCP/IP): send / receive; communication
DHTs Support Many Applications
File sharing [CFS, OceanStore, PAST, …]
Web cache [Squirrel, …]
Censor-resistant stores [Eternity, FreeNet, …]
Application-layer multicast [Narada, …]
Event notification [Scribe]
Naming systems [ChordDNS, INS, …]
Query and indexing [Kademlia, …]
Communication primitives [I3, …]
Backup store [HiveNet]
Web archive [Herodotus]
DHT Case Studies
Case studies: Chord, Pastry, TOPLUS
Questions:
How is the hash space divided evenly among nodes?
How do we locate a node?
How do we maintain routing tables?
How do we cope with (rapid) changes in membership?
Chord (MIT)
Circular m-bit ID space for both keys and nodes
Node ID = SHA-1(IP address)
Key ID = SHA-1(key)
A key is mapped to the first node whose ID is equal to or follows the key ID
Each node is responsible for O(K/N) keys
O(K/N) keys move when a node joins or leaves
[Figure: Chord ring with m = 6 (IDs from 0 to 2^m - 1), nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56, and keys K10, K24, K30, K38, K54, each stored at the first node that follows the key ID.]
Chord State and Lookup (1)
Basic Chord: each node knows only 2 other nodes on the ring:
Its successor
Its predecessor (for ring management)
Lookup is achieved by forwarding requests around the ring through successor pointers
Requires O(N) hops
[Figure: Chord ring with m = 6; lookup(K54) issued at N8 is forwarded successor by successor around the ring until it reaches N56, the node responsible for K54.]
Chord State and Lookup (2)
Each node knows m other nodes on the ring:
Successors: finger i of n points to the node at n + 2^i (or its successor)
Predecessor (for ring management)
O(log N) state per node
Lookup is achieved by following the closest preceding fingers, then the successor
O(log N) hops
[Figure: Chord ring with m = 6 and N8’s finger table: N8+1 → N14, N8+2 → N14, N8+4 → N14, N8+8 → N21, N8+16 → N32, N8+32 → N42. lookup(K54) from N8 jumps by the largest useful finger first (+32, then +16, +8, +4, +2, +1 as needed) until it reaches the node responsible for K54.]
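The finger-based lookup can be sketched in a few lines of Python. This is an illustrative in-memory simulation under assumed helper names (Node, between), not the Chord RPC protocol:

```python
M = 6                       # bits in the ID space, as in the figure
RING = 2 ** M

def between(x, a, b):
    # True if x lies in the half-open ring interval (a, b].
    a, b, x = a % RING, b % RING, x % RING
    if a < b:
        return a < x <= b
    return x > a or x <= b  # the interval wraps around 0

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.successor = None   # set when the ring is built
        self.fingers = []       # fingers[i] ~ successor(id + 2^i)

    def closest_preceding_finger(self, key):
        for f in reversed(self.fingers):            # try the largest jump first
            if between(f.id, self.id, key - 1):
                return f
        return self.successor

    def lookup(self, key):
        # Follow closest preceding fingers until the key falls between a node and its successor.
        n = self
        while not between(key, n.id, n.successor.id):
            n = n.closest_preceding_finger(key)
        return n.successor      # the node responsible for key, reached in O(log N) hops
```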
Chord Ring Management
For correctness, Chord needs to maintain the following invariants:
For every key k, succ(k) is responsible for k
Successor pointers are correctly maintained
Finger tables are not necessary for correctness:
One can always fall back to successor-based lookup
Finger tables can be updated lazily
Joining the Ring
Three-step process:
Initialize all fingers of the new node
Update the fingers of existing nodes
Transfer keys from the successor to the new node
Joining the Ring: Step 1
Initialize the new node’s finger table:
Locate any node n already in the ring
Ask n to look up the peers at j + 2^0, j + 2^1, j + 2^2, …
Use the results to populate the finger table of the new node j
Joining the Ring: Step 2
Updating the fingers of existing nodes:
The new node j calls an update function on the existing nodes that must point to j
For finger i, these are the nodes in the range [pred(j) - 2^i + 1, j - 2^i]
O(log N) nodes need to be updated
[Figure: Chord ring with m = 6 and N8’s finger table; a new node N28 joins between N21 and N32. For i = 4 (+16), the nodes in [pred(28) - 16 + 1, 28 - 16] = [6, 12] must update that finger, so N8’s entry N8+16 changes from N32 to N28.]
Joining the Ring: Step 3
Transfer key responsibility:
Connect to the successor
Copy keys from the successor to the new node
Update the successor pointer and remove the transferred keys
Only the keys in the new node’s range are transferred
[Figure: N28 joins between N21 and N32, which holds K24 and K30. K24, which now falls in N28’s range, is copied to N28, N21’s successor pointer is updated to N28, and K24 is finally removed from N32; K30 stays at N32.]
Stabilization
Case 1: finger tables are reasonably fresh
Case 2: successor pointers are correct, but fingers are not
Case 3: successor pointers are inaccurate or key migration is incomplete (must be avoided!)
A stabilization algorithm periodically verifies and refreshes node pointers (including fingers).
Basic principle (at node n):
x = n.succ.pred
if x ∈ (n, n.succ): n.succ = x
notify n.succ
Eventually stabilizes the system when no node joins or fails
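A slightly fuller sketch of the stabilize/notify pair, written in Python purely for illustration (the field names follow the pseudocode above; the ring-interval helper is assumed):

```python
def in_interval(x, a, b):
    # x lies in the open ring interval (a, b), with wrap-around.
    if a < b:
        return a < x < b
    return x > a or x < b

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.succ = self        # a one-node ring points to itself
        self.pred = None

    def stabilize(self):
        # Run periodically: adopt a closer successor if one has appeared, then notify it.
        x = self.succ.pred
        if x is not None and in_interval(x.id, self.id, self.succ.id):
            self.succ = x
        self.succ.notify(self)

    def notify(self, n):
        # n believes it might be our predecessor.
        if self.pred is None or in_interval(n.id, self.pred.id, self.id):
            self.pred = n
```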
Dealing With Failures
The failure of nodes may cause incorrect lookups:
N8 does not know its correct successor, so the lookup of K19 fails
Solution: successor lists
Each node n knows its r immediate successors
After a failure, n knows the first live successor and updates its successor list
Correct successors guarantee correct lookups
[Figure: Chord ring with m = 6 showing N8, N18, and key K19; when intermediate nodes such as N14 fail, lookup(K19) issued at N8 fails because N8 does not know a correct live successor.]
Dealing With Failures (cont’d)
Successor lists guarantee correct lookups with some probability
We can choose r to make the probability of lookup failure arbitrarily small
Assume that half of the nodes fail and that failures are independent:
P(n’s successor list is entirely dead) = 0.5^r
P(n does not break the Chord ring) = 1 - 0.5^r
P(no node breaks the ring) = (1 - 0.5^r)^N
r = 2 log2(N) makes this probability about 1 - 1/N
With high probability (1 - 1/N), the ring is not broken
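As a quick sanity check of these formulas (the network size below is an illustrative assumption):

```python
import math

N = 1000                               # illustrative number of nodes
r = round(2 * math.log2(N))            # successor-list length suggested above: r = 20

p_node_breaks = 0.5 ** r               # all r successors of one node are dead
p_ring_intact = (1 - p_node_breaks) ** N
print(r, p_ring_intact)                # ~0.999, i.e., about 1 - 1/N
```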
Evolution of P2P Systems
Nodes leave frequently, so surviving nodes must be notified of arrivals to stay connected after their original neighbors fail
Take time t with N nodes:
Doubling time: time from t until N new nodes join
Halving time: time from t until N nodes leave
Half-life: the minimum of the halving and doubling times
Theorem: there exists a sequence of joins and leaves such that any node that has received fewer than k notifications per half-life will be disconnected with probability at least (1 - 1/(e - 1))^k ≈ 0.418^k
Chord and Network Topology
Nodes that are numerically close are not topologically close (1M nodes = 10+ hops).
Pastry (MSR)
Circular m-bit ID space for both keys and nodes
Addresses are written in base 2^b, with m/b digits
Node ID = SHA-1(IP address)
Key ID = SHA-1(key)
A key is mapped to the node whose ID is numerically closest to the key ID
[Figure: Pastry ring with m = 8 and b = 2 (IDs are 4 base-4 digits), nodes N0002, N0201, N0322, N1113, N2001, N2120, N2222, N3001, N3033, N3200, and keys K0220, K1201, K1320, K2120, K3122, each stored at the numerically closest node.]
Pastry Lookup
Prefix routing from A to B:
At the h-th hop, arrive at a node that shares a prefix of length at least h digits with B
Example: 5324 routes to 0629 via 5324 → 0748 → 0605 → 0620 → 0629
If there is no such node, forward the message to a neighbor numerically closer to the destination (successor), e.g., 5324 → 0748 → 0605 → 0609 → 0620 → 0629
O(log_{2^b} N) hops
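A hedged sketch of the per-hop decision is shown below. The digit base, the table layout, and the helper names are assumptions made for illustration; real Pastry consults the leaf set as well.

```python
def shared_prefix_len(a: str, b: str) -> int:
    # Number of leading digits the two IDs have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(node_id: str, key: str, routing_table, leaf_set):
    """Pick the next hop for `key` at `node_id`.

    routing_table[row][d] is a known node sharing `row` digits with node_id
    and having digit d next, or None if no such node is known."""
    row = shared_prefix_len(node_id, key)
    if row == len(key):
        return node_id                        # this node is responsible for the key
    entry = routing_table[row][int(key[row])]
    if entry is not None:
        return entry                          # one more matching digit than we have
    # Fallback: any known node numerically closer to the key than we are.
    known = [n for r in routing_table for n in r if n] + list(leaf_set)
    return min(known + [node_id], key=lambda n: abs(int(n) - int(key)))
```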
Pastry State and Lookup
For each prefix length, a node knows some other node (if any) with the same prefix and a different next digit
For instance, N0201 knows:
empty prefix: N1???, N2???, N3???
N0: N00??, N01??, N03??
N02: N021?, N022?, N023?
N020: N0200, N0202, N0203
When there are multiple candidates, choose the topologically closest one
This maintains good locality properties (more on that later)
[Figure: Pastry ring with m = 8 and b = 2, showing entries of N0201’s routing table (e.g., N0122, N0212, N0221, N0233) and the path taken by lookup(K2120), which gains at least one matching prefix digit per hop.]
A Pastry Routing Table
[Figure: the state of the node with ID 10233102, for b = 2 (node IDs are written in base 4) and m = 16. It consists of:
A leaf set, split into numerically smaller and larger halves: the nodes numerically closest to the local node; it must be kept up to date.
A routing table with m/b rows and 2^b - 1 populated entries per row: entries in row n share the first n digits with the local node and have the form [common-prefix | next-digit | rest], the column giving the next digit; entries for which no suitable node ID is known are left empty.
A neighborhood set: the nodes closest to the local node according to the proximity metric.]
Pastry and Network Topology
The expected distance to the nodes referenced in the routing table increases with the row number: numerical jumps get smaller and smaller, while topological jumps get bigger and bigger.
Joining
[Figure: node X (ID 0629) joins. X knows a topologically close node A (5324) and sends a join message, which is routed to the node numerically closest to X’s ID: A (5324) → B (0748) → C (0605) → D (0620). X then builds its own state from the nodes along this path: A’s neighborhood set, row 0 of its routing table from A, row 1 from B, row 2 from C, row 3 from D, and D’s leaf set.]
Locality
The joining phase preserves the locality property:
First, A must be near X
The entries in row zero of A’s routing table are close to A, and A is close to X, so X’s row zero (X0) can be A’s row zero (A0)
The distance from B to the nodes in its row one (B1) is much larger than the distance from A to B (B appears in A0), so B1 is a reasonable choice for X1, C2 for X2, etc.
To avoid cascading errors, X then requests the state from each node in its routing table and updates its own entries with any closer node it finds
This scheme works “pretty well” in practice:
It minimizes the distance of the next routing step, but with no sense of global direction
Stretch is around 2-3
Node Departure
A node is considered failed when its immediate neighbors in the node ID space can no longer communicate with it
To replace a failed node in the leaf set, a node contacts the live node with the largest index on the side of the failed node and asks for its leaf set
To repair a failed routing table entry R_d^l (row l, column d), a node first contacts the node referred to by another entry R_i^l of the same row and asks for that node’s entry for R_d^l
If a member of the neighborhood set M is not responding, the node asks the other members for their M sets, checks the distance to each newly discovered node, and updates its own M set
CAN (Berkeley)
Cartesian space (d-dimensional)
The space wraps around: a d-torus
The space is split incrementally between nodes as they join
The node (cell) responsible for key k is determined by hashing k once per dimension
[Figure: a 2-dimensional CAN (d = 2); insert(k, data) and retrieve(k) are routed to the cell containing the point (hx(k), hy(k)).]
CAN State and Lookup
A node only maintains state for its immediate neighbors (N, S, E, W): 2d neighbors per node
Messages are routed to the neighbor that minimizes the Cartesian distance to the key’s point
More dimensions mean faster routing but also more state
(d/4) N^(1/d) hops on average
Multiple next-hop choices: we can route around failures
[Figure: a 2-d CAN (d = 2); node A’s neighbors lie to its north, south, east, and west, and a message from A toward B is greedily forwarded to whichever neighbor is closest to B.]
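A minimal sketch of the greedy forwarding rule on a d-torus, assuming coordinates normalized to [0, 1) and a simple neighbor bookkeeping (all names and values below are illustrative):

```python
import math

def torus_distance(p, q, size=1.0):
    # Euclidean distance on a d-torus where every dimension wraps at `size`.
    total = 0.0
    for a, b in zip(p, q):
        delta = abs(a - b) % size
        delta = min(delta, size - delta)   # wrap-around: take the shorter way
        total += delta * delta
    return math.sqrt(total)

def next_hop(my_point, neighbors, key_point):
    # Forward to the neighbor whose zone center is closest to the key's point.
    best = min(neighbors, key=lambda n: torus_distance(n["center"], key_point))
    if torus_distance(best["center"], key_point) < torus_distance(my_point, key_point):
        return best        # this neighbor makes progress toward the key
    return None            # no neighbor is closer: the key falls in our own zone

# Example for d = 2, with assumed zone centers around a node at (0.5, 0.5):
neighbors = [{"name": "N", "center": (0.5, 0.75)}, {"name": "S", "center": (0.5, 0.25)},
             {"name": "E", "center": (0.75, 0.5)}, {"name": "W", "center": (0.25, 0.5)}]
print(next_hop((0.5, 0.5), neighbors, (0.9, 0.55))["name"])   # -> "E"
```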
CAN Landmark Routing
CAN nodes do not have a pre-defined ID, so nodes can be placed according to locality
Use a well-known set of m landmark machines (e.g., the root DNS servers)
Each CAN node measures its RTT to each landmark and orders the landmarks by increasing RTT: m! possible orderings
CAN construction:
Place nodes with the same ordering close together in the CAN
To do so, partition the space into m! zones: m zones on x, m-1 on y, etc.
A node interprets its ordering as the coordinates of its zone (see the sketch below)
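One plausible way to turn a landmark ordering into zone coordinates is sketched here; the RTT values and the exact mapping from the ordering to per-dimension indices are assumptions for illustration.

```python
def landmark_ordering(rtts):
    # Order landmark indices by increasing measured RTT.
    return tuple(sorted(range(len(rtts)), key=lambda i: rtts[i]))

def zone_coordinates(ordering):
    # Interpret the ordering as one index per dimension:
    # m choices on the first axis, m-1 on the second, and so on.
    remaining = sorted(ordering)
    coords = []
    for landmark in ordering:
        coords.append(remaining.index(landmark))
        remaining.remove(landmark)
    return tuple(coords[:-1])   # the last choice is forced, so it carries no information

# Example with m = 3 landmarks (made-up RTTs in milliseconds):
rtts = [42.0, 17.5, 88.3]
order = landmark_ordering(rtts)        # (1, 0, 2): landmark 1 is the closest
print(order, zone_coordinates(order))  # -> (1, 0, 2) (1, 0)
```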
CAN and Network Topology
[Figure: the m = 3 landmarks A, B, C split the space into m! = 6 zones (A;B;C, A;C;B, B;A;C, B;C;A, C;A;B, C;B;A). Nodes get a random zone within the zone matching their landmark ordering, so topologically close nodes tend to end up in the same zone.]
Topology-Awareness
Problem:
P2P lookup services generally do not take topology into account
In Chord/CAN/Pastry, neighbors are often not nearby in the underlying network
Goals:
Provide small stretch: route packets to their destination along a path that mimics the router-level shortest-path distance
Stretch: DHT-routing distance / IP-routing distance
Our solution: TOPLUS (TOPology-centric Look-Up Service), an “extremist design” for topology-aware DHTs
TOPLUS Architecture
[Figure: a tree of nested groups, from Tier 0 (the whole IPv4 address space) through Tiers 1, 2, and 3 down to individual IP addresses; a node n sits in one of the leaf groups.]
Use the IPv4 address range (32 bits) for node IDs and key IDs
Group nodes into nested groups using IP prefixes: AS, ISP, LAN (an IP prefix is a contiguous address range of the form w.x.y.z/n)
Assumption: nodes with the same IP prefix are topologically close
Node State
[Figure: the nested groups containing node n form a series of telescoping sets H0 = I (n’s inner group) ⊂ H1 ⊂ H2 ⊂ H3, with sibling sets S1, S2, S3 at each tier.]
Each node n is part of a series of telescoping sets H_i with sibling sets S_i
Node n must know all up nodes in its inner group
Node n must know one delegate node in each tier-i set S ∈ S_i
Routing with the XOR Metric
To look up key k, node n forwards the request to the node in its routing table whose ID j is closest to k according to the XOR metric:
d(j, k) = Σ_{i=0}^{31} |j_i - k_i| · 2^i, where j = j31 j30 … j0 and k = k31 k30 … k0
This is a refinement of longest-prefix matching
Note that the closest ID is unique: d(j, k) = d(j′, k) ⇔ j = j′
Example (8 bits):
k = 10010110
j = 10110110: d(j, k) = 2^5 = 32
j′ = 10001001: d(j′, k) = 2^4 + 2^3 + 2^2 + 2^1 + 2^0 = 31
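Since |j_i - k_i| is 1 exactly where the bits differ, the metric is the integer value of j XOR k, which makes the closest-node choice easy to express. A small illustrative sketch (the candidate list is made up):

```python
def xor_distance(j: int, k: int) -> int:
    # Sum over bit positions i of |j_i - k_i| * 2^i, i.e., the integer XOR.
    return j ^ k

def closest_node(node_ids, k):
    # Routing choice: the known ID with the smallest XOR distance to the key (unique).
    return min(node_ids, key=lambda j: xor_distance(j, k))

k  = 0b10010110
j  = 0b10110110
j2 = 0b10001001
print(xor_distance(j, k), xor_distance(j2, k))   # 32 31
print(bin(closest_node([j, j2], k)))             # 0b10001001: j2 wins, as in the example
```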
Prefix Routing Lookup
[Figure: the tier hierarchy; the lookup for key k is forwarded toward the group whose prefix best matches k.]
Compute the 32-bit key k (using a hash function)
Perform a longest-prefix match against the entries in the routing table, using the XOR metric
Route the message to the node in the inner group with the closest ID (according to the XOR metric)
TOPLUS and Network Topology
Smaller and smaller numerical and topological jumps: the lookup always moves closer to the destination.
Group Maintenance
To join the system, a node n finds its closest node n′
n copies the routing and inner-group tables of n′
n modifies its routing table to satisfy a “diversity” property:
It requires that the delegate nodes of n and n′ are distinct with high probability
This allows a replacement delegate to be found in case of failure
Upon failure, update the inner-group tables
Routing tables are updated lazily
Membership tracking is done within groups (local, small)
On-Demand Caching
Cache data in a group (ISP, campus) with prefix w.x.y.z/r
To look up k, create k′ = k with its first r bits replaced by w.x.y.z/r (the node responsible for k′ acts as the cache for k within the group)
This extends naturally to multiple levels (a cache hierarchy)
[Figure: the tier hierarchy; the lookup for k is first redirected to k′, whose responsible node lies inside the local group.]
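A minimal sketch of the key rewriting for 32-bit keys (the group prefix and the key value below are made up for illustration):

```python
def rewrite_key(k: int, prefix: int, r: int) -> int:
    # Replace the first r bits of the 32-bit key k with the group prefix w.x.y.z/r.
    mask = (1 << (32 - r)) - 1            # keeps the low 32 - r bits of k
    return (prefix & ~mask) | (k & mask)

group_prefix = 0x80FE0000                 # an illustrative /16 campus prefix (128.254.0.0/16)
k = 0x12345678                            # 32-bit key obtained by hashing the object name
k_prime = rewrite_key(k, group_prefix, 16)
print(hex(k_prime))                       # 0x80fe5678: same low bits, the group's high bits
```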
Measuring TOPLUS Stretch
Obtained prefix information from:
BGP tables from Oregon and Michigan Universities
Routing registries from Castify and RIPE
Sample of 1,000 different IP addresses
Point-to-point IP measurements using King
TOPLUS distance: weighted average over all possible paths between source and destination
Weights: the probability of a delegate being in each group
TOPLUS stretch = TOPLUS distance / IP distance
Results
Original tree:
250,562 distinct IP prefixes
Up to 11 levels of nesting
Mean stretch: 1.17
16-bit regrouping (>16 → 16): aggregate small tier-1 groups
Mean stretch: 1.19
8-bit regrouping (>16 → 8)
Mean stretch: 1.28
Original+1: add one level with 256 8-bit prefixes
Mean stretch: 1.9
Artificial 3-tier tree
Mean stretch: 2.32
TOPLUS Summary
Problems:
Non-uniform ID space (requires a bias in the hash to balance the load)
Correlated node failures
Advantages:
Small stretch
IP longest-prefix matching allows fast forwarding
On-demand P2P caching is straightforward to implement
Can be easily deployed in a “static” environment (e.g., a multi-site corporate network)
Can be used as a benchmark to measure the speed of other P2P services
Other Issues: Hierarchical DHTs
The Internet is organized as a hierarchy; should DHT designs be flat?
Hierarchical DHTs: multiple overlays managed by possibly different DHTs (Chord, CAN, etc.)
First, locate the group responsible for the key in the top-level DHT
Then, find the peer in the next-level overlay, and so on
By designating the most reliable peers as super-nodes (members of multiple overlays), the number of hops can be significantly decreased
How can we deploy and maintain such architectures?
Hierarchical DHTs: Example
[Figure: a top-level Chord overlay connecting super-nodes s1, s2, s3, s4, each of which also belongs to a lower-level group, e.g., a CAN group or a Chord group.]
Other Issues: DHT Querying
DHTs allow us to locate data very quickly…
Lookup(“Beatles/Help”) → IP address
…but this only works for perfect matches
Users tend to submit broad queries:
Lookup(“Beatles/*”) → IP address
Queries may be inaccurate:
Lookup(“Beattles/Help”) → IP address
Idea: index data using partial queries as keys
Another approach: fuzzy matching (UCSB)
Some Other Issues
Better handling of failures:
In particular, Byzantine failures: a single corrupted node may compromise the system
Reasoning about the dynamics of the system:
A large system may never achieve a quiescent ideal state
Dealing with untrusted participants:
Data authentication, integrity of routing tables, anonymity and censorship resistance, reputation
Traffic-awareness and load balancing
Conclusion
The DHT is a simple yet powerful abstraction:
It is the building block of many distributed services (file systems, application-layer multicast, distributed caches, etc.)
There are many DHT designs, with various pros and cons:
A balance between state (degree), speed of lookup (diameter), and ease of management
The system must support rapid changes in membership:
Dealing with joins/leaves/failures is not trivial
The dynamics of a P2P network are difficult to analyze
Many open issues are worth exploring