
  • Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

    Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony I.T. Rowstron

    Presented by Yu Feng and Elizabeth Lynch

  • Introduction

    Application-level multicast
    Goals:
      Scalability
      Failure tolerance
      Low delay
      Effective use of network resources

  • Pastry

    P2P location and routing substrate
    Provides:
      Scalability
        Large numbers of groups
        Large numbers of multicast sources
        Large numbers of members per group
      Self-organization
      Peer-to-peer location and routing
      Good locality properties

  • Scribe

    Application-level multicast infrastructure
    Built on top of Pastry
    Takes advantage of Pastry properties:
      Robustness
      Self-organization
      Locality
      Reliability

  • nodeId

    Each node is assigned a 128-bit nodeId
    nodeIds are uniformly distributed
    Each node maintains tables that map nodeIds to IP addresses
      (2^b - 1) * ceil(log_{2^b} N) + l entries
      O(log_{2^b} N) messages required to restore the tables after a node arrival or failure
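As a rough illustration of the table-size bound above, the following sketch evaluates (2^b - 1) * ceil(log_{2^b} N) + l. The function name and argument names are hypothetical; the defaults b = 4 and l = 16 are the ones quoted in the speaker notes.

```python
def pastry_table_entries(n_nodes: int, b: int = 4, leaf_set_size: int = 16) -> int:
    """Evaluate (2^b - 1) * ceil(log_{2^b} N) + l using integer arithmetic only."""
    base = 2 ** b
    # Smallest 'rows' with base**rows >= n_nodes, i.e. ceil(log_base(n_nodes)).
    rows = 0
    while base ** rows < n_nodes:
        rows += 1
    return (base - 1) * rows + leaf_set_size


# N = 100,000 is the network size used in the experiments later in the talk.
print(pastry_table_entries(100_000))  # 15 * 5 + 16 = 91
```

With the default parameters, a 100,000-node network therefore needs at most 91 entries per node.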

  • Routing Guarantees

    Given a message and a key, Pastry routes the message to the live node whose nodeId is numerically closest to the key
    In a network of N nodes, the average number of steps in a route to any node is less than ceil(log_{2^b} N)
    Delivery is guaranteed unless l/2 or more nodes with adjacent nodeIds fail simultaneously

  • Routing Tables

    nodeIds and keys are treated as sequences of digits in base 2^b
    Each node's routing table has ceil(log_{2^b} N) rows and 2^b - 1 entries per row
    Each entry in row n refers to a node whose nodeId matches the present node's nodeId in the first n digits but whose (n+1)th digit has one of the 2^b - 1 other possible values
    Among the candidate nodes for an entry, the one closest to the present node according to the proximity metric is chosen

  • Leaf Sets

    The leaf set holds the l/2 numerically closest larger and the l/2 numerically closest smaller nodeIds relative to the present node's nodeId
    Each node maintains IP addresses for the nodes in its leaf set
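A minimal sketch of how a leaf set could be selected from a list of known live nodeIds. It treats nodeIds as plain integers and, as a simplifying assumption, ignores the wrap-around of the circular nodeId space; the function name is hypothetical.

```python
from typing import List


def leaf_set(node_id: int, live_ids: List[int], l: int = 16) -> List[int]:
    """Return up to l/2 closest smaller and l/2 closest larger nodeIds.

    Assumption: no wrap-around at the ends of the id space.
    """
    half = l // 2
    if half == 0:
        return []
    smaller = sorted(i for i in live_ids if i < node_id)
    larger = sorted(i for i in live_ids if i > node_id)
    # The closest smaller ids are at the end of 'smaller',
    # the closest larger ids at the start of 'larger'.
    return smaller[-half:] + larger[:half]


print(leaf_set(50, [10, 20, 30, 40, 60, 70, 80, 90], l=4))  # [30, 40, 60, 70]
```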

  • Routing algorithm

    The current node forwards the message to a node whose nodeId shares a prefix with the key that is at least one digit (b bits) longer than the prefix shared by the current node's nodeId
    If no such node is known, it forwards to a node whose nodeId shares a prefix of the same length but is numerically closer to the key
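The two forwarding rules can be sketched as follows. This is an illustration only: nodeIds are short hexadecimal strings (so b = 4), the node's routing state is flattened into one list of known nodeIds, and ties are broken by numerical distance to the key, which is an assumption made here (Pastry itself populates routing table entries using the proximity metric). The function names are hypothetical.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits two digit strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, key: str, known: List[str]) -> Optional[str]:
    """Pick the next hop for key, or None if no known node improves on current."""
    def distance(node_id: str) -> int:
        return abs(int(node_id, 16) - int(key, 16))

    here = shared_prefix_len(current, key)
    # Rule 1: prefer a node sharing a strictly longer prefix with the key.
    longer = [n for n in known if shared_prefix_len(n, key) > here]
    if longer:
        return min(longer, key=distance)
    # Rule 2: same prefix length, but numerically closer to the key.
    closer = [n for n in known
              if shared_prefix_len(n, key) == here and distance(n) < distance(current)]
    return min(closer, key=distance) if closer else None


print(next_hop("65a1", "d46a", ["d13d", "65b0", "d4c2"]))  # d4c2
```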

  • Locality

    Proximity metric
    Locality properties relevant to Scribe:
      Short routes
        According to simulations: the distance traveled is 1.59 to 2.2 times the direct distance between the source and destination
      Route convergence
        According to simulations: the average distance traveled by each of two messages sent to the same key, before their routes converge, is approximately equal to the distance between the two source nodes

  • Node Addition

    New node X picks a nodeId
    X contacts a nearby node A
    A routes a special message with X as the key
    The message is routed to the node Z whose nodeId is numerically closest to X
    If X == Z, X must choose a new nodeId
    X obtains its leaf set from Z
    X obtains the ith row of its routing table from the ith node traversed from A to Z
    X notifies the appropriate nodes that it is now alive

  • Node Failure

    Neighboring nodes in nodeId space periodically exchange keep-alive messages
    If a node is silent for a period of time T, it is presumed failed
    All members of the failed node's leaf set are notified; they remove the failed node from their leaf sets and update them

  • Node Recovery

    A recovering node:
      Contacts all the nodes in its last known leaf set
      Obtains their leaf sets
      Updates its own leaf set
      Notifies the members of its new leaf set

  • Pastry API

    nodeId = pastryInit(Credentials)
      Causes the local node to join an existing Pastry network or start a new one
    route(msg, key)
      Routes msg to the node with nodeId numerically closest to key
    send(msg, IP-addr)
      Sends msg to the node at IP-addr

  • Required Pastry Functions

    deliver(msg, key)
      Called when msg is received and the local node's nodeId is numerically closest to key among all live nodes
      Also called when msg is received that was transmitted via send() to the IP address of the local node
    forward(msg, key, nextId)
      Called just before msg is forwarded to the node with nodeId = nextId
      The application can change the msg content or the nextId value
      If nextId is set to NULL, msg terminates at the local node
    newLeafs(leafSet)
      Called whenever there is a change in the leaf set

  • Scribe Overview

    Multicast application framework built on top of Pastry
    Any Scribe node may create a group
    Other nodes can join the group and multicast to all members of that group
    Delivery is best effort; ordered delivery is not guaranteed

  • How?

    A group is formed by building a multicast tree: the Pastry routes from each group member to a rendezvous point (the root of the tree) are joined together
    Multicast messages are sent to the rendezvous point for distribution
    Pastry and Scribe are fully decentralized
      Decisions are based on local information
      Provides reliability and scalability

  • Multicast Tree

    Scribe creates a multicast tree rooted at the rendezvous point
    Scribe nodes that are part of a multicast tree are called forwarders; they may or may not be members of the group
    Each forwarder maintains a children table
    The table has an entry (IP address and nodeId) for each of the forwarder's children in the multicast tree

  • Scribe API

    create(credentials, groupId)
      Creates a new group, using the credentials to control future access
    join(credentials, groupId, messageHandler)
      Joins the group with the specified groupId
    leave(credentials, groupId)
      Leaves the group with the specified groupId
    multicast(credentials, groupId, message)
      Multicasts the specified message to the group with the specified groupId

  • Scribe Implementation: Creating a Group

    A Scribe node asks Pastry to route a CREATE message using the groupId as the key [e.g., route(CREATE, groupId)]
    Pastry delivers the CREATE message to the node whose nodeId is numerically closest to the groupId
    Scribe's deliver method is invoked on that node; it checks the credentials to ensure the group can be created and adds the new groupId to the list of groups it already knows
    This node becomes the rendezvous point for the newly created group

  • Scribe Implementation: Joining a Group

    The joining node asks Pastry to route a JOIN message with the groupId as the key [e.g., route(JOIN, groupId)]; the message is routed towards the rendezvous point
    At each node along the route, Pastry invokes Scribe's forward method, which checks whether the node is a forwarder for the group
      If it is already a forwarder for the group, it adds the source node as a child
      If it is not yet a forwarder for the group, it creates a children table for the group and adds the source node as a child; it then routes its own JOIN message with the groupId as the key [e.g., route(JOIN, groupId)]
      Finally, it terminates the JOIN message it received from the source
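The per-node JOIN handling described above might look roughly like this. The class and method names are hypothetical, the children table stores only nodeIds (the slides also keep IP addresses), and credential checks and the special case of the rendezvous point itself are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ScribeNode:
    node_id: str
    # groupId -> children table (here just a set of child nodeIds)
    children: Dict[str, Set[str]] = field(default_factory=dict)

    def forward_join(self, group_id: str, joiner_id: str) -> bool:
        """Handle a JOIN passing through this node.

        Returns True if this node was not yet a forwarder and must now
        route its own JOIN towards the rendezvous point. The incoming
        JOIN is terminated here in either case.
        """
        was_forwarder = group_id in self.children
        self.children.setdefault(group_id, set()).add(joiner_id)
        return not was_forwarder


node = ScribeNode("1100")
print(node.forward_join("g1", "0111"))  # True: first child, node joins the tree itself
print(node.forward_join("g1", "0100"))  # False: already a forwarder, JOIN stops here
```

Because a JOIN stops at the first existing forwarder, each new member adds state only along the part of its Pastry route that is not already in the tree.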

  • Scribe Implementation: Leaving a Group

    1. The leaving node records locally that it has left the group
    2. If there are no more entries in its children table, it sends a LEAVE message to its parent node
    3. The parent removes the child and repeats step 2; this continues up the tree until a node is reached whose children table is still non-empty after the departing child is removed
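A small sketch of how LEAVE messages could propagate up the tree. The tree is modelled with plain dictionaries, all names are hypothetical, and one assumption goes beyond the slide: a forwarder that is itself still a group member stays in the tree even when its children table becomes empty.

```python
from typing import Dict, List, Set


def leave(node: str,
          parent: Dict[str, str],
          children: Dict[str, Set[str]],
          members: Set[str]) -> List[str]:
    """Remove node from the group; return the nodes that send LEAVE, in order."""
    members.discard(node)  # step 1: record locally that the node left
    senders = []
    # Steps 2 and 3: prune upwards while the node has no children left.
    # Assumption: a node that is still a member does not prune itself.
    while not children.get(node) and node not in members and node in parent:
        up = parent.pop(node)
        children[up].discard(node)
        senders.append(node)
        node = up
    return senders


parent = {"c": "b", "b": "a", "d": "a"}
children = {"a": {"b", "d"}, "b": {"c"}}
members = {"c", "d"}
print(leave("c", parent, children, members))  # ['c', 'b']
```

In the example, b was only a forwarder for c, so it is pruned as well; a keeps d in its children table and stays in the tree.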

  • Multicast a Message

    The source locates the rendezvous point for the group [e.g., route(MULTICAST, groupId)] and asks it to return its IP address
    The source caches the IP address and uses it for future multicasts; if the rendezvous point changes or fails, the source uses Pastry again to find the new rendezvous point
    All multicast messages are disseminated from the rendezvous point

  • Scribe Implementation

  • Reliability of Scribe: Repairing the Tree

    Periodically, each non-leaf node sends a heartbeat message to all of its children
    When a child does not receive a heartbeat for a certain period of time, it sends a JOIN message with the group's identifier
    Pastry routes the message to a new parent, thus repairing the multicast tree

  • Reliability of Scribe: Failure of Rendezvous Point

    The state of the rendezvous point is replicated across the k nodes closest to the root node (a typical value of k is 5)
    These k nodes are all children of the root node
    When the root node fails, its immediate children detect the failure and join again through Pastry
    Pastry routes the new JOIN messages to a new root (the live node with the nodeId numerically closest to the groupId), which takes over the role of the rendezvous point
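The failover rule above amounts to re-evaluating "live node numerically closest to the groupId" once the old root is gone. A minimal sketch, assuming integer ids and ignoring wrap-around in the nodeId space (the function name is hypothetical):

```python
from typing import Iterable


def rendezvous_point(group_id: int, live_ids: Iterable[int]) -> int:
    """Live node whose nodeId is numerically closest to the groupId."""
    return min(live_ids, key=lambda n: abs(n - group_id))


live = {100, 205, 260}
root = rendezvous_point(200, live)
print(root)  # 205

live.discard(root)  # the root fails
print(rendezvous_point(200, live))  # 260 takes over as rendezvous point
```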

  • Reliability of Scribe

    Children table entries are discarded unless the child node sends an explicit message stating that it wants to remain in the table
    The tree repair mechanism scales well:
      Fault detection is done by sending messages to a small number of nodes
      Recovery from faults is local; only a small number of nodes, O(log_{2^b} N), is involved

  • Scribe - Providing Additional Guarantees

    Scribe provides reliable, ordered delivery of multicast messages only if the TCP connections do not fail
    Scribe provides a simple mechanism that allows applications to implement stronger reliability guarantees:
      forwardHandler(msg): invoked by Scribe before the node forwards a multicast message to its children
      joinHandler(msg): invoked by Scribe after a new child is added to one of the node's children tables
      faultHandler(msg): invoked by Scribe when a node suspects that its parent is faulty

  • Additional Reliability Example

    forwardHandler
      The root assigns a sequence number to each message
      Multicast messages are buffered by the root and by each node in the multicast tree
      Messages are retransmitted after the multicast tree is repaired
    faultHandler
      Adds the last sequence number n delivered by the node to the JOIN message that is sent out to repair the tree
    joinHandler
      Retransmits buffered messages with sequence numbers above n to the new child
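The three handlers in this example could be sketched as below. The class name, the dictionary-based JOIN message, and the "last_seq" field are assumptions for illustration; buffer eviction and the root's assignment of sequence numbers are omitted.

```python
from typing import Dict, List


class ReliableForwarder:
    """Sketch of per-node state for sequence-numbered retransmission."""

    def __init__(self) -> None:
        self.buffer: Dict[int, str] = {}  # sequence number -> message
        self.last_delivered = 0

    def forward_handler(self, seq: int, msg: str) -> None:
        # Buffer every message before it is forwarded to the children.
        self.buffer[seq] = msg
        self.last_delivered = max(self.last_delivered, seq)

    def fault_handler(self, join_msg: dict) -> dict:
        # Attach the last delivered sequence number to the repair JOIN.
        return {**join_msg, "last_seq": self.last_delivered}

    def join_handler(self, join_msg: dict) -> List[str]:
        # Retransmit everything the new child has not seen yet.
        n = join_msg.get("last_seq", 0)
        return [self.buffer[s] for s in sorted(self.buffer) if s > n]


new_parent = ReliableForwarder()
for seq in (1, 2, 3):
    new_parent.forward_handler(seq, f"m{seq}")

child = ReliableForwarder()
child.forward_handler(1, "m1")  # child saw only m1 before its parent failed
join = child.fault_handler({"type": "JOIN", "group": "g1"})
print(new_parent.join_handler(join))  # ['m2', 'm3']
```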

  • Experimental Setup

    Randomly generated network topology with 5050 routers
    Scribe was run on 100,000 end nodes, randomly assigned to routers with uniform distribution
    Ten different topologies were generated using different random seeds
    Results are averaged over all ten topologies
    Experimented with a wide range of group sizes and a large number of groups
    Size of the group with rank r: gsize(r) = floor(N * r^(-1.25) + 0.5)
    Group membership was selected randomly with uniform distribution
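The group-size formula can be checked directly. The sketch below evaluates gsize(r) for the largest group (rank 1) and the smallest (rank 1500), using N = 100,000 from the setup; the function name follows the slide, and the default argument is a convenience added here.

```python
import math


def gsize(rank: int, n_nodes: int = 100_000) -> int:
    """Group size for a given rank: floor(N * r^(-1.25) + 0.5)."""
    return math.floor(n_nodes * rank ** -1.25 + 0.5)


print(gsize(1))     # 100000: the largest group contains every node
print(gsize(1500))  # 11: the smallest of the 1500 groups
```

These two values match the maximum and minimum group sizes given in the speaker notes.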

  • Delay Penalty

    Compare delay between Scribe multicast and IP multicast
    Measure the distribution of delay to deliver a message to each member of a group
    Two metrics:
      RMD: 50% of groups less than 1.69; max = 4.26
      RAD: 50% of groups less than 1.68; max = 2
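For reference, RMD is the ratio of the maximum delays and RAD the ratio of the average delays between Scribe and IP multicast. A sketch of how the two ratios could be computed for one group from per-member delays follows; the function name and the sample delays are made up for illustration and are not simulation data.

```python
from typing import Sequence, Tuple


def delay_ratios(scribe_delays: Sequence[float],
                 ip_delays: Sequence[float]) -> Tuple[float, float]:
    """Return (RMD, RAD) for one group; delays are per member, in the same unit."""
    rmd = max(scribe_delays) / max(ip_delays)
    rad = (sum(scribe_delays) / len(scribe_delays)) / (sum(ip_delays) / len(ip_delays))
    return rmd, rad


# Illustrative delays only.
rmd, rad = delay_ratios([2.0, 4.0, 6.0], [1.0, 2.0, 4.0])
print(f"RMD={rmd:.2f} RAD={rad:.2f}")
```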

  • Node Stress

    Stress is imposed by maintaining groups and by handling forwarded and duplicated packets at the end nodes instead of on the routers
    Measured as the number of groups with non-empty children tables and the number of entries in children tables
    In our simulation with 1500 groups:
      Non-empty children tables per node: avg = 2.4, max = 40
      Children table entries per node: avg = 6.2, max = 1059

  • Link Stress Experiment

    Link stress was computed by counting the number of packets sent over each link when a message is sent to each of the 1500 groups
    Total number of links: 1,035,295
    Total number of messages for Scribe: 2,489,824
    Total number of messages for IP multicast: 758,853
    Mean number of messages per link:
      2.4 for Scribe
      0.7 for IP multicast
    Maximum link stress:
      4031 for Scribe
      950 for IP multicast
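The mean values follow from dividing the message totals by the number of links. A quick check using the totals quoted above (the helper name is hypothetical):

```python
def mean_link_stress(total_messages: int, total_links: int) -> float:
    """Average number of messages carried per link."""
    return total_messages / total_links


LINKS = 1_035_295
print(round(mean_link_stress(2_489_824, LINKS), 1))  # 2.4 for Scribe
print(round(mean_link_stress(758_853, LINKS), 1))    # 0.7 for IP multicast
```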

  • Bottleneck Remover

    When a node detects that it is overloaded, it selects the group that consumes the most resources and chooses the child in this group that is farthest away
    The parent then drops the child by sending it a message containing the children table for the group, along with the delay between each child and the parent
    When the child receives the message, it does the following:
      It measures the delay between itself and each other child in the children table it received
      It then computes the delay between itself and the parent via each of those nodes
      Finally, it sends a JOIN message to the node that provides the least combined delay
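The dropped child's choice of a new parent reduces to minimizing a sum of two delays. A sketch under the assumption that both sets of delays are available as dictionaries keyed by sibling nodeId (all names are hypothetical):

```python
from typing import Dict


def choose_new_parent(delay_to_parent: Dict[str, float],
                      measured_delay: Dict[str, float]) -> str:
    """Pick the sibling offering the least combined delay to the old parent.

    delay_to_parent: sibling -> delay between that sibling and the parent
                     (taken from the drop message).
    measured_delay:  sibling -> delay measured by the dropped child to that sibling.
    """
    return min(measured_delay,
               key=lambda s: measured_delay[s] + delay_to_parent[s])


# Sibling "a" is close to the parent but far from the dropped child; "b" is the reverse.
print(choose_new_parent({"a": 10, "b": 30}, {"a": 25, "b": 2}))  # b (32 < 35)
```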

  • Bottleneck Remover Results

    The bottleneck remover introduces the potential for routing loops
      When a loop is detected, the node sends another JOIN message to generate a new random route
    It limits the number of entries in a node's children tables at the cost of increased link stress during joins
    Average link stress increases from 2.4 to 2.7, and maximum link stress increases from 4031 to 4728

  • Scalability with Many Small Groups

    50,000 Scribe nodes
    30,000 Scribe groups with 11 nodes per group
    Average number of children table entries per node is 21.2, compared with only 6.6 for plain (naïve) multicast
    Average link stress:
      6.1 for Scribe
      1.6 for IP multicast
      2.9 for naïve multicast
    Scribe's values are higher because it creates trees with long paths and no branching

  • Conclusion

    Scribe is a fully decentralized, large-scale application-level multicast infrastructure built on top of Pastry
    It is designed to scale to a large number of groups and large group sizes, and it supports multiple multicast sources per group
    Scribe and Pastry's randomized placement of nodes, groups, and multicast roots balances the load and the multicast tree
    Scribe uses a best-effort delivery scheme but can be extended to provide stricter delivery guarantees
    Experimental results show that Scribe can efficiently support a large number of nodes and groups and a wide range of group sizes, compared to IP multicast

    Overlay network in the Internet
    A host only needs an Internet connection, the Pastry software, and proper credentials to join a Pastry network
    N = total number of nodes in the Pastry network
    b = configuration parameter with default value 4
    l = even integer parameter with default value 16 (usually 2^b)
    Proximity metric: scalar value that reflects the distance between two nodes, e.g. RTT
    Locality properties are defined with respect to the proximity metric
    Short routes: measured by the total distance traveled by messages along Pastry routes
    Route convergence: measured by the distance traveled by two messages sent to the same key before their routes converge
    Updating is trivial since adjacent nodes have overlapping leaf sets
    Initialize all relevant state
    Credentials are provided by the application and contain information to authenticate the local node and securely join the Pastry network
    Returns the local node's nodeId
    Leaf set changes for node additions, failures, or recoveries
    Custom packet-level discrete-event simulator
    Models link delay but not queuing delay or packet loss, for scalability
    No cross-traffic modeled
    Routers did not run Scribe
    Each end node is directly linked to its assigned router
    Number of groups = 1500 (the maximum we were able to simulate)
    N = 100,000 (the maximum we were able to simulate)
    Exponent 1.25 was chosen to give a minimum group size of 11, typical of instant messaging
    Maximum group size (rank = 1) = 100,000

    RMD: ratio of maximum delays between Scribe and IP multicast
    RAD: ratio of average delays
    In Scribe this is handled at the end nodes instead of at the routers as in IP multicast
    Scribe distributes the forwarding load well over all nodes
    Achieves good scalability