Partition-Tolerant Distributed Publish/Subscribe Systems
Reza Sherafat Kazemzadeh * Hans-Arno Jacobsen
University of Toronto
IEEE SRDS, October 6, 2011



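msrg.org, SRDS 2011

Content-Based Publish/Subscribe

Stock quote dissemination

[Figure: a Pub/Sub broker network connecting publishers (P) and subscribers (S) in London, Toronto, and NY. Trader 1 and Trader 2 are subscribers; example subscriptions are sub = [STOCK=IBM] and sub = [CHANGE>-8%].]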
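A minimal sketch of content-based matching, using the two subscriptions shown on the slide. The predicate encoding as (attribute, operator, value) triples, the `matches` function, and the sample quote are illustrative assumptions, not the system's actual data model.

```python
import operator

# Comparison operators allowed in a predicate (an assumed, minimal set).
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def matches(publication, predicates):
    """Return True when every (attribute, op, value) predicate holds."""
    return all(
        attr in publication and OPS[op](publication[attr], value)
        for attr, op, value in predicates
    )

# The two example subscriptions from the slide.
sub_stock = [("STOCK", "=", "IBM")]
sub_change = [("CHANGE", ">", -8.0)]

# Hypothetical stock quote; CHANGE is a percentage.
quote = {"STOCK": "IBM", "CHANGE": -2.5}
print(matches(quote, sub_stock), matches(quote, sub_change))
```

A publication is routed only toward subscribers whose predicates it satisfies, which is what makes delivery selective.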

Goals

Fault-tolerance (against concurrent failures):
  Broker crashes
  Link failures
  Recoveries

Reliability:
  Publications match subscriptions
  Per-source in-order delivery
  After some point in time
  Exactly-once delivery (no loss, no duplicates)

Assumptions:
  Clients are light-weight (broker network is responsible for reliability)
  A time t after which the system provides guaranteed delivery

[Figure: Pub, P/S, Sub, Sub; reliable delivery]

System Architecture

Tree dissemination networks: one path from source to destination
  Pros:
    Simple, loop-free
    Preserves publication order (difficult for non-tree content-based P/S)
  Cons:
    Trees are highly susceptible to failures

Primary tree: initial spanning tree that is formed as brokers join the system
  Maintain neighborhood knowledge
  Allows brokers to reconfigure the overlay after failures on the fly

∆-neighborhood knowledge: ∆ is a configuration parameter that ensures handling of ∆-1 concurrent failures (worst case)
  Knowledge of other brokers within distance ∆ (join algorithm)
  Knowledge of routing paths within the neighborhood (subscription propagation algorithm)

[Figure: a broker's 1-neighborhood, 2-neighborhood, and 3-neighborhood]
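The notion of a ∆-neighborhood can be illustrated with a bounded breadth-first search over the primary tree. This is only a sketch: the adjacency-map representation, the function name `neighborhood`, and the six-broker chain are assumptions made for the example.

```python
from collections import deque

def neighborhood(tree, broker, delta):
    """Map each broker within `delta` hops of `broker` to its hop distance."""
    dist = {broker: 0}
    queue = deque([broker])
    while queue:
        node = queue.popleft()
        if dist[node] == delta:
            continue  # do not expand past the neighborhood boundary
        for nxt in tree[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[broker]  # a broker is not its own neighbor
    return dist

# Hypothetical primary tree: the chain A-B-C-D-E-F.
tree = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D", "F"], "F": ["E"],
}
print(sorted(neighborhood(tree, "C", 2)))
```

Larger ∆ gives each broker more candidates to reconnect to after a failure, at the cost of more state per broker.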

Overview of the Approach

[Figure: components of the approach: Overlay Management, Subscription Propagation, Publication Forwarding, Broker/Link Recovery. Label: Single chain.]

Overlay Management Algorithm
Maintains end-to-end connectivity despite failures in the overlay.

Overlay Partitions

When the primary tree is set up, brokers communicate with their immediate neighbors in the primary tree through FIFO links.

Overlay partitions: broker crashes or link failures create “partitions”, and some neighbor brokers “on the partition” become unreachable from neighboring brokers.

Active connections: at each point in time, a broker tries to maintain a connection to its closest reachable neighbor in the primary tree. Only active connections are used by brokers.

[Figure: chain A-B-C-D-E-F with S and P. D has failed; pid1=<C, {D}>. Labels: partition detector, brokers on the partition, brokers beyond the partition, active connection to E.]

Overlay Partitions – 2 Adjacent Failures

What if there are more failures, particularly adjacent failures?

If ∆ is large enough, the same process can be used for larger partitions.

[Figure: chain A-B-C-D-E-F with S and P. D and E have failed; pid1=<C, {D}> plus pid2=<C, {D, E}>. Labels: brokers on the partition, brokers beyond the partition, active connection to F.]

Overlay Partitions - ∆ Adjacent Failures

Worst-case scenario: ∆-neighborhood knowledge is not sufficient to reconnect the overlay.

Brokers “on” and “beyond” the partition are unreachable.

No new active connection.

[Figure: chain A-B-C-D-E-F with S and P. D, E, and F have failed; pid1=<C, {D}>, pid2=<C, {D, E}>, plus pid3=<C, {D, E, F}>. Labels: brokers on the partition, brokers beyond the partition.]
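The three scenarios above (one, two, and ∆ adjacent failures) can be sketched as a single selection rule: walk outward along the primary tree path and connect to the first reachable broker within ∆ hops. The function name, the tuple encoding of a pid, and the choice ∆ = 3 for the example are assumptions for illustration.

```python
def active_connection(detector, path, delta, reachable):
    """Pick the closest reachable broker on `path` (nearest first) within delta hops.

    Returns (neighbor_or_None, pid), where pid = (detector, skipped brokers)
    or None when nothing had to be bypassed.
    """
    skipped = []
    for broker in path[:delta]:  # knowledge is limited to the delta-neighborhood
        if reachable(broker):
            pid = (detector, frozenset(skipped)) if skipped else None
            return broker, pid
        skipped.append(broker)
    # Every known broker in this direction is unreachable: no new active connection.
    return None, (detector, frozenset(skipped))

# Broker C looking toward F on the chain A-B-C-D-E-F, assuming delta = 3.
path_from_c = ["D", "E", "F"]
for failed in ({"D"}, {"D", "E"}, {"D", "E", "F"}):
    neighbor, pid = active_connection("C", path_from_c, 3, lambda b: b not in failed)
    print(sorted(failed), "->", neighbor)
```

With ∆ adjacent failures the loop exhausts the known neighborhood, which corresponds to the worst case where the overlay cannot be reconnected.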

Subscription Propagation Algorithm
How are correct routing tables maintained despite overlay partitions?

Subscription Propagation Algorithm

Establishes end-to-end routing state among brokers while taking overlay partitions into account.

Subscriptions are dynamically inserted by subscribers and are propagated along branches of the primary tree over active connections.
  The primary tree is the “basis” for constructing end-to-end forwarding paths.

Each subscription contains:

SUB = <Id, Predicates, Anchor>

Predicates specify the subscriber’s interest, e.g., [STOCK=“IBM”]

Anchor is a reference to brokers along the propagation path of the subscription

Subscription Propagation in Absence of Overlay Partitions

The subscription’s anchor field is updated to point to a broker up to ∆ hops closer to the subscriber.

Accepting a subscription means adding it to the routing tables.
  A subscription is accepted (i.e., will be used for matching) only after confirmations are received.
  Observation: matching publications are delivered to a subscriber once its local broker accepts the subscription.

[Figure: chain A-B-C-D-E with P and S. Subscription s is propagated hop by hop, with s.anchor pointing up to ∆ hops back; confirmations (conf) flow in the opposite direction and brokers accept s (☑).]

Subscription Propagation in Presence of Overlay Partitions

Broker B sends s via its active link to bypass the partition and awaits receipt of the corresponding confirmation.

Once B receives the confirmation and accepts s, it tags the confirmation with the pid of the partitions that s bypassed.

Brokers relay this tag in their confirmation messages towards the subscriber’s local broker, which accepts s and stores it along with the tag in its routing table.

[Figure: chain A-B-C-D-E with P and S. s bypasses the partition over B's active link. Legend: ☑* = pid tag is also stored along with s; conf* = conf tagged with pid.]
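A rough sketch of how pid tags might accumulate on confirmations as they travel back toward the subscriber's local broker. The `Broker` class, the `confirm_back` helper, the tuple encoding of a pid, and the example layout (subscriber at A, failed broker C bypassed by B) are all assumptions for illustration; the slides do not specify message formats.

```python
from dataclasses import dataclass, field

@dataclass
class Broker:
    name: str
    # Routing table: subscription id -> pid tags stored along with it.
    routing_table: dict = field(default_factory=dict)

    def accept(self, sub_id, tags):
        self.routing_table[sub_id] = frozenset(tags)

def confirm_back(brokers, sub_id, bypassed):
    """Relay a confirmation along `brokers` (farthest from the subscriber first).

    bypassed[i] lists the pids of partitions that brokers[i] bypassed when it
    forwarded the subscription; tags only accumulate on the way back.
    """
    tags = set()
    for broker, pids in zip(brokers, bypassed):
        tags |= set(pids)
        broker.accept(sub_id, tags)
    return frozenset(tags)

# Assumed layout: s travels A -> B -> (C bypassed) -> D -> E; confs return E -> A.
a, b, d, e = (Broker(n) for n in "ABDE")
pid = ("B", frozenset({"C"}))
confirm_back([e, d, b, a], "s", [[], [], [pid], []])
print(a.routing_table["s"] == frozenset({pid}))
```

Brokers past the bypass (D and E here) store s without tags, while B and the subscriber's local broker A store it with the pid.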

Publication Forwarding Algorithm
How are accepted subscriptions and their partition tags used to achieve reliable delivery?

Publication Forwarding in Absence of Overlay Partitions

Forwarding only uses subscriptions accepted by brokers.

Steps in forwarding of publication p:
  Identify the anchors of accepted subscriptions that match p
  Determine active connections towards the matching subscriptions’ anchors
  Send p on those active connections and wait for confirmations
  If there are local matching subscribers, deliver to them
  If no downstream matching subscriber exists, issue a confirmation towards P
  Once confirmations arrive, discard p and send a conf towards P

[Figure: chain A-B-C-D-E with P and S. p is forwarded hop by hop across brokers that have accepted the subscription (☑) and delivered to local subscribers; confs flow back towards P.]
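The forwarding steps above can be sketched as a recursive function over a small in-memory network. The dictionary layout (`accepted`, `active_toward`), the use of `None` as the anchor of a locally attached subscriber, and the synchronous recursion standing in for asynchronous confirmations are simplifying assumptions.

```python
def forward(broker, pub, came_from, net, delivered):
    """Forward `pub` at `broker`; return True once it can confirm upstream."""
    state = net[broker]
    next_hops = set()
    for sub in state["accepted"]:  # only accepted subscriptions are used
        if not sub["match"](pub):
            continue
        if sub["anchor"] is None:  # subscriber attached to this broker
            delivered.append((broker, sub["id"]))
        else:
            hop = state["active_toward"][sub["anchor"]]
            if hop != came_from:
                next_hops.add(hop)
    # Send on each active connection and collect the downstream confirmations.
    confirmations = [forward(h, pub, broker, net, delivered) for h in sorted(next_hops)]
    # all([]) is True: with no downstream match the broker confirms immediately.
    return all(confirmations)

is_ibm = lambda pub: pub.get("STOCK") == "IBM"

def sub(anchor):
    return {"id": "s", "match": is_ibm, "anchor": anchor}

# Hypothetical chain A-B-C with the publisher at A and the subscriber at C.
net = {
    "A": {"accepted": [sub("C")], "active_toward": {"C": "B"}},
    "B": {"accepted": [sub("C")], "active_toward": {"C": "C"}},
    "C": {"accepted": [sub(None)], "active_toward": {}},
}
delivered = []
confirmed = forward("A", {"STOCK": "IBM"}, None, net, delivered)
print(delivered, confirmed)
```

In a real deployment each broker would hold p until the confirmations arrive and only then discard it; here the return value of the recursive call plays that role.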

Publication Forwarding in Presence of Overlay Partitions

Key forwarding invariant to ensure reliability: no stream of publications is delivered to a subscriber after being forwarded by brokers that have not accepted its subscription.

Case 1: Sub s has been accepted with no pid. It is safe to bypass intermediate brokers.

[Figure: chain A-B-C-D-E with P and S. All brokers have accepted s (☑); p bypasses intermediate brokers, is delivered to local subscribers, and confs flow back towards P.]

Publication Forwarding (cont’d)

Case 2: Sub s has been accepted with some pid.

Case 2a: The publisher’s local broker has accepted s, and we ensure all intermediate forwarding brokers have also done so:

It is safe to deliver publications from sources beyond the partition.

[Figure: chain A-B-C-D-E with P and S; s is stored with a pid tag (☑*) at the subscriber's side. Note on the bypassing link: depending on when this link has been established, either recovery or subscription propagation ensures that C accepts s prior to receiving p.]

Publication Forwarding (cont’d)

Case 2: Subscription s is accepted with some pid tags.

Case 2b: The publisher’s broker has not accepted s:

It is unsafe to deliver publications from this publisher (invariant).

[Figure: chain A-B-C-D-E with P and S. p is tagged with pid (p*) as it is forwarded; s was accepted at S with the same pid tag (☑*).]
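One way to read Cases 1, 2a, and 2b is as a check made at the subscriber's local broker: a publication is withheld when it carries a pid tag that is also stored with the subscription. The function below is a sketch of that reading only; the name `safe_to_deliver` and the set-based encoding of tags are assumptions, and the actual protocol may perform the check differently.

```python
def safe_to_deliver(sub_tags, pub_tags):
    """Deliver only if the publication crossed no partition recorded with the subscription."""
    return not (set(sub_tags) & set(pub_tags))

pid = ("C", frozenset({"D"}))  # hypothetical partition identifier

# Case 1: s accepted with no pid, so any publication may be delivered.
print(safe_to_deliver(set(), set()))
# Case 2a: s carries a pid, but p was never tagged with it on its way.
print(safe_to_deliver({pid}, set()))
# Case 2b: p was tagged with the same pid that is stored with s.
print(safe_to_deliver({pid}, {pid}))
```

Under this reading, Case 2b is the only one in which delivery is withheld, which preserves the invariant that no partial stream reaches the subscriber.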

Evaluation
Using a mix of simulation and experimental deployments on a large-scale testbed.

Simulation Results

Size of brokers’ neighborhoods as a function of ∆

[Figure: size of ∆-neighborhoods for ∆=1, 2, 3, 4. Network size of 1000, broker fanout of 3.]

Impact of Failures on End-to-End Broker Reachability

Using a graph simulation tool.

Overlay setup:
▪ Network size 1000; brokers with fanout=3
Failure injection:
▪ Failures: up to 100 brokers
▪ We randomly marked a given number of nodes as failed
Measurements:
▪ We counted the number of end-to-end broker pairs whose intermediate primary tree path contains ∆ consecutive failed brokers in a chain.

[Figure: curves for ∆=1, 2, 3, 4]
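The measurement described above can be approximated with a short graph routine. This is a sketch, not the simulation tool used in the talk: the function name, the adjacency-map input, and the use of orderable broker names (to count each pair once) are assumptions.

```python
from collections import deque

def unreachable_pairs(tree, failed, delta):
    """Count live broker pairs whose tree path has >= delta consecutive failed brokers."""
    count = 0
    for src in (b for b in tree if b not in failed):
        # Per visited node: (current run of failed brokers, path already blocked?)
        state = {src: (0, False)}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            run, blocked = state[node]
            for nxt in tree[node]:
                if nxt in state:
                    continue
                nrun = run + 1 if nxt in failed else 0
                nblocked = blocked or nrun >= delta
                state[nxt] = (nrun, nblocked)
                queue.append(nxt)
                # src < nxt counts every unordered pair exactly once.
                if nxt not in failed and nblocked and src < nxt:
                    count += 1
    return count

# Hypothetical chain A-B-C-D-E-F with two adjacent failures.
tree = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D", "F"], "F": ["E"],
}
print(unreachable_pairs(tree, {"C", "D"}, 2), unreachable_pairs(tree, {"C", "D"}, 3))
```

The routine runs one traversal per live broker, so it is quadratic in the network size, which is manageable for a network of 1000 brokers.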

Experimental Deployments: Impact of Failures on Pub Delivery

500 brokers deployed on 8-core machines in a cluster:
  Network setup: overlay fanout=3.
  We measured the aggregate publication delivery count in an interval of 120s.
  The Expected bar is the number of publications that must be delivered despite failures (this excludes traffic to/from failed brokers).

[Figure: delivery counts for ∆=1, 2, 3, 4 compared with the Expected bar]

Conclusions

We developed a reliable P/S system that tolerates concurrent broker and link failures:
  Configuration parameter ∆ determines the level of resiliency against failures (in the worst case).
  Dissemination trees are augmented with neighborhood knowledge.
  Neighborhood knowledge allows brokers to maintain network connectivity and make forwarding decisions despite failures.

We studied the performance of the system when the number of failures far exceeds ∆:
  A small value for ∆ ensures good connectivity.

Questions…
Thanks for your attention!

Challenges

Why does the “end-to-end” principle not work?
  Publishers and subscribers are decoupled and unaware of each other.

Routing paths are established by dynamically inserted subscriptions
  Subscription propagation is also subject to broker/link failure.

Selective delivery makes in-order delivery over redundant paths difficult
  Subscribers are only interested in a subset of what is published.

Responsibility on P/S messaging system

Subscription propagation algorithm

We use a special form of tree dissemination

Store-and-Forward

A copy is first preserved on disk

Intermediate hops send an ACK to the previous hop after preserving

ACKed copies can be dismissed from disk

Upon failures, unacknowledged copies survive the failure and are re-transmitted after recovery
  This ensures reliable delivery but may cause delays while the machine is down

[Figure: publication P forwarded hop by hop “from here” “to here”, with an ack returned to the previous hop at each step]
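A toy model of store-and-forward along a chain of hops. The `Hop` class, the `push` function, and the in-memory dictionary standing in for the on-disk store are assumptions made to keep the sketch self-contained.

```python
class Hop:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.disk = {}  # in-memory stand-in for the on-disk store

def push(chain, msg_id):
    """Advance msg_id from the hop currently holding it; return where it stops."""
    holder = next(i for i, hop in enumerate(chain) if msg_id in hop.disk)
    for prev, nxt in zip(chain[holder:], chain[holder + 1:]):
        if not nxt.up:
            return prev.name  # the unacknowledged copy stays at prev
        nxt.disk[msg_id] = prev.disk[msg_id]  # next hop preserves a copy first...
        del prev.disk[msg_id]                 # ...then its ACK lets prev dismiss its own
    return chain[-1].name

hops = [Hop("A"), Hop("B"), Hop("C")]
hops[0].disk["m1"] = "IBM quote"

hops[2].up = False
print(push(hops, "m1"))  # stalls at B while C is down
hops[2].up = True
print(push(hops, "m1"))  # re-transmitted after recovery, reaches C
```

The message is never lost, but it waits at B for as long as C is down, which is the delay the slide refers to.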

Mesh-Based Overlay Networks [Snoeren, et al., SOSP 2001]

Use a mesh network to concurrently forward msgs on disjoint paths

Upon failures, the msg is delivered using alternative routes

Pros: Minimal impact on delivery delay

Cons: Imposes additional traffic & possibility of duplicate delivery

[Figure: copies of publication P forwarded concurrently over disjoint paths “from here” “to here”]

Replica-based Approach [Bhola, et al., DSN 2002]

Replicas are grouped into virtual nodes
  Replicas have identical routing information

We compare against this approach

[Figure: physical machines grouped into virtual nodes; publications P forwarded between virtual nodes]

Publication Forwarding (cont’d)

Case 2: Sub s has been accepted with some pid.

Case 2b (partition barrier): The publisher’s broker has also not accepted s

[Figure: chain A-B-C-D-E with P, S, and a second subscriber R. Subscription r is accepted along the chain (☑r); s is accepted with a pid tag (☑*). Publications p1 and p2 are labeled “matches r & s” and “matches s”. p1 is tagged with pid (p1*) as it is forwarded; s was accepted at S with the same pid tag.]

Subscription Propagation with Partitions

Partition islands: if partition brokers are reachable from the other side of the partition
  Simply confirm (and accept) subscriptions over available

Intuition:
  Publications from P may only be lost if they arrive at B
  But this will not happen since there is no link towards B from F
  The correctness proof argues on the precedence of acceptance and creation of links

[Figure: chain of brokers with P and S and a lead broker; subscriptions and confirmations are exchanged between accepting brokers (☑). Label: will accept during recovery.]

Subscription Propagation with Partition Barriers

If a portion of the network that includes publishers is on/beyond a partition barrier, there is no way to communicate the subscription information for the duration of the failures.

The lead broker “partially confirms” the subscription and tags the confirmation with the partition information
  Accepting brokers store the partition information along with the subscription
  This ensures liveness

[Figure: chain of brokers with P and S. The lead broker forwards the subscription and issues a partial conf; accepting brokers store s with the tag (☑*). Label: Δ hops.]

Publication Forwarding

Only accepted subscriptions are stored in SRT and used for matching

At each point in time, a broker has a number of connections to its nearest reachable neighbors
  This set of active connections may change over time

Publication forwarding steps:
1. Store publication in a FIFO internal message queue
2. Match and compute set of {from} for subscriptions that match
3. For each partially confirmed subscription, tag the publication with the partition information
4. Send the publication to the closest reachable neighbors towards {from}
5. Once all confirmations arrive, discard publication and issue confirmation towards publisher

[Figure: broker A with its internal queue of publications (P), and subscribers (S) and a publisher (P) in its (δ+1)-neighborhood]

Evaluations

Size of brokers’ neighborhoods as a function of ∆

[Figure: two panels showing the size of ∆-neighborhoods for ∆=1, 2, 3, 4. Panel 1: network size of 1000, broker fanout of 3. Panel 2: network size of 1000, broker fanout of 7.]

Overlay Links Management

Sessions: FIFO communication links between brokers.

Active sessions: Broker A’s session to B is active if A has no session to another broker C on the path between A and B.

[Figure: primary tree with ∆ = 2]
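The definition of an active session translates almost directly into a set comprehension. The function name and the `path_to` map (the brokers strictly between A and each peer on the primary tree) are assumptions introduced for this sketch.

```python
def active_sessions(sessions, path_to):
    """Return the peers whose session is active from broker A's point of view.

    sessions: peers A currently has a session to.
    path_to[peer]: brokers strictly between A and peer on the primary tree.
    """
    return {
        peer for peer in sessions
        if not any(mid in sessions for mid in path_to[peer])
    }

# Broker A on a hypothetical chain A-B-C-D.
path_to = {"B": [], "C": ["B"], "D": ["B", "C"]}

print(active_sessions({"B", "C"}, path_to))  # B is closer, so only B is active
print(active_sessions({"C"}, path_to))       # without a session to B, C is active
```

When a closer broker recovers and a session to it is established, sessions to brokers behind it stop being active under this rule.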

Agenda

Challenges of reliability and fault-tolerance in P/S

Our approach
  Topology neighborhood knowledge
  Subscription propagation
  Publication forwarding
  Recovery procedure

Evaluation results

Conclusions