
Load Management and High Availability in Borealis
Magdalena Balazinska, Jeong-Hyon Hwang, and the Borealis team
MIT, Brown University, and Brandeis University
Borealis is a distributed stream processing system (DSPS) based on Aurora and Medusa
• Contract-Based Load Management
• HA Semantics and Algorithms
• Network Partitions
Contract-Based Load Management

Goals:
• Manage load through collaborations between autonomous participants
• Ensure an acceptable allocation, where each node's load is below its threshold

Challenges: incentives, efficiency, and customization

Approach:
1 - Offline, participants negotiate and establish bilateral contracts that:
• Fix or tightly bound the price per unit of load
• Are private and customizable (e.g., performance or availability guarantees, SLAs)
2 - At runtime, load moves only between participants that have a contract. Movements are based on marginal costs:
• Each participant has a private convex cost function
• Load moves when it is cheaper to pay a partner than to process locally

Properties:
• Simple, efficient, and low overhead (provably small bounds)
• Provable incentives to participate in the mechanism
• Experimental result: a small number of contracts and small price ranges suffice to achieve an acceptable allocation
[Figure: participants A, B, B', and C linked by bilateral contracts — one at fixed price p, one with a bounded range [p, p+e], and one at 0.8p. A convex cost function plots total cost (delay, $) against offered load (msgs/sec), with MC(t) at A and MC(t) at B marked over the task's load(t).]

Task t moves from A to B if:
• the unit marginal cost of task t is greater than p at A, and
• the unit marginal cost of task t is less than p at B
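The migration rule can be made concrete with a small sketch. The names below (CostFunction, should_move) and the quadratic cost are illustrative assumptions, not the Borealis implementation; the sketch only shows the marginal-cost test against the contract price p.

```python
# Illustrative sketch of marginal-cost-based load movement (assumed names).

class CostFunction:
    """Private convex cost function: total cost (delay, $) vs. offered load."""
    def __init__(self, coeff: float):
        self.coeff = coeff  # larger coefficient = more expensive node

    def cost(self, load: float) -> float:
        return self.coeff * load ** 2  # quadratic, hence convex

    def unit_marginal_cost(self, base_load: float, delta: float) -> float:
        """Average cost per unit for the `delta` units above `base_load`."""
        return (self.cost(base_load + delta) - self.cost(base_load)) / delta


def should_move(task_load, load_a, load_b, cost_a, cost_b, price):
    """Task t moves from A to B iff unit MC(t) > p at A and unit MC(t) < p at B."""
    mc_at_a = cost_a.unit_marginal_cost(load_a - task_load, task_load)
    mc_at_b = cost_b.unit_marginal_cost(load_b, task_load)
    return mc_at_a > price and mc_at_b < price


# Example: an overloaded, expensive A offloads to a lightly loaded, cheaper B.
A, B = CostFunction(2.0), CostFunction(1.0)
print(should_move(task_load=5, load_a=40, load_b=10,
                  cost_a=A, cost_b=B, price=100.0))  # True
```

Because the cost functions are convex, shedding load lowers A's marginal cost and accepting load raises B's, so repeated pairwise moves drive the system toward an acceptable allocation.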
HA Semantics and Algorithms

Goal: streaming applications can tolerate different types of failure recovery:
• Gap recovery: may lose tuples
• Rollback recovery: produces duplicates but does not lose tuples
• Precise recovery: takes over precisely from the point of failure

Challenges: operator and processing non-determinism

Operator classes, from most to least restrictive:
• Repeatable: Filter, Map, Join
• Convergent: BSort, Resample, Aggregate
• Deterministic / Arbitrary: Union and operators with timeouts are sources of non-determinism

Recovery approaches:
• Upstream backup: lowest runtime overhead
• Active standby: shortest recovery time
• Passive standby: most suitable for precise recovery

[Figures: each approach shown on a chain A → B → C with a replica B'; message exchanges include ACK, Trim, Replay, and Checkpoint.]
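As a concrete illustration of upstream backup, here is a minimal sketch of the output-queue logging, trimming, and replay it relies on. The class and method names (UpstreamBackupQueue, send, ack, replay) are assumptions for illustration, not the Borealis implementation.

```python
# Minimal sketch of upstream backup (illustrative names, not Borealis code):
# a node logs the tuples it sends downstream, trims the log when the
# downstream node acknowledges them, and replays the untrimmed suffix to a
# recovery node after a downstream failure.

from collections import deque

class UpstreamBackupQueue:
    def __init__(self):
        self.log = deque()   # (seq, tuple) pairs not yet acknowledged
        self.next_seq = 0

    def send(self, tup):
        """Log each output tuple before sending it downstream."""
        seq = self.next_seq
        self.log.append((seq, tup))
        self.next_seq += 1
        return seq

    def ack(self, seq):
        """Downstream acknowledged everything up to and including `seq`: trim."""
        while self.log and self.log[0][0] <= seq:
            self.log.popleft()

    def replay(self):
        """On downstream failure, replay unacknowledged tuples to the new node."""
        return [tup for _, tup in self.log]


# Example: downstream acknowledges tuple 0, then fails; only "b" and "c"
# must be reprocessed.
q = UpstreamBackupQueue()
for t in ("a", "b", "c"):
    q.send(t)
q.ack(0)
assert q.replay() == ["b", "c"]
```

This also illustrates the trade-off listed above: logging and ACKs are cheap at runtime, but recovery must reprocess the replayed suffix, which is why upstream backup is slower to recover than a standby.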
Network Partitions

Goal: handle network partitions in a distributed stream processing system

Challenges:
• Maximize availability
• Minimize reprocessing
• Maintain consistency

Approach: favor availability, and use updates to achieve consistency:
• Use connection points to create replicas and stream versions
• Downstream nodes monitor upstream nodes, reconnect to an available upstream replica, and continue processing with minimal disruptions
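A minimal sketch of the failover step follows, assuming hypothetical Replica and DownstreamNode classes (not the Borealis API): the downstream node reconnects to a reachable replica of the same named stream and bumps its own output version, mirroring steps 4-6 of the network partition demonstration below.

```python
# Illustrative sketch of replica failover with stream versions (assumed
# classes; not the Borealis API).

class Replica:
    """A reachable upstream copy of a named stream."""
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable

    def is_reachable(self):
        return self.reachable

    def subscribe(self, stream_name, from_pos):
        print(f"{self.name}: resuming {stream_name} from position {from_pos}")


class DownstreamNode:
    def __init__(self, stream_name, replicas):
        self.stream_name = stream_name
        self.replicas = replicas
        self.current = replicas[0]
        self.last_seen = -1        # last position consumed on the stream
        self.output_version = 0

    def on_upstream_unreachable(self):
        """Fail over to another replica: same stream name, new version."""
        for replica in self.replicas:
            if replica is not self.current and replica.is_reachable():
                replica.subscribe(self.stream_name, from_pos=self.last_seen + 1)
                self.current = replica
                self.output_version += 1  # downstream consumers see a new version
                return replica
        return None  # partitioned from every replica: keep waiting


# Example: B becomes unreachable, so the node reconnects to replica R.
node = DownstreamNode("conn-stream", [Replica("B", reachable=False), Replica("R")])
node.on_upstream_unreachable()
```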

Load Management Demonstration Setup
[Figure: four nodes A, B, C, and D; A, B, and C hold identical contracts at price p, and D's contract is at 0.8p.]

1) Three nodes with identical contracts and an uneven initial load distribution
2) As node A becomes overloaded, it sheds load to its partners B and C until the system reaches an acceptable allocation
3) Load increases at node B, causing system overload
4) Node D joins the system. Load flows from node B to C and from C to D until the system reaches an acceptable allocation
All nodes process a network monitoring query over real traces of connection summaries
[Query diagram: connection information feeds three branches — (1) group by IP, count over 60 s, filter > 100 → IPs that establish many connections; (2) group by IP, count distinct ports over 60 s, filter > 10 → IPs that connect over many ports; (3) group by IP prefix, sum over 60 s, filter > 100 → clusters of IPs that establish many connections.]
[Plot: per-node load over time for nodes A, B, C, and D, annotated in sequence — node A overloaded; A sheds load to B, then to C; acceptable allocation; system overload; node D joins; load flows from C to D and from B to C; acceptable allocation.]
Query: Count the connections established by each IP over 60 sec and the number of distinct ports to which each IP connected
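The query itself is easy to mirror in a few lines. The sketch below assumes connection summaries arrive as (src_ip, dst_port) pairs grouped into 60-second tumbling windows; the thresholds follow the diagram above (> 100 connections, > 10 distinct ports).

```python
# Illustrative sketch of the per-window monitoring query (assumed record
# format: (src_ip, dst_port) pairs from one 60-second window).

from collections import defaultdict

def analyze_window(records):
    conn_count = defaultdict(int)   # group by IP: connection count
    ports = defaultdict(set)        # group by IP: distinct destination ports
    for src_ip, dst_port in records:
        conn_count[src_ip] += 1
        ports[src_ip].add(dst_port)
    heavy_connectors = {ip for ip, n in conn_count.items() if n > 100}
    port_scanners = {ip for ip, p in ports.items() if len(p) > 10}
    return heavy_connectors, port_scanners


# Example window: one IP opens many connections across many ports.
window = [("10.0.0.1", p) for p in range(150)] + [("10.0.0.2", 80)] * 3
print(analyze_window(window))  # ({'10.0.0.1'}, {'10.0.0.1'})
```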

High Availability Demonstration Setup
Identical queries traverse nodes that use different high-availability approaches: Passive Standby, Active Standby, Upstream Backup, and Upstream Backup with Duplicate Elimination.

[Figure: node A feeds four chains, one per approach. The primaries B0, C0, D0, and E0 each have a statically assigned secondary (B0', C0', D0', E0') and a downstream node (B1, C1, D1, E1).]

1) The four primaries, B0, C0, D0, and E0, run on one laptop
2) All other nodes run on the other laptop
3) We compare the runtime overhead of the approaches
4) We kill all primaries at the same time
5) We compare the recovery time and the effects on tuple delay and duplication
[Plots: tuples received, end-to-end delay, and duplicate tuples over time, with the failure point marked, for Passive Standby, Active Standby, Upstream Backup, and Upstream Backup without duplicates.]

Results:
• Active standby has the highest runtime overhead
• Upstream backup has the highest overhead during recovery
• Passive standby adds the most end-to-end delay
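The "Upstream Backup without duplicates" variant implies a duplicate-elimination step downstream. As a hedged sketch, if tuples carry monotonically increasing sequence numbers, a high-water mark suffices to drop the duplicates that rollback recovery replays; this mechanism is assumed for illustration, not necessarily how Borealis implements it.

```python
# Illustrative duplicate-elimination filter (assumed sequence-numbered tuples).

def dedup(tuples_with_seq):
    """Yield each tuple once, dropping replays at or below the high-water mark."""
    high_water = -1
    for seq, tup in tuples_with_seq:
        if seq > high_water:
            high_water = seq
            yield tup

# Example: after recovery, sequences 1-2 are replayed before new data arrives.
stream = [(0, "a"), (1, "b"), (2, "c"), (1, "b"), (2, "c"), (3, "d")]
assert list(dedup(stream)) == ["a", "b", "c", "d"]
```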

Network Partition Demonstration Setup
1) The initial query distribution crosses computer boundaries
2) We unplug the cable connecting the laptops
3) Node C detects that node B has become unreachable
[Figure: nodes A, B, C, and replica R distributed across Laptop 1 and Laptop 2; the query path crosses the laptop boundary, and R is a replica of B's output that remains reachable from C.]
4) Node C identifies node R as a reachable alternate replica: R's output stream has the same name but a different version
5) Node C connects to node R and continues processing from the same point on the stream
6) Node C changes the version of its own output stream
7) When the partition heals, node C remains connected to R and continues processing uninterrupted
End-to-end tuple delay increases while C detects the network partition and re-connects to R
[Plots: end-to-end tuple delay, and sequence number of received tuples over time, distinguishing tuples received through B from tuples received through R.]
No duplicate and no lost tuples after the network partition