Self Healing Wide Area Network Services
Bhavjit S Walha, Ganesh Venkatesh
Layout
Introduction • Previous Work • Issues • Solution • Preliminary Results • Problems & Future Extensions • Conclusion
Motivation
Companies may have servers distributed over a wide area network, e.g. the Akamai content distribution network or distributed web servers.
Manual monitoring may not be feasible, and centralized control may lead to problems in case of a network partition.
Typical server applications may crash due to software bugs; since little state is retained, a simple restart is sufficient.
Motivation …
What if peers monitored each other's health? When a crash is detected, they try to restart the failed service; no central monitoring station is involved.
The scheme is loosely based on a worm: it is resilient to sporadic failures and spreads to uninfected nodes. But no backdoor is involved, and it may not always shift to new nodes.
Medusa
All nodes are part of a multicast group, so each node is in touch with all other nodes through Heartbeat messages. Nodes send regular updates to the multicast tree, and all communication goes through reliable multicast.
In case a node goes down, the other nodes try to restart it; a request for service is sent to the multicast group.
Medusa Problems
Scalability: reliable packet delivery is assumed, and state information is shared with all nodes.
Reliable multicast: assumes reliable delivery of packets to all nodes, with no explicit ACKs. Kill operations fail in case of a temporary break in the multicast tree.
Security: there is no way of authenticating packets.
Proposed solution
Nodes form peering relationships with only a subset of other nodes and exchange Hello packets. The scheme is scalable because the degree is fixed.
No central control and no dependence on reliable multicast: a distributed communication protocol with explicit ACKs for packets.
Only a few super-nodes are required to be up at boot time.
Exploits the power of randomly-connected graphs.
Design
Each node continually sends Hello packets to its peer nodes, indicating that everything is up and working.
A timeout indicates something is wrong: an application crash or a network partition.
The scheme is aimed at application crashes. The application should be stateless and remotely restartable, with no code transfer. SSH is needed: a login account and distributed keys.
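The Hello/timeout detection described above can be sketched as a small peer monitor. This is a minimal Python sketch, not the authors' implementation; the interval and timeout values are taken from the performance slide, and the class and method names are hypothetical.

```python
import time

HELLO_INTERVAL = 5   # seconds between Hello packets (value from the slides)
HELLO_TIMEOUT = 22   # seconds of silence before a peer is presumed down

class PeerMonitor:
    """Tracks the last Hello received from each peer and flags timeouts."""

    def __init__(self, peers):
        now = time.monotonic()
        self.last_seen = {peer: now for peer in peers}

    def on_hello(self, peer):
        # A Hello indicates the peer's application is up and reachable.
        self.last_seen[peer] = time.monotonic()

    def timed_out_peers(self):
        # Peers silent longer than HELLO_TIMEOUT: either the application
        # crashed or a network partition separates us from them.
        now = time.monotonic()
        return [p for p, t in self.last_seen.items()
                if now - t > HELLO_TIMEOUT]
```

A timed-out peer cannot be distinguished from a partitioned one at this layer, which is why the design treats the restart attempt as best-effort.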
Initialization
3-5 super-nodes form a fully-connected graph and are expected to be up all the time. They may be under manual supervision and may have information about the topology.
All nodes have information about the super-nodes' IPs.
Super-nodes are responsible for forwarding join requests to other nodes.
Remote start
After a Hello timeout, a remote (re)start is attempted: SSH to the remote node to restart the service. The current implementation requires keys to be distributed beforehand.
The restart launches a small watchdog program which returns immediately and checks whether another copy is already running (the current implementation uses ps).
If the application start fails, do nothing; wait for the next retry to restart.
Possible extension: allow the service to spread.
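The remote-restart step might look like the following sketch, assuming pre-distributed SSH keys as the slides state. The service name, the watchdog command, and the use of pgrep (the slides mention ps) are placeholders; the `run` hook is injectable so the transport can be stubbed.

```python
import subprocess

def remote_restart(host, start_cmd="./watchdog", run=subprocess.run):
    """Attempt to restart the service on `host` after a Hello timeout.

    Assumes SSH keys were distributed beforehand (no password prompt).
    `./watchdog` is a hypothetical program that forks the service and
    returns immediately, as in the design.
    """
    # Check whether another copy is already running on the target.
    check = run(["ssh", host, "pgrep -f myservice || true"],
                capture_output=True, text=True, timeout=30)
    if check.stdout.strip():
        return "already-running"
    # Start the watchdog; on failure do nothing and wait for the next
    # Hello-timeout retry, matching the slides' behaviour.
    result = run(["ssh", host, start_cmd], timeout=30)
    return "started" if result.returncode == 0 else "failed"
```

Doing nothing on failure keeps the protocol simple: the next Hello timeout triggers the retry, so no extra retry state is needed.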
New node comes up…
The node waits for others to contact it. After a timeout, it sends a JoinRequest to a super-node with the number of peers needed, and the super-node forwards this request to other nodes.
AddRequest: some node may ask the new node to become its peer; the new node adds it to its neighbourList and sends an AddACK.
Hello: a node can add a peer to its neighbourList if an unsolicited Hello is received. This is beneficial in case of short temporary failures.
After a request timeout, the node contacts another super-node with another JoinRequest; the timeout can be dynamically specified in the JoinRequestACK.
New node comes up… Random Walk
The JoinRequest is forwarded by the super-node to 3 random nodes on behalf of the new node, and each node forwards it onwards, decreasing the hop count by 1 each time.
When the hop count reaches 0, the node checks whether it can support more peers.
Yes: send an AddRequest to the new node and add it to the neighbourList on receiving an AddACK.
No: ignore the request.
The new node may already have found enough neighbours, due to a duplicate JoinRequest or the repair of a network partition; in that case it replies to the AddRequest with a Die packet.
Shutdown
It is critical to ensure that the targeted nodes actually go down, so a 3-way protocol is used:
Send kill to the target node; the target node replies with die; send dieACK back to the target node.
kill is used when multiple copies are detected, possibly also to balance load.
die is also sent in reply to an unsolicited Hello.
There is no perfect solution in case of a network partition.
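The three-way handshake above (kill -> die -> dieACK) can be sketched from the initiator's side. The `send`/`recv` hooks are assumed transport functions, not part of the original design; a real implementation would also need a retransmission timeout around `recv`.

```python
def shutdown_peer(send, recv, target):
    """Run the 3-way shutdown handshake against `target`.

    send(target, msg) transmits a packet; recv(target) blocks for the
    reply. Both are injected so the sketch stays transport-agnostic.
    """
    send(target, "kill")           # step 1: ask the target to die
    reply = recv(target)
    if reply == "die":             # step 2: target agrees to die
        send(target, "dieACK")     # step 3: target may now exit
        return True
    return False                   # no die seen: retry on next timeout
```

The final dieACK matters: without it, a target whose die reply was lost could not tell whether the initiator still expects it to be alive, which is exactly the ambiguity a network partition leaves unresolved.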
Global Shutdown…
A secret killAll packet is sent by an external program for complete system shutdown and forwarded to all neighbours.
A node does not die until it receives a killACK from everyone. It stops sending Hellos immediately, makes no further restart attempts, and replies only to die, kill and killAll packets. This may generate unnecessary traffic.
Nodes eventually time out on seeing zero neighbours.
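The per-node killAll state machine might look like the following sketch. The class and message names are hypothetical; the eventual zero-neighbour timeout is omitted for brevity.

```python
class KillAllState:
    """Per-node state for the flooded killAll shutdown."""

    def __init__(self, neighbours):
        self.neighbours = set(neighbours)
        self.pending_acks = set()
        self.sending_hellos = True
        self.dead = False

    def on_kill_all(self, send):
        # Stop Hellos immediately, make no further restart attempts,
        # and forward killAll to every neighbour.
        self.sending_hellos = False
        self.pending_acks = set(self.neighbours)
        for n in self.neighbours:
            send(n, "killAll")

    def on_kill_ack(self, peer):
        # The node does not die until it holds a killACK from everyone.
        self.pending_acks.discard(peer)
        if not self.pending_acks:
            self.dead = True
```

Waiting for every killACK is what keeps a partly-delivered killAll from leaving live nodes that would later restart the dead ones.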
Performance
Tested on 6 nodes in GradLab.
Hello interval: 5 s; Hello timeout: 22 s; wait before JoinRequest: 10 s; JoinRequest timeout: 20 s; hop count: 2; initial degree request: 3; super-nodes: 3.
Preliminary tests on PlanetLab
Results
LAN: no timeouts or packet losses observed, no duplicate copies, and killAll works perfectly. Restart latency: 22 s (decreases after a number of restarts). Join latency: 15 s.
PlanetLab: restart latency 27 s; join latency 21 s.
Limitations
Security: the packets are not authenticated.
Stray copies: after a killAll there may be stray copies. These are harmless, as they do not try to spread, but they prevent another copy from running.
No new nodes: node discovery is an open issue. Why should nodes be idle in the first place, and what should be done when the original nodes come back up? A possible solution: send regular updates to the super-nodes, so extra servers can be killed easily.
Parameter tweaking
Hop count for the random walk.
Connectivity: a minimum degree to ensure connectivity, and a maximum degree to spread the failure probability.
Timeouts: the request timeout depends on the hop count; the Hello timeout differs between WAN and LAN; a global timeout covers network partitions and the loss of killACK packets.
Conclusion
Maintaining high availability does not always require central control.
Achieving a global shutdown is problematic.
The connectivity requirements needed to ensure a connected graph at all times still need to be explored.
Thank You !