Ravana: Controller Fault-Tolerance in SDN
![Page 1: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/1.jpg)
Software Defined Networking: The Data Centre Perspective Seminar
Michel Kaporin (Mišels Kaporins)
13.05.2016
Ravana: Controller Fault-Tolerance in SDN
![Page 2: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/2.jpg)
Agenda
Introduction
Controller Failures in SDN
Ravana Protocol
Correctness
Performance Optimisations
Implementation
Performance Evaluation
![Page 3: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/3.jpg)
Introduction
![Page 4: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/4.jpg)
Single Controller Lacks Reliability
A single controller can become a single point of failure.
Failures lead to service disruptions and incorrect packet processing.
Ideal model: a fault-free SDN
![Page 5: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/5.jpg)
Potential Solution
Apply established distributed systems techniques:
Replicate durable state, using two-phase commit or primary/backup methods with journaling and rollback
Or model the controller as a replicated state machine (RSM)
![Page 6: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/6.jpg)
More to a Solution
Switch state must be handled consistently during failures.
Not easy! Switch semantics are different:
How to process events and execute commands under failures?
How to reason about switch state?
Roll back packets?
![Page 7: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/7.jpg)
Controller Failures in SDN
![Page 8: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/8.jpg)
Total Event Ordering
![Page 9: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/9.jpg)
Total Event Ordering – Design Goal #1
Controller replicas should process events in the same order, so that all controller application instances reach the same internal state.
Figure: Bandwidth Allocation
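To make the ordering goal concrete, here is a minimal sketch of why replicas must see the same event order. The `BandwidthApp` class, its first-come-first-served policy, and the flow names are hypothetical illustrations loosely inspired by the slide's bandwidth allocation figure; they are not taken from the Ravana paper.

```python
from dataclasses import dataclass, field


@dataclass
class BandwidthApp:
    """Toy controller application (hypothetical): grants requests first come, first served."""
    capacity: int
    granted: dict = field(default_factory=dict)

    def handle(self, event):
        flow, demand = event
        # A request is granted only if enough capacity is left when it is processed.
        if demand <= self.capacity:
            self.capacity -= demand
            self.granted[flow] = demand


def replay(events, capacity=10):
    """Run a fresh application instance over the given event sequence."""
    app = BandwidthApp(capacity)
    for event in events:
        app.handle(event)
    return app.granted


log = [("flow_a", 7), ("flow_b", 6)]
master = replay(log)
backup_same_order = replay(log)
backup_other_order = replay(list(reversed(log)))

print(master, backup_same_order, backup_other_order)
```

Under these assumptions, the replica that replays the same order ends in the same state as the master, while the replica that sees the events reversed grants a different flow.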
![Page 10: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/10.jpg)
Exactly-Once Event Processing
![Page 11: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/11.jpg)
Exactly-Once Event Processing – Design Goal #2
All events are processed; none is lost or processed more than once.
Figure: linkdown Under Failures
![Page 12: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/12.jpg)
Exactly-Once Execution of Commands
| Priority | srcip | Action |
|---|---|---|
| 3 | 10.0.0.21/32 | fwd(2) |
| 2 | 10.0.0.0/16 | fwd(3) |
| 1 | 10.0.0.0/8 | fwd(2) |
![Page 13: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/13.jpg)
Exactly-Once Execution of Commands – Design Goal #3
Any given series of commands is executed once and only once on the switches.
Figure: Routing Under Repeated Commands
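The following sketch suggests why repeated commands matter even when the final flow table ends up the same. It starts from the three-rule table shown for this design goal and uses priority-based matching. The command sequence (remove rule 2, then reinstall it) and the helper names `lookup`, `apply`, and `run` are assumptions made for illustration, not the example from the talk's figure.

```python
import ipaddress

# Flow table from the slide: priority -> (source prefix, action).
SLIDE_TABLE = {
    3: ("10.0.0.21/32", "fwd(2)"),
    2: ("10.0.0.0/16", "fwd(3)"),
    1: ("10.0.0.0/8", "fwd(2)"),
}


def lookup(table, src):
    """Return the action of the highest-priority rule whose prefix contains src."""
    addr = ipaddress.ip_address(src)
    matches = [
        (prio, action)
        for prio, (prefix, action) in table.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    return max(matches)[1] if matches else None


def apply(table, command):
    """Execute one (op, priority, prefix, action) command on the table."""
    op, prio, prefix, action = command
    if op == "add":
        table[prio] = (prefix, action)
    elif op == "delete":
        table.pop(prio, None)


def run(table, commands, probe):
    """Apply commands in order and record how the probe address is forwarded after each."""
    trace = []
    for command in commands:
        apply(table, command)
        trace.append(lookup(table, probe))
    return trace


# Hypothetical command series: remove rule 2, then reinstall it.
commands = [("delete", 2, None, None), ("add", 2, "10.0.0.0/16", "fwd(3)")]

once = run(dict(SLIDE_TABLE), commands, "10.0.1.5")
twice = run(dict(SLIDE_TABLE), commands * 2, "10.0.1.5")
print(once, twice)
```

In this sketch the replayed series ends in the same table, but traffic from 10.0.1.5 is sent out of port 2 a second time while the series is re-executed, which is the kind of observable difference exactly-once execution is meant to rule out.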
![Page 14: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/14.jpg)
Ravana Protocol
![Page 15: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/15.jpg)
Ravana
Controller platform that provides the abstraction of a fault-free centralised controller
Entire event-processing cycle = transaction: either all or none of the transaction's components are executed.
Applies existing distributed systems techniques to SDN
![Page 16: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/16.jpg)
Ravana Contributions
Two-phase replication protocol
OpenFlow interface extensions
Correctness properties for a centralised controller
A real, transparent runtime prototype with low overhead
![Page 17: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/17.jpg)
Ravana Components – Replication Protocol
Two-phase replication protocol
Extends RSM
Each phase adds event-processing information to a replicated in-memory log
1st phase: ensures every received event is replicated
2nd phase: conveys that the event-processing transaction has completed
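A minimal sketch of the two log records described on this slide, assuming a single in-process dictionary stands in for the replicated in-memory log. The class name `ReplicatedLog` and its methods are hypothetical; real replication across controller replicas is deliberately left out.

```python
class ReplicatedLog:
    """In-process stand-in for the replicated log (assumption: no real replication here)."""

    def __init__(self):
        self.entries = {}  # event_id -> {"event": ..., "done": bool}
        self.order = []    # event ids in the order they were logged

    def log_received(self, event_id, event):
        # Phase 1: record the event as soon as it is received.
        if event_id not in self.entries:
            self.entries[event_id] = {"event": event, "done": False}
            self.order.append(event_id)

    def log_processed(self, event_id):
        # Phase 2: record that the event-processing transaction has completed.
        self.entries[event_id]["done"] = True

    def unfinished(self):
        """Events that were logged but whose transaction never completed."""
        return [eid for eid in self.order if not self.entries[eid]["done"]]


log = ReplicatedLog()
log.log_received("e1", "packet_in")
log.log_received("e2", "linkdown")
log.log_processed("e1")
print(log.unfinished())
```

With both records in the log, a replica taking over can tell which events still need processing (here only `e2`) without losing or repeating completed ones.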
![Page 18: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/18.jpg)
Ravana Components – Extended Interface
Extended control channel interface (the channel between the controller and the switches):
1. RPC-level ACKs and retransmission mechanisms ensure at-least-once message delivery
2. A unique ID on each message, with receive-side filtering, guarantees at-most-once delivery
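The two mechanisms combine into exactly-once delivery, which the sketch below illustrates over a simulated lossy channel. Everything here is an assumption for illustration: the `Receiver` class, the `send_reliably` helper, and the `losses` iterator that scripts which transmission is dropped are not part of OpenFlow or of the Ravana code.

```python
class Receiver:
    """Receive side: filters duplicates by message ID (at most once)."""

    def __init__(self):
        self.seen_ids = set()
        self.delivered = []

    def receive(self, msg_id, payload):
        if msg_id not in self.seen_ids:
            self.seen_ids.add(msg_id)
            self.delivered.append(payload)
        # An ACK is returned even for duplicates so the sender can stop retrying.
        return ("ACK", msg_id)


def send_reliably(receiver, msg_id, payload, losses, max_tries=10):
    """Send side: retransmit until an ACK arrives (at least once).

    `losses` yields "msg" or "ack" to drop that part of an attempt; once it is
    exhausted, nothing is lost.
    """
    for attempt in range(1, max_tries + 1):
        lost = next(losses, None)
        if lost == "msg":
            continue  # message never reached the receiver, retry
        ack = receiver.receive(msg_id, payload)
        if lost == "ack":
            continue  # receiver got it, but the sender cannot know, retry
        assert ack == ("ACK", msg_id)
        return attempt
    raise TimeoutError("no ACK received")


receiver = Receiver()
attempts = send_reliably(receiver, 1, "flow_mod", iter(["msg", "ack"]))
print(attempts, receiver.delivered)
```

In this run the message is transmitted three times and reaches the receiver twice, yet it is delivered to the application once.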
![Page 19: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/19.jpg)
Protocol Overview
![Page 20: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/20.jpg)
Master Controller Failure Case
If the master fails:
1. A leader election component elects a new master.
2. The new master finishes processing any logged events to catch up with the failed master's state.
3. The new master registers itself as the master with the switches.
4. The new master proceeds with normal controller operation.
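The four steps above can be sketched as one function. This is a simplified illustration under stated assumptions: the election rule (lowest identifier among the surviving replicas), the list-of-dicts log, and the `fail_over` signature are all hypothetical placeholders rather than Ravana's actual mechanisms.

```python
def fail_over(replicas, failed, log, switches, process):
    """Walk through the four failover steps; returns the new master's identifier."""
    # 1. Elect a new master (placeholder rule: lowest identifier still alive).
    alive = [replica for replica in replicas if replica != failed]
    new_master = min(alive)

    # 2. Catch up: finish processing events that were logged but not completed.
    for entry in log:
        if not entry["done"]:
            process(entry["event"])
            entry["done"] = True

    # 3. Register the new master with every switch.
    for switch in switches:
        switch["master"] = new_master

    # 4. Normal controller operation resumes with the new master.
    return new_master


processed = []
log = [
    {"event": "packet_in", "done": True},
    {"event": "linkdown", "done": False},
]
switches = [{"name": "s1", "master": "c1"}, {"name": "s2", "master": "c1"}]
new_master = fail_over(["c1", "c2", "c3"], "c1", log, switches, processed.append)
print(new_master, processed)
```

Only the unfinished `linkdown` event is processed during catch-up; the completed `packet_in` event is left alone.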
![Page 21: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/21.jpg)
Protocol Insights – Potential Failure Cases
Exactly-Once Event Processing: crash cases (i) and (ii)
Total Event Ordering: crash cases (iii) and (iv)
Exactly-Once Command Execution: crash cases (v) and (vi)
![Page 22: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/22.jpg)
Correctness
![Page 23: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/23.jpg)
Observational Indistinguishability
If the trace of observations made by users in the fault-tolerant system is a possible trace in the fault-free system, then the fault-tolerant system is observationally indistinguishable from a fault-free system.
Two properties: safety and liveness
Ravana provides transactional semantics to the entire “control loop”: event delivery, ordering and processing, and command execution
![Page 24: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/24.jpg)
Performance Optimisations
![Page 25: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/25.jpg)
Performance Optimisations
Parallel logging of events: multiple threads write events in parallel; total order is imposed by IDs
Processing multiple transactions in parallel
Pipelining multiple commands without waiting for ACKs; TCP sorts out ordering
Clearing switch buffers: event buffer (Ebuf) and command buffer (Cbuf)
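A small sketch of the parallel-logging idea, assuming each event already carries a numeric ID before it is written. The function `log_in_parallel` and the use of a plain Python list guarded by a lock are illustrative assumptions; they only show that physical write order can differ across threads while the IDs still define one total order.

```python
import threading


def log_in_parallel(events, workers=4):
    """Write (event_id, event) pairs to a shared log from several threads."""
    log = []
    lock = threading.Lock()

    def write(chunk):
        for event_id, event in chunk:
            with lock:
                log.append((event_id, event))

    # Split the events round-robin across the worker threads.
    chunks = [events[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=write, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


events = [(i, f"event-{i}") for i in range(20)]
log = log_in_parallel(events)

# The arrival order in `log` may vary between runs; sorting by ID recovers the total order.
ordered = sorted(log, key=lambda entry: entry[0])
print(ordered[:3])
```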
![Page 26: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/26.jpg)
Implementation
![Page 27: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/27.jpg)
Modifications: controller runtime, switch runtime, control channel
1. Controller Runtime
Ryu
Message-parsing library
Raw messages -> OpenFlow messages
Leader election
ZooKeeper
Failure detected with the help of heartbeat messages
Election as a competition for a master lock
Event logging
Event batching
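To illustrate "election as a competition for a master lock", the sketch below uses an in-process object as a stand-in for a lock held in a coordination service. It does not use ZooKeeper or any ZooKeeper client API; `MasterLock`, `try_acquire`, `release_if_held_by`, and `elect` are hypothetical names, and the release call only models the lock becoming free once the holder's heartbeats stop.

```python
class MasterLock:
    """In-process stand-in for a master lock kept by a coordination service."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, replica):
        # The first replica to ask while the lock is free becomes the holder.
        if self.holder is None:
            self.holder = replica
        return self.holder == replica

    def release_if_held_by(self, replica):
        # Models the lock being freed when the holder is detected as failed.
        if self.holder == replica:
            self.holder = None


def elect(lock, candidates):
    """Every candidate competes for the lock; whoever holds it is the master."""
    for replica in candidates:
        if lock.try_acquire(replica):
            return replica
    return lock.holder


lock = MasterLock()
first_master = elect(lock, ["c1", "c2", "c3"])

lock.release_if_held_by("c1")  # c1 stops sending heartbeats
second_master = elect(lock, ["c2", "c3"])
late_attempt = elect(lock, ["c3"])  # c3 loses: the lock is already taken
print(first_master, second_master, late_attempt)
```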
![Page 28: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/28.jpg)
2. Switch Runtime
Modified Open vSwitch (v1.10)
Event and command buffers
If the master fails, the connection manager sends the buffered events.
Filters check whether a newly received command has already been executed.
![Page 29: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/29.jpg)
3. Control Channel
Modified OpenFlow 1.3 controller-switch interface
EVENT_ACK
CMD_ACK
Ebuf_CLEAR
Cbuf_CLEAR
Unique transaction IDs (XID)
XID field increment on Open vSwitch
![Page 30: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/30.jpg)
Performance Evaluation
![Page 31: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/31.jpg)
Throughput
Figure: Throughput Overhead (flow responses per second)
Ravana’s overhead: Python 16.4%, PyPy 31.4%
![Page 32: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/32.jpg)
Scalability
Figure: Throughput with Different Numbers of Switches
The controller runtime can manage a large number of parallel switch connections efficiently.
![Page 33: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/33.jpg)
Failover Times
Average failover time is 75 ms:
40 ms to detect the failure and elect a new leader
25 ms to catch up with the old master
10 ms to register the role with the switches
Figure: CDF for Failover Time
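As a quick check of the breakdown, the three reported components add up to the reported average. The dictionary keys are paraphrased from the slide; the share calculation is an added illustration, not a figure from the talk.

```python
# Failover time components reported on the slide, in milliseconds.
failover_ms = {
    "detect failure and elect new leader": 40,
    "catch up with old master": 25,
    "register role with switches": 10,
}

total_ms = sum(failover_ms.values())
# Fraction of the total spent in each step (derived here, not stated in the talk).
shares = {step: ms / total_ms for step, ms in failover_ms.items()}
largest_step = max(shares, key=shares.get)

print(total_ms, largest_step)
```

Failure detection and election account for a little over half of the total in this breakdown.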
![Page 34: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/34.jpg)
Throughput Overhead
Figure: Throughput Overhead for Correctness Guarantees
![Page 35: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/35.jpg)
Ravana: Summary
Design Goals and Mechanisms
Different solutions for fault-tolerant controllers
![Page 36: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/36.jpg)
Thank you.
![Page 37: Ravana: Controller Fault-Tolerance in SDN](https://reader033.vdocuments.net/reader033/viewer/2022051201/589ed4c91a28abea498bf9f0/html5/thumbnails/37.jpg)
References
Naga Katta, Haoyu Zhang, Michael Freedman, and Jennifer Rexford. 2015. Ravana: controller fault-tolerance in software-defined networking. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR '15). ACM, New York, NY, USA, Article 4, 12 pages.
Wikipedia contributors, "Ravana," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Ravana&oldid=719503575 (accessed May 10, 2016).