Scuola Superiore Sant’Anna
QoS-Aware Fault Tolerance in Grid Computing
L. Valcarenghi, F. Cugini, F. Paolucci, and P. Castoldi
Scuola Superiore Sant'Anna and CNIT
Pisa, Italy
Workshop on Reliability and Robustness in Grid Computing Systems, GGF16
Feb. 13-16, 2006, Athens, Greece
Outline

• Grid fault tolerance
• Quality of Service (QoS) aware fault tolerance
• Integrated QoS aware fault tolerance
• Performance evaluation
• Conclusions
Approaches for Grid Fault Tolerance

[Figure: the Layered Grid Architecture — Application (end-user applications), Collective (collective resource control), Resource (resource management), Connectivity (inter-process communication, protection), Fabric (basic hardware and software) — mapped onto the Application/Middleware, Transport, Internet/Network, and Link layers, with the fault tolerant schemes acting at each level]

• Application/Middleware: tasks and data replicas; Condor-G (checkpointing, migration, DAGMan); GT2/GT3 (GridFTP, Reliable File Transfer (RFT), Replica Location Service (RLS)); application-specific fault tolerant schemes based on middleware fault detection
• Transport (TCP/IP stack): Fault Tolerant TCP (FT-TCP)
• Internet/Network and Link: delegated to WAN, MAN, and LAN resilience schemes
• Fabric: delegated to HW, SW, and farm failover schemes
QoS-Aware Fault Tolerance

• Def. QoS-Aware Fault Tolerance
– Capability of overcoming both hardware and software failures while maintaining communication QoS guarantees
• Def. QoS-Capable Layer
– Grid layer capable of guaranteeing communication QoS
• Example
– Upon failure of the main data center, bandwidth-guaranteed connectivity must be provided to the data replica center
QoS Awareness in Grid Fault Tolerance

[Figure: the Grid layers (Application, Collective, Resource, Connectivity, Fabric) split into a QoS unaware part — Application, Middleware, TCP/IP — and a QoS capable part — MPLS, SONET/SDH, Optical Transport Network (OTN)]
Application/Middleware Fault Tolerance

• Advantages
– Network layer independent
– Flexible (e.g., degree of failover dependent on the application)
• Drawbacks
– Application dependent
– User driven
– Need for TCP synchronization
– Slow reaction to failures
– Not scalable (e.g., in CPU and storage)
– No communication QoS guarantees
QoS Capable TCP/IP

• Dynamic rerouting with DiffServ
• Advantages
– Pervasiveness
• Drawbacks
– No traffic engineering
• Example
– After rerouting, packets with the same Type of Service (ToS) compete for the same insufficient resources along the shortest path
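The drawback above can be made concrete with a small sketch (illustrative only; `per_flow_share` and its proportional-sharing model are assumptions, not part of the slides):

```python
def per_flow_share(demands, path_capacity):
    """Toy model of DiffServ rerouting without traffic engineering:
    after a failure, OSPF sends every flow of the same ToS class onto
    the one new shortest path, with no admission control.  We assume
    the bottleneck simply scales all flows down proportionally once
    total demand exceeds the path capacity."""
    total = sum(demands)
    if total <= path_capacity:
        return list(demands)          # every flow keeps its rate
    scale = path_capacity / total     # guarantees are lost for all flows
    return [d * scale for d in demands]
```

For example, two 6 Mb/s flows rerouted onto a shortest path with 10 Mb/s of spare capacity each end up with 5 Mb/s — below their nominal guarantee, even though a longer path with room for both may exist.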
QoS Aware Fault Tolerance Below Layer 3

• QoS aware fault tolerance through connection-oriented communication in a QoS capable layer
– Multi-Protocol Label Switching (MPLS): Label Switched Paths (LSPs)
– SONET/SDH: SONET/SDH paths
– Optical Transport Network (OTN): lightpaths (i.e., wavelength channels)
Implementing Multi-layer QoS Aware Fault Tolerance

• Assumptions
– Grid computing services requiring communication QoS guarantees (e.g., collaborative visualization)
– QoS parameter: minimum bandwidth
• Objective
– Maximize recovered connections and minimize required network resources upon network link failure
• Proposed approach
– Integrate QoS unaware layer and QoS capable layer fault tolerance ⇒ QoS aware integrated fault tolerance
– QoS capable layer fault tolerance: (G)MPLS path restoration
– Software layer fault tolerance: service replication (server migration)
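The integration order can be sketched as follows — a minimal, hypothetical model (the `Network` class and its single-backup-route capacity bookkeeping are assumptions made for illustration, not the authors' implementation): try (G)MPLS path restoration first, and fall back to service replication only when restoration is blocked.

```python
class Network:
    """Toy capacity model: residual bandwidth on one precomputed
    backup route per (src, dst) pair."""
    def __init__(self, residual):
        self.residual = residual          # {(src, dst): available bandwidth}

    def restore_lsp(self, src, dst, bw):
        """Admit a restoration LSP only if the backup route still has
        enough residual bandwidth to honor the QoS guarantee."""
        if self.residual.get((src, dst), 0) >= bw:
            self.residual[(src, dst)] -= bw
            return True
        return False

def recover(src, server, bw, network, replicas):
    # Step 1: (G)MPLS path restoration toward the original server.
    if network.restore_lsp(src, server, bw):
        return ("path_restoration", server)
    # Step 2: service replication (server migration), attempted only
    # when restoration is blocked by insufficient capacity.
    for replica in replicas:
        if network.restore_lsp(src, replica, bw):
            return ("service_replication", replica)
    return None  # unrecovered: counts toward the blocking probability
```

Trying restoration first avoids the unnecessary server migrations of the non-integrated scheme, while the replication fallback avoids the blocking of pure path restoration.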
Network Layer Fault Tolerance (Path Restoration) Issue: Blocking

[Figure: upon failure of the primary LSP between the client and the primary video server, the backup LSP cannot be established because of insufficient capacity; the backup video server is left unused]
Software Layer Fault Tolerance Issue: Wasted Network Capacity

[Figure: the client is always recovered through an LSP to the backup video server, which implies 1:1 path protection and hence wasted network capacity]
Non Integrated Fault Tolerance Issue: Unnecessary Server Migration when a Backup LSP is Available

[Figure: in the absence of inter-layer coordination, the service migrates over an LSP to the backup video server even though a backup LSP to the primary video server is available]
Integrated Fault Tolerance Advantages: Path Restoration + Service Replication

[Figure: with inter-layer coordination, the backup LSP to the primary video server is used when available, and the LSP to the backup video server is set up only when path restoration is blocked]
Simulation Scenario

• Physical network
– 100 randomly generated connection matrices
– Bidirectional connection generation
– Bidirectional connection rerouting
• Expected network blocking probability
– Average ratio between the number of unrecovered connections and the number of failed connections
• Expected replica utilization ratio
– Average ratio between the number of locations utilized for service replication and the number of locations allowed for service replication
• Expected path restoration utilization
– Average number of times the original server location node is utilized as replica location, normalized to the number of replica locations utilized
• Expected connection path length
– Average length of the connection paths to reach the replica location
• Evaluation scenarios
– Limited number of replicas: per location; per failed connection between an (s,d) pair
– Limited distance (hops) of allowed replica locations
– Minimum required replication flow
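The first two metrics are plain averages over the simulation runs; as a sketch (function names and the `(count, numerator)` tuple layout are ours, for illustration):

```python
def expected_blocking_probability(runs):
    """Expected network blocking probability: the average, over the
    randomly generated connection matrices, of the ratio between
    unrecovered connections and failed connections.
    runs: iterable of (failed, unrecovered) pairs, one per matrix."""
    ratios = [unrecovered / failed for failed, unrecovered in runs if failed]
    return sum(ratios) / len(ratios)

def expected_replica_utilization(runs):
    """Expected replica utilization ratio: the average ratio between
    replica locations actually utilized and locations allowed.
    runs: iterable of (allowed, utilized) pairs, one per matrix."""
    ratios = [utilized / allowed for allowed, utilized in runs if allowed]
    return sum(ratios) / len(ratios)
```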
Integrated Restoration Performance

• Integrated restoration outperforms OSPF dynamic rerouting resilience
• Integrated restoration performs as well as service migration resilience but, by utilizing path restoration, decreases the need for service synchronization and restart
Integrated Multi-layer QoS Aware Fault Tolerance

[Figure: message sequence of the integrated recovery, with two possible outcomes]
• Network restoration: notification to the ingress router, followed by connection rerouting
• Integrated server migration: connection tear down by the ingress and transit routers, notification to the Service Agent, the Service Agent sets up a new connection from another server, and the client switches over
Rerouting after Connection Preemption

• Stream carried on a guaranteed-bandwidth LSP
• A new LSP at higher priority is set up
• The LSP carrying the stream is preempted and rerouted on a new path
• Throughput at the client falls during LSP rerouting (100 ÷ 900 ms interruption)
• Migration is not needed because LSP rerouting is fast enough
Migration after Connection Preemption (Integrated Scheme)

• The Agent sets up an LSP from the primary server and periodically checks the LSP status
• The LSP carrying the stream is preempted
• The preemption is detected by the Agent
• The Agent sets up an LSP from the backup server
• As soon as the LSP is up, the ingress router notifies the Agent
• The Agent then notifies the client of the server migration

Recovery takes about 2-4 seconds to complete:
• Less than 1 second (polling period) for fault detection
• 2-3 seconds for network reconfiguration
• Network latency negligible
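The detection step of the scheme above is polling-based, so its latency is bounded by one polling period. A minimal sketch (the function and its callback interface are ours; `lsp_is_up` stands in for the real status query to the ingress router):

```python
import time

def monitor_lsp(lsp_is_up, on_preemption, poll_period=1.0, max_polls=10):
    """Polling-based fault detection: the Agent checks the LSP status
    every poll_period seconds.  Detection latency is therefore at most
    one polling period (< 1 s in the experiment above).
    Returns True if a preemption was detected within max_polls polls."""
    for _ in range(max_polls):
        if not lsp_is_up():
            on_preemption()   # e.g., set up the LSP from the backup server
            return True
        time.sleep(poll_period)
    return False
```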
Replica Placement Problem

• Objective
– Utilize service replication to guarantee a high percentage of recovered inter-service connections
– Limit the number of allowed replica locations to optimize utilized computational resources
– Utilize simple and efficient heuristics for replica placement
• Proposed approach
– Place replicas in nodes adjacent to the original service location
– Form clusters of nodes with the same service replicas (i.e., service islands)
• Expected results
– Improved service inter-connectivity recovery
– High replica utilization ratio
HOP Heuristic

• Place replicas in all the nodes reachable within H hops of the original service location

[Figure: example with H=1 on a server of nodal degree 4 — all four neighbors host a service replica]
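The HOP heuristic is a plain breadth-first search of depth H; a sketch (the adjacency-list representation is an assumption for illustration):

```python
from collections import deque

def hop_heuristic(adj, server, h):
    """HOP heuristic: place a service replica in every node reachable
    within h hops of the original service location.
    adj is an adjacency-list dict, e.g. {0: [1, 2], 1: [0], 2: [0]}."""
    visited = {server}
    frontier = deque([(server, 0)])
    replicas = set()
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue                      # do not expand past h hops
        for nbr in adj[node]:
            if nbr not in visited:
                visited.add(nbr)
                replicas.add(nbr)
                frontier.append((nbr, dist + 1))
    return replicas
```

With h=1 and a server of nodal degree 4, the replicas are exactly the four neighbors, as in the slide's figure.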
Super-Node Degree Heuristic

• Place replicas in the nodes reachable within H hops of the original service location that increment the nodal degree of the super-node formed at the previous step

[Figure: example with H=1 — starting from a server of nodal degree 4, a neighbor is merged only if the resulting super-node degree increases (here from 4 to 5); a candidate that does not increase the degree is rejected]
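One plausible reading of the rule above can be sketched as follows (this is our interpretation, not the authors' code; in particular the acceptance order of candidates, sorted here, is an assumption):

```python
def supernode_degree_heuristic(adj, server, h):
    """Grow the service island hop by hop, merging a neighbor into the
    island only if doing so increases the super-node's nodal degree,
    i.e. the number of links from the island to outside nodes."""
    island = {server}

    def degree(nodes):
        # Nodal degree of the super-node formed by 'nodes'.
        return sum(1 for n in nodes for nbr in adj[n] if nbr not in nodes)

    for _ in range(h):
        candidates = {nbr for n in island for nbr in adj[n]} - island
        for cand in sorted(candidates):
            if degree(island | {cand}) > degree(island):
                island.add(cand)
    return island - {server}   # replica locations
```

Rejecting neighbors that do not increase the super-node degree is what gives the heuristic its better replica utilization ratio than plain HOP, at a similar blocking probability.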
HOP and Super-Node Degree

• H: number of hops from the destination within which candidate service island nodes are considered
• α = 0: no capacity needed for replica update (static replica placement)
• α = 1: full capacity (same as client-server) for replica update (dynamic replica placement)
• H = ∞ and α = 0 give the lower bound on the expected network blocking probability
• For both α = 0 and α = 1
– Super-node degree and HOP heuristics yield similar expected restoration blocking probability
– Super-node degree yields a better expected replica utilization ratio
Summary

• Need for QoS aware fault tolerance in Grid computing
– Not only recovering connectivity but also maintaining communication QoS parameters
• Proposed implementation
– Integration of application/middleware fault tolerant schemes (e.g., replication) with QoS capable layer (i.e., below layer 3) fault tolerance (e.g., LSP restoration)
• Further work
– QoS aware fault tolerance as a Grid service
– Other QoS parameters
E-mail: [email protected]
Sant'Anna School & CNIT, CNR research area, Via Moruzzi 1, 56124 Pisa, Italy