1111 reliable network/service infrastructures. 222 availability, reliability and survivability...

1111

Reliable Network/Service Infrastructures

222

Availability, Reliability and Survivability

Availability

• The expected ratio of system uptime to total elapsed time

• Empirical factor

• Probabilistic

– Expected time between failures

– Expected time to recover

$A = \dfrac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$

Reliability

• The probability that the system remains available (does not fail) over a given period of time

• Empirical factor

• Probabilistic

– Expected time between failures

$R(t) = e^{-\lambda t}$, where $\lambda = \dfrac{1}{\mathrm{MTBF}}$ and $t$ is the time interval

Survivability

• The capability of the system to continue its operation and fulfill its mission, at full or limited scale, during a failure

• Non-probabilistic

– Assumes explicit failures of different span and magnitude
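The two formulas on this slide, A = MTBF / (MTBF + MTTR) and R(t) = e^(-λt) with λ = 1/MTBF, can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the sample MTBF and MTTR values are the router figures used later in the deck.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: expected uptime divided by total elapsed time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of no failure during an interval of t_hours.

    Assumes a constant failure rate lambda = 1 / MTBF (exponential model).
    """
    failure_rate = 1.0 / mtbf_hours
    return math.exp(-failure_rate * t_hours)


if __name__ == "__main__":
    mtbf, mttr = 16000.0, 24.0  # hours; router figures from the later slide
    print(f"Availability A        = {availability(mtbf, mttr):.6f}")
    print(f"Reliability R(1 year) = {reliability(8760.0, mtbf):.4f}")
```

Note that the two measures answer different questions: a device can be highly available (short repairs) and still have a modest probability of running a full year without any failure.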

333

Availability    Downtime per Year (24x7x365)

99.000%         3 Days 15 Hours 36 Minutes
99.500%         1 Day 19 Hours 48 Minutes
99.900%         8 Hours 46 Minutes
99.950%         4 Hours 23 Minutes
99.990%         53 Minutes
99.999%         5 Minutes
99.9999%        30 Seconds
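The downtime figures in this table follow directly from the availability percentage: downtime = (1 - availability) × one year. A minimal sketch, assuming a 365-day year as in the table header; the function name is hypothetical and the table itself rounds the smaller values.

```python
def downtime_per_year(availability_percent: float, days_per_year: int = 365) -> tuple:
    """Return expected yearly downtime as (days, hours, minutes, seconds)."""
    seconds_per_year = days_per_year * 24 * 60 * 60
    unavailable_fraction = 1.0 - availability_percent / 100.0
    down_seconds = round(unavailable_fraction * seconds_per_year)

    # Break the total down into calendar units.
    days, remainder = divmod(down_seconds, 24 * 60 * 60)
    hours, remainder = divmod(remainder, 60 * 60)
    minutes, seconds = divmod(remainder, 60)
    return days, hours, minutes, seconds


if __name__ == "__main__":
    for pct in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999):
        d, h, m, s = downtime_per_year(pct)
        print(f"{pct:>8}%  {d} d {h} h {m} min {s} s")
```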

What Is “High Availability”?

• The ability to define, achieve, and sustain “target availability objectives” (e.g. 99.9%, 99.99%, 99.999%) across the services and/or technologies supported in the network, in line with the objectives of the business

444

Leading Causes of Downtime

SOURCE: Graph Data: The Yankee Group, The Road to a Five Nines Network, Feb 2004.

• Change management

• Process consistency

• Communications

• Links

• Hardware Failure

• Design

• Environmental issues

• Natural disasters

Telco/ISP: 35%

Power Failure: 14%

Human Error: 31%

Hardware Failure: 12%

Unresolved: 8%

555

Link/Circuit Diversity

[Figure: three ways of connecting an Enterprise to the Service Provider Network, compared in sequence: "THIS Is Better Than… THIS, which Is Better Than… THIS"]

But what is beyond this???

666

Network Point of Presence/Data Center

• Cable management

• Power: Diversity/UPS

• HVAC

• Hardware placement

• Physical security

• Labeling

• Environmental control systems

777

Technology Can Increase MTBF

People, Process, and Politics Can Increase Complexity

THIS DECREASES MTBF and Increases MTTR

Network Complexity

Network Design

888

Network Design

• Hierarchical

• Modular and consistent

• Scalable

• Manageable

• Reduced failure domain (Layer II/III)

• Interoperability

• Performance

• Availability

• Security

Primary Design Considerations

999

Examples of Hardware Reliability (Reliability Block Diagrams)

Hardware Reliability = 99.938% with 4 Hour MTTR (325 Minutes/Year)

Hardware Reliability = 99.961% with 4 Hour MTTR (204 Minutes/Year)

Hardware Reliability = 99.9999% with 4 Hour MTTR (30 Seconds/Year)

101010

Network Availability Calculation

Routers R1, R2, R3 and R4

MTBF = 16000 Hours

MTTR = 24 Hours

(Can Include Hardware + Software Components)

1. Router Availability (R1, R2, R3 and R4)

= 16000/(16000+24) = 0.9985

2. Availability of R1, R2 in Series (and of R3, R4 in Series)

= (0.9985 × 0.9985) = 0.997006 (computed from the unrounded router availability, 0.998502)

3. Availability of R1, R2 in Parallel with R3, R4

= 1 - ((1 - 0.997006)(1 - 0.997006)) = 0.99999104

4. Network Availability = 99.999%

(Based Only on Device Availability Values; Link Availability Not Included)

[Figure: R1 and R2 in series on one path, R3 and R4 in series on a second path, the two paths in parallel]
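The calculation on this slide combines two rules: components in series multiply their availabilities, and parallel paths fail only when every path fails. A short sketch, assuming independent failures and the topology implied by the slide (R1 and R2 in series, R3 and R4 in series, the two paths in parallel); the helper names are illustrative only.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one path must be up: 1 minus the product of the unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Router figures from the slide.
MTBF_HOURS = 16000.0
MTTR_HOURS = 24.0

router = MTBF_HOURS / (MTBF_HOURS + MTTR_HOURS)  # each of R1..R4
path_r1_r2 = series(router, router)              # R1 -> R2
path_r3_r4 = series(router, router)              # R3 -> R4
network = parallel(path_r1_r2, path_r3_r4)       # either path suffices

if __name__ == "__main__":
    print(f"Router availability : {router:.6f}")
    print(f"Series path         : {path_r1_r2:.6f}")
    print(f"Network (parallel)  : {network:.8f}")
```

As on the slide, link availability is left out; it could be included by adding each link as another term in the relevant `series` call.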

111111

High Availability - Layered Approach

Application Level Resiliency: Global Server Load Balancing and positioning Gateways, gatekeepers, SIP servers, DB servers

Protocol Level Resiliency: NSF/SSO, HSRP, VRRP, GLBP, IP Event Dampening, Graceful Restart (GR): BGP, IS-IS, OSPF, EIGRP; OER, BGP multipath, fast polling, MARP, incremental SPF

Transport/Link Level Resiliency: Circuits, SONET APS, RPR, DWDM, EtherChannel, 802.1d, 802.1w, 802.1s, PVST+, PortFast, BPDU guard, PAgP, LACP, UDLD, StackWise technology, PPP

Device Level Resiliency: Redundant Processors (RP), Switch Fabric, Line Cards, Ports, Power, CoPP, ISSU, Config Rollback
