whitepaper - netmagic solutions · data centers. in 2005, the telecommunications industry...


WHITEPAPER

Data center outages: impact, causes, costs, and how to mitigate

Data centers sometimes fail. You can build in safeguards, fail-safe mechanisms, and redundancy through backup systems but, like all engineered systems, data centers can -- and sometimes do -- fail. Table 1 lists some of the notable data center outages of 2011 and 2012 and shows that even the biggest brands, with access to the best technology and resources, can suffer from data center outages.


Netmagic Solutions

TABLE 1: Notable Data Center Outages in 2011 and 2012

| Year | Who | How long | What happened | Impact |
|---|---|---|---|---|
| 2012 | Huffington Post, Buzzfeed, Gawker and several others | Few days | Water flooded data centers in New York after Hurricane Sandy | Several websites and other services down |
| 2012 | Twitter | Few hours | Both primary and backup systems failed | A well-publicized campaign to encourage athletes and visitors to the Olympics to tweet was affected |
| 2012 | Salesforce | 7 hours | Power failure in data center | CRM services to customers affected |
| 2011 | Bank of America | 6 days | Online banking down across U.S. | 29 million users affected |
| 2011 | Amazon Web Services | 4 days | Amazon EC2 (Elastic Compute Cloud) services went down | Users affected worldwide |
| 2011 | Intuit | 2 – 4 days | Customers lost access to applications such as TurboTax Online, QuickBooks Online, Quicken and QuickBase | Several thousand |
| 2011 | Google | 2 days | Gmail affected | 120,000 users affected |
| 2011 | Blackberry | 24 hours plus | Unavailable worldwide | Millions of users affected |
| 2011 | Yahoo | 24 hours plus | Yahoo Mail outage | Users affected worldwide |
| 2011 | Microsoft | 24 – 72 hours | Windows Live, Hotmail inboxes disappear | Users affected worldwide |
| 2011 | Verizon | 24 hours plus | Series of data outages | Several US states unable to get LTE service |
| 2011 | Netflix | 4 – 8 hours | Netflix streaming service affected | 20 million users affected |

Source: See Ref 1, Ref 2

So how can businesses ensure that disruptions due to data center glitches are minimized?

First, some perspective. Using an outsourced data center is, in almost all cases, far more reliable and cost-effective for a company than building one in-house. That’s because a third-party data center can spread the very high cost of the technology, infrastructure, and personnel that go into building it across multiple customers. In fact, the economies of scale are so compelling that while data centers are growing in size, they are declining in number (see Ref 3). In other words, more companies are outsourcing more of their IT infrastructure to third-party data centers.

Second, it helps to know what makes up a data center in order to better understand what is involved in keeping it robust.

What is inside a data center?

A data center is a configuration of server rooms, cooling units, storage, batteries, and generators. At the core of a data center are racks and racks of servers. Servers need power, lots of it -- a typical large data center occupies 50,000 square feet of space and consumes 5 MW of power. Consuming so much power generates massive amounts of heat. This heat is carried away by cooling units that force cool air from the floor, through the racks, and into ducts above.

Data centers collect and store vast amounts of data. This data needs to be stored safely, often for several years (as in the case of financial information). Storage hardware is therefore kept in secure locations, for example in underground mines.

Since data centers run on power and utility power can fail, every data center has batteries for backup: thousands of them, stacked up and constantly being charged. In the event of a power failure, these battery banks provide power.

But batteries can provide power for only a few minutes at most. To provide power during longer power failures and blackouts, most data centers have banks of diesel generators on standby. And since these massive diesel generators need fuel, data centers need to store thousands of liters of diesel fuel.

Causes and cost of data center outage

Information on data centers is hard to come by. Because data centers are critical pieces of IT infrastructure and store sensitive customer data, data center managers are fiercely protective of their privacy. Probably the only major surveys of data center outages and the costs associated with them are two studies by the Michigan-based Ponemon Institute, sponsored by Emerson Network Power. Both studies are limited to U.S. data centers but can be considered representative of the industry.


Datacenter outages – the Indian context

In the 2011 Data Center Risk Index published by hurleypalmerflatt, an engineering consultancy, and Cushman & Wakefield, a real estate consultancy, India ranked last of the 20 countries assessed for the risk associated with running a data center. The U.S., Canada, and Germany were at the top of the rankings.

On the face of it, this is a dismal ranking for a country that is at the center of the global outsourcing revolution. On closer look though, things are not as bad as they seem. To begin with, the Data Center Risk Index is a weighted average of 11 macro and local factors covering a wide range of attributes, from the cost of energy to political instability to inflation to availability of water. Depending on their priorities and approaches to risk, individual customers will arrive at significantly different assessments of risk.

This was best highlighted during the world’s largest power blackout, when an estimated 600 million people in the northern half of India lost power for two days in July 2012. In spite of the massive disruption across several areas of the economy, from public transport to industry to hospitals, there were no reports of major disruptions in data centers anywhere in India (see Ref 4). One ostensible reason is that the bulk of the data centers are located in Mumbai and the south of India while the blackout was in the northern half of India. But the real reason is that India has a chronic power problem and data centers are geared to work through intermittent, low, and no power from public utilities. Most third-party data centers have power backup for days on end; it’s just another risk to be managed.
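To see how a weighted-average index can lead different customers to different conclusions, here is a minimal Python sketch. The four factor names are taken from the description above, but every score and weight below is hypothetical and chosen only for illustration; they are not values from the Data Center Risk Index, and the function `weighted_score` is not part of any published methodology.

```python
def weighted_score(scores, weights):
    """Weighted average of factor scores (higher = more favorable, by assumption)."""
    total_weight = sum(weights.values())
    return sum(scores[factor] * weights[factor] for factor in weights) / total_weight

# Hypothetical scores for one location on a 0-100 scale (illustrative only).
scores = {
    "energy cost": 40,
    "political stability": 60,
    "inflation": 50,
    "water availability": 70,
}

# Two hypothetical customers with different priorities.
equal_weights = {factor: 1 for factor in scores}
energy_sensitive = {**equal_weights, "energy cost": 5}

equal_result = weighted_score(scores, equal_weights)
energy_result = weighted_score(scores, energy_sensitive)
print(f"Equal weights:    {equal_result:.1f}")
print(f"Energy-sensitive: {energy_result:.1f}")
```

With the same underlying scores, the customer who weights energy cost heavily arrives at a noticeably lower overall score, which is the sense in which a single published ranking may not reflect an individual customer's assessment.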

Outage causes

The first study, National Survey on Data Center Outages, published in September 2010, surveyed 453 individuals responsible for data center operations in the U.S. Of these, 95% said they had an unplanned data center outage in the last two years. Each respondent averaged 2.48 complete shutdowns with an average downtime of 107 minutes. Apart from complete shutdowns, respondents reported far more frequent partial rack- or row-based outages: an average of 6.8 row-based outages with an average downtime of 152 minutes, and an average of 11.2 rack-based outages with an average duration of 153 minutes in a two-year period.
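As a rough way to read these averages together, the short Python sketch below multiplies each average outage count by its average duration. This is a back-of-the-envelope estimate, not a figure from the study: it assumes count and duration are uncorrelated, and the dictionary and function names are illustrative.

```python
# Survey averages for a two-year period, taken from the paragraph above:
# (average number of outages, average minutes per outage)
OUTAGE_STATS = {
    "complete shutdown": (2.48, 107),
    "row-based outage": (6.8, 152),
    "rack-based outage": (11.2, 153),
}

def estimated_downtime_minutes(stats):
    """Approximate total minutes per outage type as count x average duration."""
    return {kind: count * minutes for kind, (count, minutes) in stats.items()}

downtime = estimated_downtime_minutes(OUTAGE_STATS)
for kind, minutes in downtime.items():
    print(f"{kind}: about {minutes:.0f} minutes ({minutes / 60:.1f} hours) over two years")
```

Note that row- and rack-based outages affect only part of the facility, so their minutes are not directly comparable to minutes of complete shutdown.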

The most frequently cited root causes of data center outages were UPS battery failure (65%), UPS capacity exceeded (53%), human error (51%), and UPS equipment failure (49%).

The most common responses to unplanned outages were to repair, replace or purchase additional IT or infrastructure equipment, followed by contacting the equipment vendor for support.

TABLE 2: Data Center Resilience Tier Levels

Tier 1: Basic (99.671% availability)
- Susceptible to disruptions from both planned and unplanned activity
- Single path for power and cooling distribution, no redundant components (N)
- May or may not have a raised floor, UPS, or generator
- Takes 3 months to implement
- Annual downtime of 28.8 hours
- Must be shut down completely to perform preventive maintenance

Tier 2: Redundant Components (99.741% availability)
- Less susceptible to disruptions from both planned and unplanned activity
- Single path for power and cooling distribution, includes redundant components (N+1)
- Includes raised floor, UPS, and generator
- Takes 3 to 6 months to implement
- Annual downtime of 22.0 hours
- Maintenance of power path and other parts of the infrastructure requires a processing shutdown

Tier 3: Concurrently Maintainable (99.982% availability)
- Enables planned activity without disrupting computer hardware operation, but unplanned events will still cause disruption
- Multiple power and cooling distribution paths, but with only one active path, includes redundant components (N+1)
- Includes raised floor and sufficient capacity and distribution to carry load on one path while performing maintenance on the other
- Takes 15 to 20 months to implement
- Annual downtime of 1.6 hours

Tier 4: Fault Tolerant (99.995% availability)
- Planned activity does not disrupt critical load, and the data center can sustain at least one worst-case unplanned event with no critical load impact
- Multiple active power and cooling distribution paths, includes redundant components (2 (N+1), i.e., 2 UPS each with (N+1) redundancy)
- Takes 15 to 20 months to implement
- Annual downtime of 0.4 hours
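The availability percentages and annual downtime figures in Table 2 are two views of the same quantity. The Python sketch below shows the usual conversion, assuming a 365-day year of 8,760 hours; the function name is illustrative and not part of the TIA-942 standard.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

# Availability percentages as listed in Table 2.
TIER_AVAILABILITY = {1: 99.671, 2: 99.741, 3: 99.982, 4: 99.995}

def annual_downtime_hours(availability_percent):
    """Expected hours of downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

for tier, percent in TIER_AVAILABILITY.items():
    hours = annual_downtime_hours(percent)
    print(f"Tier {tier}: {percent}% availability = about {hours:.1f} hours of downtime per year")
```

For Tiers 1, 3, and 4 this reproduces the table (28.8, 1.6, and 0.4 hours). For Tier 2 the same formula gives roughly 22.7 hours rather than the 22.0 hours listed, so the published figures appear to have been rounded or derived slightly differently.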


Going up the levels has a significant cost impact -- construction costs for Tier 3, for instance, are double those for Tier 1. So organizations need to carefully determine an appropriate tier level for their different needs. eBay, for example, started out with all its applications in a Tier 4 data center until it analyzed its needs more closely and determined that 80% of its equipment could be shifted out without loss of reliability: search, for instance, could be in a Tier 2 center, whereas databases and network backbones needed to be in a Tier 4 center. eBay says it cut its data center Capex and Opex by half by matching applications to data center tier level (see Ref 5).
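The idea of matching applications to tiers can be sketched as picking the lowest tier whose availability meets each application's requirement. The Python example below uses the availability figures from Table 2; the per-application requirements are hypothetical values chosen for illustration and are not eBay's actual criteria.

```python
# Availability percentages as listed in Table 2.
TIER_AVAILABILITY = {1: 99.671, 2: 99.741, 3: 99.982, 4: 99.995}

def lowest_sufficient_tier(required_percent):
    """Return the lowest tier meeting the requirement, or None if no tier does."""
    for tier in sorted(TIER_AVAILABILITY):
        if TIER_AVAILABILITY[tier] >= required_percent:
            return tier
    return None

# Hypothetical availability requirements per application (illustrative only).
requirements = {"search": 99.7, "database": 99.99}

placement = {app: lowest_sufficient_tier(pct) for app, pct in requirements.items()}
for app, tier in placement.items():
    print(f"{app}: Tier {tier}")
```

In practice the decision would also weigh cost, maintenance windows, and how the application itself handles failover, none of which this sketch models.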

How to mitigate data center outages

Experts recommend the following to minimize data center outages and mitigate damage:

- Invest in better equipment. It’s tempting to save money by buying cheap, but the cost of hardware failure is very high.
- Provide redundancy. Relying on any single machine or a single component in the core architecture is disastrous.
- When it comes to crucial data, never assume that someone else is automatically protecting you. Have backups.
- Have your data available on multiple servers in multiple data centers. Even consider having them in different geographical regions and spread between different service providers.
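The benefit of spreading data across multiple data centers can be estimated with a simple probability sketch. The Python example below assumes that sites fail independently of one another, which is an idealization; shared regions, providers, or software can cause correlated failures, which is one reason the recommendation above stresses different geographies and providers. The availability figure is the Tier 2 value from Table 2, and the function name is illustrative.

```python
def combined_availability(availabilities):
    """Probability that at least one site is up, assuming independent failures."""
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down

TIER_2 = 0.99741  # Tier 2 availability from Table 2, as a fraction
TIER_4 = 0.99995  # Tier 4 availability from Table 2, as a fraction

two_tier2_sites = combined_availability([TIER_2, TIER_2])
print(f"One Tier 2 site:   {TIER_2:.5%}")
print(f"Two Tier 2 sites:  {two_tier2_sites:.5%}")
print(f"One Tier 4 site:   {TIER_4:.5%}")
```

Under the independence assumption, two Tier 2 sites serving the same data come out ahead of a single Tier 4 site on paper, though real deployments also depend on how quickly traffic can fail over between sites.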

Outage costs

The second Ponemon Institute study, Calculating the Cost of Data Center Outages, published in February 2011, surveyed 41 independent data centers in the U.S. that had experienced at least one complete or partial unplanned shutdown in the previous 12 months.

The survey revealed that data center outages have significant financial consequences, ranging from a minimum cost of $38,969 to a maximum of $1,017,746 per organization. The average cost of a data center outage was $505,502 per incident. (US$1 = 55 INR.)

How to evaluate data center reliability

Historically, data centers have been designed in the absence of established standards. This made it very difficult for network managers to choose technologies to build and benchmark data centers. In 2005, the Telecommunications Industry Association (TIA) published TIA-942, the first standard to specifically address data center infrastructure. TIA-942 covers site space and layout, cabling infrastructure, tiered reliability, and environmental considerations.

Of these, the tiered reliability standards are directly useful to organizations looking to evaluate data center resilience across vendors.

The TIA standards, based on a system pioneered by the New York-based Uptime Institute in the mid-nineties, prescribe architectural, security, electrical, mechanical, and telecommunications recommendations.

There are four tiers of availability from Tiers 1 to 4, with Tier 4 being the most resilient. See Table 2 for a description of the tiers – redundancy is indicated in terms of N where N represents only the necessary system need.


Conclusion

Data center outages are real and they can cause significant loss of revenue. The frequency and duration of outages vary with the size of the data center: the smaller the data center, the longer and more common the outages. IT equipment failure is the most expensive root cause and human error is the least expensive.

But the benefits of outsourcing IT infrastructure to a third-party data center far outweigh the risks. As with all engineered systems, the risk is quantifiable and manageable.

References:

Ref 1: Major data center outages in 2011: http://www.evolven.com/blog/2011-devastating-outages-major-brands.html

Ref 2: Salesforce outage: http://www.informationweek.com/cloud-computing/software/salesforce-outage-follows-data-center-po/240003577

Ref 3: U.S. Datacenters Growing in Size But Declining in Numbers, IDC press release, 9 Oct 2012

Ref 4: India’s Blackout, DataCenter Dynamics, Penny Jones, 31 July 2012, http://www.datacenterdynamics.com/blogs/penny-jones/india%E2%80%99s-blackout

Ref 5: Matching applications to data center tier level: http://blog.uptimeinstitute.com/2011/07/matching-applications-to-data-center-tier-level/

The content you have downloaded has been produced with thoughtful, original research efforts by Netmagic. Please do not duplicate or misuse it. You may quote portions of our research in your own material provided you include a proper attribution to this original source. You are free to share this content on the web with friends and colleagues.

© 2013 Netmagic Solutions. All rights reserved.

http://blog.netmagicsolutions.com http://twitter.com/netmagic http://linkedin.com/company/netmagic

www.netmagicsolutions.com

1800 103 3130