CHARACTERIZING CLOUD COMPUTING HARDWARE RELIABILITY

Authors: Kashi Venkatesh Vishwanath; Nachiappan Nagappan

Presented By: Vibhuti Dhiman

OUTLINE

» 1. Introduction
» 2. Datacenter Characterization
» 3. Characterizing Faults
» 4. Failure Patterns
» 5. Related Work
» 6. Conclusion

INTRODUCTION

Background :

• “Hardware component failure is the norm rather than exception”

• The presence of survivable networks is insufficient: what if the source and destination computing resources fail?

Abstract :

• Datacenters (DCs) host hundreds of thousands of servers, networked via hundreds of switches and routers, that communicate with each other to coordinate tasks and deliver cloud computing services.

• Each server in turn consists of multiple hard disks, memory modules, network cards, processors, etc., each of which is capable of failing.

• The paper focuses on a detailed analysis of component failures and ties component failure patterns together to arrive at server failure rates for the DCs.

Paper Objectives:

• Explore the relationship between failures and a large number of factors, for instance the age of the machine

• Quantify the relationship between successive failures on the same machine

• Perform predictive exploration in a DC to mine for factors that explain the reason behind failures.

• Show empirically that the reliability of machines that have already seen a hardware failure is completely different from that of servers that have not seen any such event.

DATACENTER CHARACTERIZATION


Data Sources used in the study

1. Inventory of machines: a variety of information about the servers, for instance a unique serial number to identify the server, the location of the datacenter, and the role of the machine.

2. Hardware replacements: part of the trouble tickets that are filed for hardware incidents. They include information such as when the ticket was filed and how the fault was fixed.

3. Configuration of machines: used to track the failure rate of individual components, for instance the number of hard disks and memory modules, their serial IDs, and the associated server ID.

Server Inventory (nature and configuration of machines used in the dataset)

1. Subset of machines: details on part replacement for over 100,000 servers.

2. Age profile of machines: the age of the machine when a fault/repair happened. 90% of the machines in the study were less than 4 years old, but some machines were around 9 years old.

3. Machine configuration: on average there were 4 disks and 5 memory modules per server. 60% of the servers have only 1 disk, but 20% of the servers have more than 4 disks.

CHARACTERIZING FAULTS


Some Statistics

All numbers reported henceforth are normalized to 100 servers.

The authors observed a total of 20 replacements over a period of 14 months, spread across around 9 machines. This corresponds to an Annual Failure Rate (AFR) of about 8%.

The average no. of repairs seen by a ‘repaired’ machine is 2
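
The 8% figure and the average of 2 repairs per repaired machine follow from simple arithmetic on the counts above. A minimal sketch, assuming the AFR is the fraction of machines with at least one repair, scaled linearly from the 14-month window to 12 months (the function name is illustrative and not taken from the paper):

```python
def annual_failure_rate(failed_machines, total_machines, months_observed):
    """Fraction of machines with at least one repair, scaled linearly to 12 months."""
    return (failed_machines / total_machines) * (12 / months_observed)

# Counts from the slide, normalized to 100 servers over 14 months.
afr = annual_failure_rate(failed_machines=9, total_machines=100, months_observed=14)
repairs_per_repaired_machine = 20 / 9

print(f"AFR: {afr:.1%}")  # about 7.7%, which the slide rounds to 8%
print(f"Repairs per repaired machine: {repairs_per_repaired_machine:.1f}")  # about 2.2
```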

The cost per server repair (which includes downtime, the IT ticketing system to send a technician, and hardware repairs) is $300. This amounts to close to 2.5 million dollars for 100,000 servers.
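
The total can be approximated from the per-repair cost and the 8% AFR. This is a rough sketch, assuming each failing server incurs the $300 cost once per year; the slides do not spell out the exact accounting, so the function below is only an illustration:

```python
def annual_repair_cost(num_servers, afr, cost_per_repair):
    """Expected yearly repair cost if each failing server incurs the cost once."""
    return num_servers * afr * cost_per_repair

cost = annual_repair_cost(num_servers=100_000, afr=0.08, cost_per_repair=300)
print(f"${cost:,.0f}")  # roughly $2.4 million, in line with the quoted "close to 2.5 million"
```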

Classifying Failures for Server

Hard disks are not only the most replaced component, they are also the most dominant reason behind server failure!

• Hard Disk: 70%
• RAID Controller: 6%
• Memory: 5%
• Others: 18%

Failure Rate for Components

• Hard Disk: 2.7%
• RAID Controller: 0.7%
• Memory: 0.1%
• Others: 2.4%

Component Failure Rate estimation

• Look at the total number of components of each type and determine the total number of failures of the corresponding type.

• The numbers are approximations, as the data does not provide certain information, such as which one of the many hard disks in a RAID array failed.

• The percentage is obtained by dividing the total number of replacements by the total number of components.

• This can result in double counting the disks in a RAID array, so the values reported are an upper bound on the individual component failure rate.
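
The estimation above is a single division per component type. A minimal sketch, using hypothetical counts chosen only to make the arithmetic easy to follow (they are not the study's raw data):

```python
def component_failure_rate(replacements, population):
    """Replacements divided by installed components: an upper bound on the true rate."""
    if population <= 0:
        raise ValueError("population must be positive")
    return replacements / population

# Hypothetical (replacements, installed components) counts per component type.
counts = {"hard_disk": (27, 1000), "memory": (1, 1000)}
rates = {name: component_failure_rate(r, n) for name, (r, n) in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Because a replacement ticket may not say which disk in a RAID array failed, the numerator can overcount, which is why the result is read as an upper bound.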

Age distribution of hard disk failures

Number of repairs against age in weeks

• In the initial stage, growth is approximately exponential; then, as saturation begins, the growth slows and the count eventually remains constant.

• That is, with age, failures grow almost exponentially, then after a certain saturation phase grow at a constant rate, and eventually taper off.
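
One common function with this shape (fast early growth, then saturation) is the logistic curve. The slides do not say which function, if any, was fit to the data, so the sketch below is only an illustration, and the parameter values are hypothetical:

```python
import math

def logistic(age_weeks, ceiling, rate, midpoint):
    """S-shaped curve: near-exponential growth early, levelling off at `ceiling`."""
    return ceiling / (1.0 + math.exp(-rate * (age_weeks - midpoint)))

# Hypothetical parameters: saturates at 100 repairs, steepest growth at week 100.
curve = [logistic(w, ceiling=100.0, rate=0.1, midpoint=100.0) for w in range(0, 201, 50)]
print([round(v, 2) for v in curve])
```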

Classifying Failures - Second Technique

Classification Trees:

• Goal: To see if failures could be predicted using metrics collected from the environment, operation and design of the servers in the DC.

• Metrics used: datacenter name; location; manufacturer; design (no. of disks, memory capacity)
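
A classification tree picks, at each node, the metric whose split best separates failed from healthy machines. The sketch below shows that idea with Gini impurity on a made-up toy dataset. The rows, field names, and the choice of Gini are assumptions for illustration; the slides do not state which splitting criterion the authors used.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, feature, label="failed"):
    """Weighted Gini impurity after partitioning rows by a categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[label])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def best_feature(rows, features, label="failed"):
    """Feature whose split leaves the least impurity."""
    return min(features, key=lambda f: split_impurity(rows, f, label))

# Toy rows (hypothetical): failures here line up with datacenter, not manufacturer.
rows = [
    {"datacenter": "A", "manufacturer": "X", "failed": True},
    {"datacenter": "A", "manufacturer": "Y", "failed": True},
    {"datacenter": "B", "manufacturer": "X", "failed": False},
    {"datacenter": "B", "manufacturer": "Y", "failed": False},
]
print(best_feature(rows, ["datacenter", "manufacturer"]))
```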

Important observations from Classification Trees

1. None of the following was found to be a significant indicator of failures: the age of the server, the configuration of the server, the location of the server within a rack, or the workload run on the machine.

2. The actual DATACENTER in which the server is located could play an important role in the reliability of the system.

3. The MANUFACTURER is also an interesting result, as different hardware vendors have different inherent reliability values associated with them.

FAILURE PATTERNS


• Examine a number of different predictors for failures

• Metric used: Repairs Per Machine (RPM), obtained by dividing the total number of repairs by the total number of machines.

• Process to plot the graph:

1. Group machines based on the number of hard disks they contain.

2. Look for strong indicators of failure rate in the number of servers, the average age, and the number of hard disks.

3. Plot the RPM as a function of the number of hard disks in a server.
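
The grouping and RPM computation can be sketched as follows. The machine records are hypothetical toy data, and the `repaired_only` flag is an illustrative way to produce both views of the metric (all machines, or only machines with at least one repair):

```python
from collections import defaultdict

def rpm_by_disk_count(machines, repaired_only=False):
    """Repairs Per Machine for each disk count.

    machines: iterable of (num_disks, num_repairs) pairs.
    repaired_only: if True, ignore machines with zero repairs.
    """
    totals = defaultdict(lambda: [0, 0])  # num_disks -> [repairs, machines]
    for num_disks, num_repairs in machines:
        if repaired_only and num_repairs == 0:
            continue
        totals[num_disks][0] += num_repairs
        totals[num_disks][1] += 1
    return {disks: repairs / count for disks, (repairs, count) in sorted(totals.items())}

# Hypothetical (num_disks, num_repairs) records.
machines = [(1, 0), (1, 0), (1, 1), (4, 0), (4, 2), (8, 3), (8, 1)]

all_rpm = rpm_by_disk_count(machines)
repaired_rpm = rpm_by_disk_count(machines, repaired_only=True)
print(all_rpm)
print(repaired_rpm)
```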

Repairs per machine as a function of number of disks. This includes all machines, not just those that were repaired.

Repairs per machine as a function of number of disks. This is only for machines that saw at least 1 repair event.

Understanding Failure Patterns

To Summarize:

» There is some structure present in the failure characteristics of servers that have already seen some failure event in the past.

» There is no such obvious pattern in the aggregate set of machines

» The number of repairs on a machine shows a very strong correlation to the number of disks the machine has

Further understanding Successive Failures

Observation: 20% of all repeat failures happen within a day of the first failure; 50% of all repeat failures happen within 2 weeks of the first failure.
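
Percentages like these are points on the empirical distribution of the gap between a first failure and its repeat. A minimal sketch, using hypothetical gap values (in days) chosen only to illustrate the calculation:

```python
def fraction_within(gaps_in_days, limit_days):
    """Share of repeat failures that occur within `limit_days` of the first failure."""
    if not gaps_in_days:
        raise ValueError("need at least one gap")
    return sum(1 for gap in gaps_in_days if gap <= limit_days) / len(gaps_in_days)

# Hypothetical gaps between first and repeat failure, in days.
gaps = [0, 1, 3, 7, 14, 20, 45, 90, 120, 200]

within_a_day = fraction_within(gaps, 1)
within_two_weeks = fraction_within(gaps, 14)
print(within_a_day, within_two_weeks)
```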

Distribution of Days between successive failures fits the inverse curve very well

Successive Failures:

» The general form of the inverse equation is

D = C1 + C2 / N

where D is the number of days between successive failures, C1 and C2 are constants, and N is the number of times a second repair occurred.
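
Since D = C1 + C2 / N is linear in 1/N, the constants can be estimated with ordinary least squares on the transformed variable. This is a sketch under that assumption; the slides do not describe the authors' fitting procedure, and the data below is synthetic, generated from known constants to check the fit:

```python
def fit_inverse_curve(ns, ds):
    """Least-squares estimate of (C1, C2) in D = C1 + C2 / N."""
    xs = [1.0 / n for n in ns]  # substitute x = 1/N so the model is linear in x
    count = len(xs)
    mean_x = sum(xs) / count
    mean_d = sum(ds) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, ds))
    c2 = sxd / sxx
    c1 = mean_d - c2 * mean_x
    return c1, c2

# Synthetic points generated from C1 = 2, C2 = 30 (not data from the paper).
ns = [1, 2, 3, 5, 10]
ds = [2 + 30 / n for n in ns]

c1, c2 = fit_inverse_curve(ns, ds)
print(round(c1, 6), round(c2, 6))
```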

RELATED WORK


» Jeffrey Dean presented numbers and experiences from running the Google infrastructure. He observed that disk AFR is in the range of 1-5% and the server crash rate is in the range of 2-4%.

» Google: They classified all faults and found that software-related errors account for around 35%, followed by configuration faults at around 30%. Human and networking-related errors are 11% each, and hardware errors are less than 10%.

» Pinheiro et al. [15]: They find that the disk annual failure rate ranges from 1.7% to 8.6%, and that temperature and utilization have low correlation to failures.

» Weihang Jiang et al.: Their conclusion is that disk failure rate is not indicative of storage subsystem failure rate.

CONCLUSION


Cloud computing infrastructure puts the onus on the underlying software, which in turn runs on commodity hardware. This makes cloud computing infrastructure vulnerable to hardware failures.

Hard disks are the number ONE replaced component.

8% of the servers can expect to see at least ONE hardware incident in a given year.

Upon seeing a failure, the chance of seeing another failure on the same server is high. The authors observe that the distribution of successive failures on a machine fits an inverse curve.

It is also observed that location of the datacenter and the manufacturer are the strongest indicators of failures.

Limitations:

• The reports are based on a limited time period of 14 months.

• The results are potentially biased by the environmental conditions, technology, workload characteristics, etc. prevalent during that period.

• The authors do not investigate the cause of the fault or even its timing. The investigation covers only the repair events at a coarse scale and which model they fit.

Thank you!
