CHARACTERIZING CLOUD COMPUTING HARDWARE RELIABILITY Authors: Kashi Venkatesh Vishwanath ; Nachiappan Nagappan Presented By: Vibhuti Dhiman


Post on 23-Feb-2016


Page 1: CHARACTERIZING CLOUD COMPUTING HARDWARE RELIABILITY Authors :

CHARACTERIZING CLOUD COMPUTING HARDWARE RELIABILITY

Authors: Kashi Venkatesh Vishwanath; Nachiappan Nagappan

Presented By: Vibhuti Dhiman

Page 2:

OUTLINE

» 1. Introduction
» 2. Datacenter Characterization
» 3. Characterizing Faults
» 4. Failure Patterns
» 5. Related work
» 6. Conclusion

Page 3:

INTRODUCTION

Background :

• “Hardware component failure is the norm rather than exception”

• The presence of survivable networks is insufficient: what if the source and destination computing resources themselves fail?

Abstract :

• Datacenters (DCs) host hundreds of thousands of servers, networked via hundreds of switches and routers, that communicate with each other to coordinate tasks and deliver cloud computing services.

Page 4:

• The servers in turn consist of multiple hard disks, memory modules, network cards, processors, etc., each of which is capable of failing.

• The paper focuses on a detailed analysis of component failures and ties component failure patterns together to arrive at server failure rates for the DCs.

Paper Objectives:

• Explore the relationship between failures and a large number of factors, for instance the age of the machine.

• Quantify the relationship between successive failures on the same machine.

• Perform a predictive exploration in a DC to mine for factors that explain failures.

Page 5:

• Show empirically that the reliability of machines that have already seen a hardware failure is completely different from that of servers that have not seen any such event.

Page 6:

DATACENTER CHARACTERIZATION

Page 7:

Data Sources used in the study

1. Inventory of machines: a variety of information about the servers, for instance the unique serial number that identifies the server, the location of the datacenter, and the role of the machine.

2. Hardware replacements: part of the trouble tickets filed for hardware incidents. It includes information such as when the ticket was filed and how the fault was fixed.

3. Configuration of machines: used to track the failure rate of individual components, for instance the number of hard disks and memory modules, their serial IDs, and the associated server ID.

Page 8:

Server Inventory (nature and configuration of machines used in the dataset)

1. Subset of machines: details on part replacements for over 100,000 servers.

2. Age profile of machines: the age of the machine when a fault or repair happened. 90% of the machines in the study were less than 4 years old, but some machines were around 9 years old.

3. Machine configuration: on average there were 4 disks and 5 memory modules per server. 60% of the servers have only 1 disk, but 20% of the servers have more than 4 disks.

Page 9:

CHARACTERIZING FAULTS

Page 10:

Some Statistics

All numbers reported from here on are normalized to 100 servers.

The authors observed a total of 20 replacements over a period of 14 months, spread across around 9 machines. This corresponds to an Annual Failure Rate (AFR) of 8%.

The average number of repairs seen by a ‘repaired’ machine is 2.

The cost per server repair (which includes downtime, the IT ticketing system used to send a technician, and the hardware repair itself) is $300. This amounts to close to 2.5 million dollars for 100,000 servers.
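The arithmetic behind these figures can be checked with a short sketch. It assumes that the AFR is the share of machines with at least one repair, scaled linearly from the 14-month window to 12 months, and that the cost estimate covers one year at the rounded 8% AFR. The variable names are illustrative and do not come from the paper.

```python
# Figures from the slide, normalized to 100 servers over a 14-month window.
SERVERS = 100
REPLACEMENTS = 20
REPAIRED_MACHINES = 9
WINDOW_MONTHS = 14
COST_PER_REPAIR_USD = 300

# Assumption: AFR = fraction of machines repaired, annualized linearly.
afr = (REPAIRED_MACHINES / SERVERS) * (12 / WINDOW_MONTHS)

# Average number of repairs on a machine that saw at least one repair.
repairs_per_repaired_machine = REPLACEMENTS / REPAIRED_MACHINES

# Assumption: yearly repair bill for 100,000 servers at the rounded 8% AFR.
fleet_size = 100_000
annual_cost_usd = fleet_size * 0.08 * COST_PER_REPAIR_USD

print(f"AFR: {afr:.1%}")
print(f"Repairs per repaired machine: {repairs_per_repaired_machine:.1f}")
print(f"Annual repair cost: ${annual_cost_usd:,.0f}")
```

Under these assumptions the AFR comes out at about 7.7%, which rounds to the 8% on the slide, and the yearly cost is $2.4 million, close to the quoted 2.5 million dollars.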

Page 11:

Classifying Failures for Server

Hard disks are not only the most replaced component, they are also the most dominant reason behind server failure!

[Pie chart: share of server failures by component]
Hard Disk: 70%
RAID Controller: 6%
Memory: 5%
Others: 18%

Page 12:

Failure Rate for Components

[Chart: failure rate by component]
Hard Disk: 2.7%
RAID Controller: 0.7%
Memory: 0.1%
Others: 2.4%

Page 13:

Component Failure Rate estimation

• Count the total number of components of each type and the total number of failures of the corresponding type.

• The numbers are approximations, because the data omits certain details, such as which of the many hard disks in a RAID array failed.

• The percentage is obtained by dividing the total number of replacements by the total number of components.

• This can double count disks in a RAID array, so the values reported are an upper bound on the individual component failure rate.
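The estimate described above is a simple ratio, sketched here. The component counts are hypothetical, picked only so that the output lines up with the hard disk and memory rates shown on the previous slide; they are not figures from the paper.

```python
def component_failure_rate(replacements: int, components: int) -> float:
    """Upper-bound failure rate: replacements divided by installed components."""
    if components <= 0:
        raise ValueError("components must be positive")
    return replacements / components

# Hypothetical counts, chosen only to illustrate the calculation.
installed = {"hard_disk": 400_000, "memory": 500_000}
replaced = {"hard_disk": 10_800, "memory": 500}

rates = {name: component_failure_rate(replaced[name], installed[name])
         for name in installed}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Because a RAID replacement may be counted against more than one disk, a ratio computed this way overstates the per-disk rate, which is why the slide treats it as an upper bound.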

Page 14:

Age distribution of hard disk failures

Page 15:

Number of repairs against age in weeks

• In the initial stage, growth in the number of repairs is approximately exponential; then, as saturation begins, growth slows and the count eventually levels off.

• That is, with age, failures grow almost exponentially, then after a certain saturation phase grow at a constant rate, and eventually taper off.
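One common way to model this kind of S-shaped growth is a logistic function. The slides do not name the fitted function, so both the logistic form and the parameter values below are assumptions for illustration only.

```python
import math

def logistic(age_weeks: float, ceiling: float, rate: float, midpoint: float) -> float:
    """Cumulative repairs modelled as a logistic S-curve (assumed form)."""
    return ceiling / (1.0 + math.exp(-rate * (age_weeks - midpoint)))

# Hypothetical parameters: not fitted to the paper's data.
CEILING, RATE, MIDPOINT = 1000.0, 0.05, 100.0

for week in (0, 50, 100, 150, 200, 300):
    print(week, round(logistic(week, CEILING, RATE, MIDPOINT), 1))
```

Far below the midpoint the curve grows almost exponentially, around the midpoint it is close to linear, and well past it the curve flattens toward the ceiling.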

Page 16:

Classifying Failures - Second Technique

Classification Trees:

• Goal: To see if failures could be predicted using metrics collected from the environment, operation and design of the servers in the DC.

• Metrics used: datacenter name; location; manufacturer; design (no. of disks, memory capacity)
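The core step of a classification tree is picking the metric whose split best separates failed from healthy servers. The sketch below shows that step with Gini impurity on categorical metrics. The slides do not say which tree algorithm or impurity measure the authors used, and the server records here are invented for illustration.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_gain(rows, feature, target="failed"):
    """Impurity reduction from splitting rows on one categorical feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    parent = gini([row[target] for row in rows])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return parent - weighted

def best_split(rows, features, target="failed"):
    """Feature with the largest impurity reduction: the root of the tree."""
    return max(features, key=lambda f: split_gain(rows, f, target))

# Hypothetical servers: names and values are invented for illustration.
servers = [
    {"datacenter": "DC-A", "manufacturer": "V1", "disks": 1, "failed": True},
    {"datacenter": "DC-A", "manufacturer": "V2", "disks": 4, "failed": True},
    {"datacenter": "DC-A", "manufacturer": "V1", "disks": 4, "failed": True},
    {"datacenter": "DC-B", "manufacturer": "V1", "disks": 1, "failed": False},
    {"datacenter": "DC-B", "manufacturer": "V2", "disks": 4, "failed": False},
    {"datacenter": "DC-B", "manufacturer": "V2", "disks": 1, "failed": False},
]

features = ["datacenter", "manufacturer", "disks"]
print(best_split(servers, features))
```

In this toy data the datacenter separates the two classes perfectly, so it is chosen as the first split. A full tree would repeat the same selection inside each resulting group.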

Page 17:

Important observations from Classification Trees

1. The age of the server, the configuration of the server, the location of the server within a rack, and the workload run on the machine: none of these was found to be a significant indicator of failures.

2. The actual DATACENTER in which the failed server is located could have an important role to play in the reliability of the system.

3. The MANUFACTURER is also an interesting result, as different hardware vendors have different inherent reliability values associated with them.

Page 18:

FAILURE PATTERNS

Page 19:

• Examine a number of different predictors for failures.

• Metric used: Repairs Per Machine (RPM), obtained by dividing the total number of repairs by the total number of machines.

• Process to plot the graph:

1. Group machines based on the number of hard disks they contain.

2. Look for strong indicators of failure rate in the number of servers, the average age, and the number of hard disks.

3. Plot the RPM as a function of the number of hard disks in a server.
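The grouping and the RPM calculation can be sketched as follows. The fleet records are hypothetical, and the helper name `repairs_per_machine` is illustrative rather than anything defined in the paper.

```python
from collections import defaultdict

def repairs_per_machine(machines):
    """Map disk count -> total repairs / total machines in that group."""
    repairs = defaultdict(int)
    counts = defaultdict(int)
    for disks, repair_count in machines:
        repairs[disks] += repair_count
        counts[disks] += 1
    return {disks: repairs[disks] / counts[disks] for disks in sorted(counts)}

# Hypothetical (disk_count, repair_count) pairs, one per machine.
fleet = [(1, 0), (1, 0), (1, 1), (1, 0), (4, 2), (4, 0), (8, 3), (8, 1)]

rpm = repairs_per_machine(fleet)
for disks, value in rpm.items():
    print(f"{disks} disks: RPM = {value:.2f}")
```

Restricting the input to machines with at least one repair, for example `repairs_per_machine([m for m in fleet if m[1] > 0])`, gives the variant that only covers repaired machines.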

Page 20:

Repairs per machine as a function of number of disks. This includes all machines, not just those that were repaired.

Page 21:

Repairs per machine as a function of number of disks. This is only for machines that saw at least 1 repair event.

Page 22:

Understanding Failure Patterns

To Summarize:

» There is some structure present in the failure characteristics of servers that have already seen some failure event in the past.

» There is no such obvious pattern in the aggregate set of machines.

» The number of repairs on a machine shows a very strong correlation to the number of disks the machine has.

Page 23:

Further understanding Successive Failures

Observation: 20% of all repeat failures happen within a day of the first failure; 50% of all repeat failures happen within 2 weeks of the first failure.

The distribution of days between successive failures fits an inverse curve very well.

Successive Failures:

» The general form of the inverse equation is

D = C1 + C2 / N

where D is the number of days between successive failures, C1 and C2 are constants, and N is the number of times the second repair occurred.
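Because the inverse equation is linear in 1/N, the constants can be estimated with ordinary least squares. The sketch below assumes that approach; the slides do not say how the authors fitted the curve, and the data points are synthetic, generated from made-up constants.

```python
def fit_inverse_curve(ns, days):
    """Least-squares fit of D = C1 + C2 / N, linear in x = 1 / N."""
    xs = [1.0 / n for n in ns]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_d = sum(days) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, days))
    c2 = sxd / sxx
    c1 = mean_d - c2 * mean_x
    return c1, c2

# Synthetic points generated from made-up constants (C1 = 2, C2 = 60).
ns = [1, 2, 3, 4, 5, 6]
days = [2.0 + 60.0 / n for n in ns]

c1, c2 = fit_inverse_curve(ns, days)
print(f"C1 = {c1:.2f}, C2 = {c2:.2f}")
```

On noise-free synthetic data the fit recovers the constants used to generate it. With real repair logs the residuals would show how well the inverse form actually holds.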

Page 24:

RELATED WORK

Page 25:

» Jeffrey Dean presented numbers and experiences from running the Google infrastructure. He observed that disk AFR is in the range of 1-5% and the server crash rate is in the range of 2-4%.

» Google: They classified all faults and found that software-related errors account for around 35%, followed by configuration faults at around 30%. Human and networking-related errors are 11% each, and hardware errors are less than 10%.

» Pinheiro et al. [15]: They find that annual disk failure rates range from 1.7% to 8.6%, and that temperature and utilization have low correlation with failures.

» Weihang Jiang et al.: Their conclusion is that disk failure rate is not indicative of storage subsystem failure rate.

Page 26:

CONCLUSION

Page 27:

Cloud computing infrastructure puts the onus on the underlying software, which in turn runs on commodity hardware. This makes cloud computing infrastructure vulnerable to hardware failures.

Hard disks are the number ONE replaced component.

8% of the servers can expect to see at least ONE hardware incident in a given year.

Upon seeing a failure, the chance of seeing another failure on the same server is high. The authors observe that the distribution of successive failures on a machine fits an inverse curve.

It is also observed that the location of the datacenter and the manufacturer are the strongest indicators of failures.

Page 28:

Limitations:

• The reports are based on a limited time period of 14 months.

• The results are potentially biased by the environmental conditions, technology, workload characteristics, etc. prevalent during that period.

• The authors do not investigate the cause of a fault or even its timing. The investigation covers only repair events at a coarse scale and the model they fit.

Page 29:

Thank you!