Cloud computing is a business model that enables computing to be offered as a utility service, thereby shifting computing from a capital-intensive activity to an expense item. Just as electric utilities and railroad companies freed consumers of power and land transportation from the capital expense of building private infrastructure, cloud computing enables consumers to focus on solving their specific business problems rather than on building and maintaining computing infrastructure. The U.S. National Institute of Standards and Technology (NIST) offers five essential characteristics of cloud computing:
1. on-demand self-service;
2. broad network access;
3. resource pooling;
4. rapid elasticity; and
5. measured service.
A handful of common characteristics are shared by many computing clouds, including virtualization and geographic distribution. Beyond shifting computing from a capital expense topic to a pay-as-you-go operating expense item, rapid elasticity and other characteristics of cloud computing enable greater flexibility and faster service deployment than traditional computing models.

Reliability and Availability of Cloud Computing, First Edition. Eric Bauer and Randee Adams. © 2012 Institute of Electrical and Electronics Engineers. Published 2012 by John Wiley & Sons, Inc.
Virtualization is one of the common characteristics of cloud computing. Virtualization decouples application and operating system software from the underlying hardware by inserting a hypervisor or virtual machine manager above the physical hardware, which presents a virtual machine to the guest operating system and the application software running on that guest operating system. Virtualization technology can boost resource utilization of modern server hardware by permitting several application instances executing in virtual machines to be consolidated onto a smaller number of physical machines, thereby dramatically reducing the number of physical systems required. Applications generally leverage virtualization in one or more of the following usage scenarios:
Hardware Independence. Virtualization is used to enable applications to be deployed on different (e.g., newer) hardware platforms.
Server Consolidation. Virtualization is used to enable several different applications to share the same physical hardware platform, thereby boosting utilization of the underlying physical hardware.
Multitenancy. Virtualization is used to facilitate offering independent instances of the same application or service to different customers from shared hardware resources, such as offering distinct instances of an e-mail application to different companies.
Virtual Appliance. Ultimately, software applications can be offered as downloadable appliances that are simply loaded onto virtualized platform infrastructure. While a commercially important application may not yet be as simple to buy and install as a household appliance, like a toaster, virtualization can streamline and simplify the process for customers.
Cloud Deployment. Virtualization is used to enable applications to be hosted on cloud providers' servers and to take advantage of cloud capabilities, such as rapid elasticity growth and degrowth of service capacity.
14.1 SERVICE RELIABILITY AND SERVICE AVAILABILITY
Failures are inevitable. The service impact of failure is measured on two dimensions:
Extent. How many users or operations are impacted.
Duration. How many seconds or minutes of service impact accrue.
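These two dimensions map onto the standard quantitative metrics discussed in this chapter: service reliability is commonly expressed as defective operations per million attempts (DPM), and service availability as the percentage of time the service is up. A minimal sketch of both calculations (the helper names here are illustrative, not from the text):

```python
def defects_per_million(failed_ops: int, total_ops: int) -> float:
    """Service reliability metric: defective operations per million attempts."""
    return failed_ops / total_ops * 1_000_000


def availability_pct(outage_minutes: float, period_minutes: float) -> float:
    """Service availability: percentage of the period the service was up."""
    return (period_minutes - outage_minutes) / period_minutes * 100


# Example: 26 failed operations out of 2 million attempts, and 5.3 minutes
# of outage in a year (roughly "five 9's" availability).
print(defects_per_million(26, 2_000_000))    # 13.0 DPM
print(availability_pct(5.3, 365 * 24 * 60))  # ~99.999%
```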
While extent of failure linearly affects service impact (e.g., impacting 100 user sessions is nominally twice as bad as impacting only 50 sessions), the duration of impact is not linear because of the way modern networked applications are implemented. Failure impacts that are very brief (e.g., less than 10 or perhaps a few hundred milliseconds) are often effectively concealed from end users via mechanisms like automatic protocol message retries for transactions and lost-packet compensation algorithms for streaming media; these brief events are often referred to as transient. Failures that are somewhat longer (e.g., less than a few seconds) will often cause some transactions or sessions to present signs of failure to users, such as web pages failing to load successfully, call attempts failing to complete or return ringback promptly, or noticeable impairments to streaming media sessions. Service users will often accept occasional service failures and retry failed operations, such as canceling a stuck web page and explicitly reloading it, or redialing after a failed call attempt. If the first (or perhaps second) retry succeeds, then the failed operation will typically count against service reliability metrics as a defective operation; since service was only impacted briefly, the service will not have been considered down, so the failure duration will not count as outage downtime. However, if the failure duration stretches to many seconds, then reasonable users will abandon service retries and deem the service to be down, so availability metrics will be impacted. This is illustrated in Figure 14.1.
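The duration-based distinction above can be summarized as a simple classifier. The cutoff values below are illustrative, taken from the rough ranges in the text rather than normative limits:

```python
def classify_failure(duration_seconds: float,
                     max_acceptable_disruption_s: float = 10.0) -> str:
    """Classify a failure event by its service-impact duration.

    Thresholds are illustrative: very brief impacts are concealed by
    protocol retries and compensation algorithms (transient); somewhat
    longer impacts count against service reliability metrics; beyond the
    maximum acceptable disruption, users deem the service down and
    availability metrics are impacted.
    """
    if duration_seconds < 0.3:  # concealed by retries, packet compensation
        return "transient (no metric impact)"
    if duration_seconds < max_acceptable_disruption_s:
        return "defective operation (reliability metric impact)"
    return "outage (availability metric impact)"


print(classify_failure(0.05))  # transient
print(classify_failure(3))     # defective operation
print(classify_failure(45))    # outage
```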
Since failures are inevitable, the goal of high availability systems is to automatically detect and recover from failures in less than the maximum acceptable service disruption time, so that outage downtime does not accrue for (most) failure events and, ideally, service reliability metrics are not impacted either, as shown in Figure 14.1. The maximum acceptable service disruption target will vary from service to service based on user expectations, technical factors (e.g., protocol recovery mechanisms), market factors (e.g., how reliable alternative technologies are), and other considerations. Thus, the core challenge of service availability of cloud computing is to assure that inevitable failures are automatically detected and recovered fast enough that users don't experience unacceptable service disruptions.
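The concealment of brief failures can be sketched as a bounded retry loop: if the operation succeeds within the maximum acceptable disruption window, the user never perceives an outage; otherwise the failure surfaces. Function and parameter names here are illustrative, not from the text:

```python
import time


def call_with_retries(operation, max_disruption_s=10.0, retry_interval_s=0.5):
    """Retry a failing operation until it succeeds or the maximum
    acceptable service disruption window is exhausted. Brief failures
    are thereby concealed from the user; long ones surface as outages."""
    deadline = time.monotonic() + max_disruption_s
    while True:
        try:
            return operation()
        except Exception:
            if time.monotonic() + retry_interval_s > deadline:
                raise  # disruption exceeds the acceptable window: outage
            time.sleep(retry_interval_s)


# Usage: a hypothetical operation that fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"

print(call_with_retries(flaky))  # "ok" after two concealed retries
```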
Figure 14.1. Failure Impact Duration and High Availability Goals. (The figure depicts a nominal logarithmic timeline from roughly 100 milliseconds through 1 second to about 10 seconds: shorter impacts affect service reliability metrics, while impacts exceeding the maximum acceptable service disruption, often several seconds, accrue service downtime, impact service availability metrics, and dissatisfy users. The high availability goal is to automatically detect and recover from failures in less than the maximum acceptable service disruption latency.)
14.2 FAILURE ACCOUNTABILITY AND CLOUD COMPUTING
Information systems require a handful of fundamental ingredients to function. Computing hardware executing application software interworks with client devices via application protocol payloads across IP networks. The computing hardware is installed in a suitable data center environment and must be provided with suitable electrical power. Application software and the underlying hardware inevitably require application, user, and configuration data to provide useful service. Human staff is required to provision and maintain the data, software, hardware, and supporting infrastructure; enterprise policies define the interactions between ingredients and guide the actions of human staff. In addition to ordinary single failures of, say, hardware components, physical systems are vulnerable to force majeure or disaster events, like earthquakes, fires, and floods, which can simultaneously impact multiple ingredients or components. This is illustrated in Figure 14.2, repeated from Figure 3.4. All of these ingredients are subject to risks, which can compromise ingredient availability, thereby impacting end user service.
Traditionally, enterprises broadly factored accountability for failures and outages into three broad buckets: product attributable, customer (or enterprise or user) attributable, and externally attributable. Figure 14.3 (repeated from Figure 4.1) visualizes the traditional factorization of accountability by ingredients. Accountability for each ingredient maps across the three traditional categories as follows:
Product suppliers are primarily accountable for the hardware and software they supply, and for the ability of that hardware and software to interwork with other systems via defined application protocol payloads.
Customers are accountable primarily for the physical security, temperature, humidity, and other environmental characteristics of the facility where the hardware is installed, as well as for providing necessary electrical power and IP network connectivity. Customers are also responsible for their application, user, and configuration data, as well as for the operational policies and human staff.
Figure 14.2. Eight-Ingredient Plus Data Plus Disaster (8i + 2d) Model.
External: Data centers are inherently vulnerable to force majeure and disaster events like earthquakes, tornadoes, fires, and so on. As these risks are not appropriately attributed to either the customer or the supplier, they are placed in this external attributability category.
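The mapping from the ingredients of the 8i + 2d model to the three traditional accountability buckets might be encoded as a simple lookup table, along the following lines (an illustrative encoding, not from the text; ingredient labels are paraphrased):

```python
# Traditional accountability for each ingredient of the 8i + 2d model.
ACCOUNTABILITY = {
    # Product-attributable ingredients
    "hardware": "product supplier",
    "software": "product supplier",
    "application protocol payloads": "product supplier",
    # Customer-attributable ingredients
    "data center environment": "customer",
    "electrical power": "customer",
    "IP network": "customer",
    "application/user/configuration data": "customer",
    "human staff": "customer",
    "policy": "customer",
    # Externally attributable risks
    "force majeure / disaster": "external",
}

print(sorted(set(ACCOUNTABILITY.values())))
# ['customer', 'external', 'product supplier']
```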