Reliability and Availability of Cloud Computing (Bauer/Cloud Computing): Summary


14 SUMMARY

Reliability and Availability of Cloud Computing, First Edition. Eric Bauer and Randee Adams. © 2012 Institute of Electrical and Electronics Engineers. Published 2012 by John Wiley & Sons, Inc.

Cloud computing is a business model that enables computing to be offered as a utility service, thereby shifting computing from a capital-intensive activity to an expense item. Just as electric utilities and railroad companies freed consumers of power and land transportation from the capital expense of building private infrastructure, cloud computing enables consumers to focus on solving their specific business problems rather than on building and maintaining computing infrastructure. The U.S. National Institute of Standards and Technology (NIST) offers five essential characteristics of cloud computing:

1. on-demand self-service;
2. broad network access;
3. resource pooling;
4. rapid elasticity; and
5. measured service.

A handful of common characteristics are shared by many computing clouds, including virtualization and geographic distribution. Beyond shifting computing from a capital expense topic to a pay-as-you-go operating expense item, rapid elasticity and other characteristics of cloud computing enable greater flexibility and faster service deployment than traditional computing models.

Virtualization is one of the common characteristics of cloud computing. Virtualization decouples application and operating system software from the underlying hardware by inserting a hypervisor or virtual machine manager above the physical hardware, which presents a virtual machine to the guest operating system and the application software running on that guest operating system. Virtualization technology can boost resource utilization of modern server hardware by permitting several application instances executing in virtual machines to be consolidated onto a smaller number of physical machines, thereby dramatically reducing the number of physical systems required. Applications generally leverage virtualization in one or more of the following usage scenarios:

- Hardware Independence. Virtualization is used to enable applications to be deployed on different (e.g., newer) hardware platforms.

- Server Consolidation. Virtualization is used to enable several different applications to share the same physical hardware platform, thereby boosting utilization of the underlying physical hardware (see the consolidation sketch after this list).

- Multitenancy. Virtualization is used to facilitate offering independent instances of the same application or service to different customers from shared hardware resources, such as offering distinct instances of an e-mail application to different companies.

- Virtual Appliance. Ultimately, software applications can be offered as downloadable appliances that are simply loaded onto virtualized platform infrastructure. While a commercially important application may not yet be as simple to buy and install as a household appliance, like a toaster, virtualization can streamline and simplify the process for customers.

- Cloud Deployment. Virtualization is used to enable applications to be hosted on cloud providers' servers and take advantage of cloud capabilities, such as rapid elastic growth and degrowth of service capacity.
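To make the server-consolidation scenario concrete, the short sketch below packs virtual machine CPU demands onto as few physical hosts as possible using a simple first-fit-decreasing heuristic. This is a minimal illustration, not a method from the book; the function name, VM sizes, and host capacity are all illustrative assumptions.

    # Minimal sketch of server consolidation: pack VM CPU demands onto as
    # few physical hosts as possible using first-fit decreasing. Names and
    # numbers are illustrative assumptions, not from the book.

    def consolidate(vm_demands, host_capacity):
        """Return a list of hosts, each a list of the VM demands placed on it."""
        hosts = []  # each entry: [remaining_capacity, [placed demands...]]
        for demand in sorted(vm_demands, reverse=True):  # largest VMs first
            for host in hosts:
                if host[0] >= demand:          # first host with room wins
                    host[0] -= demand
                    host[1].append(demand)
                    break
            else:                              # no existing host fits: add one
                hosts.append([host_capacity - demand, [demand]])
        return [placed for _, placed in hosts]

    # Ten lightly loaded applications (cores each) fit on 3 shared 8-core
    # hosts instead of 10 dedicated servers, boosting utilization.
    print(consolidate([1, 2, 2, 3, 1, 4, 2, 3, 1, 2], host_capacity=8))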
14.1 SERVICE RELIABILITY AND SERVICE AVAILABILITY

Failures are inevitable. The service impact of failure is measured on two dimensions:

- Extent. How many users or operations are impacted.

- Duration. How many seconds or minutes of service impact accrue.

While extent of failure linearly affects service impact (e.g., impacting 100 user sessions is nominally twice as bad as impacting only 50 sessions), the duration of impact is not linear because of the way modern networked applications are implemented. Failure impacts that are very brief (e.g., less than 10 or perhaps a few hundred milliseconds) are often effectively concealed from end users via mechanisms like automatic protocol message retries for transactions and lost-packet compensation algorithms for streaming media; these brief events are often referred to as transient. Failures that are somewhat longer (e.g., less than a few seconds) will often cause some transactions or sessions to present signs of failure to users, such as web pages failing to load successfully, call attempts failing to complete and return ringback promptly, or noticeable impairments to streaming media sessions. Service users will often accept occasional service failures and retry failed operations, such as canceling a stuck web page and explicitly reloading it, or redialing after a failed call attempt. If the first (or perhaps second) retry succeeds, then the failed operation will typically count against service reliability metrics as a defective operation; since service was only impacted briefly, the service will not have been considered down, so the failure duration will not count as outage downtime. However, if the failure duration stretches to many seconds, then reasonable users will abandon service retries and deem the service to be down, so availability metrics will be impacted. This is illustrated in Figure 14.1.

Since failures are inevitable, the goal of high availability systems is to automatically detect and recover from failures in less than the maximum acceptable service disruption time so that outage downtime does not accrue for (most) failure events, and ideally service reliability metrics are not impacted either, as shown in Figure 14.1. The maximum acceptable service disruption target will vary from service to service based on user expectations, technical factors (e.g., protocol recovery mechanisms), market factors (e.g., how reliable alternative technologies are), and other considerations. Thus, the core challenge of service availability of cloud computing is to assure that inevitable failures are automatically detected and recovered fast enough that users don't experience unacceptable service disruptions.

Figure 14.1. Failure Impact Duration and High Availability Goals. (On a nominally logarithmic timeline: impacts of roughly 100 milliseconds or less are transient; impacts around 1 second degrade service and affect service reliability metrics; impacts beyond the maximum acceptable service disruption, often around 10 seconds, render service unavailable and affect service availability metrics and dissatisfy users.)
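To make the duration thresholds concrete, the sketch below classifies failure events by impact duration and accrues them against either service reliability metrics (defective operations) or service availability metrics (outage downtime). It is a minimal illustration under assumed thresholds (100 ms transient limit, 10 second maximum acceptable service disruption); as noted above, real targets vary from service to service.

    # Classify failure events by impact duration, assuming illustrative
    # thresholds: <= 0.1 s is transient (concealed by retries/compensation),
    # <= 10 s counts as defective operations (reliability impact), and
    # anything longer accrues outage downtime (availability impact).

    TRANSIENT_MAX_SECS = 0.1     # assumed transient threshold
    MAX_ACCEPTABLE_SECS = 10.0   # assumed maximum acceptable disruption

    def classify(duration_secs):
        if duration_secs <= TRANSIENT_MAX_SECS:
            return "transient"            # no metric impact
        if duration_secs <= MAX_ACCEPTABLE_SECS:
            return "reliability_impact"   # defective operation, not downtime
        return "availability_impact"      # outage downtime accrues

    # A year of hypothetical failure events (impact durations in seconds):
    events = [0.05, 0.4, 2.0, 0.02, 45.0, 1.5, 300.0]

    downtime = sum(d for d in events if classify(d) == "availability_impact")
    seconds_per_year = 365.25 * 24 * 3600
    availability = 1 - downtime / seconds_per_year
    print(f"outage downtime: {downtime:.0f} s/year, "
          f"availability: {availability:.6f}")
    # For scale, five 9s (99.999%) allows roughly 5.26 minutes of
    # downtime per year.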
14.2 FAILURE ACCOUNTABILITY AND CLOUD COMPUTING

Information systems require a handful of fundamental ingredients to function. Computing hardware executing application software interworks with client devices via application protocol payloads across IP networks. The computing hardware is installed in a suitable data center environment and must be provided with suitable electrical power. Application software and the underlying hardware inevitably require application, user, and configuration data to provide useful service. Human staff is required to provision and maintain the data, software, hardware, and supporting infrastructure; enterprise policies define the interactions between ingredients and guide the actions of human staff. In addition to ordinary single failures of, say, hardware components, physical systems are vulnerable to force majeure or disaster events, like earthquakes, fires, and floods, which can simultaneously impact multiple ingredients or components. This is illustrated in Figure 14.2 (repeated from Figure 3.4). All of these ingredients are subject to risks, which can compromise ingredient availability, thereby impacting end user service.

Figure 14.2. Eight-Ingredient Plus Data Plus Disaster (8i + 2d) Model. (The system's software and hardware, together with the application payload, IP network, environment, power, human, policy, and data ingredients, all exposed to force majeure and disaster events.)

Traditionally, enterprises broadly factored accountability for failures and outages into three broad buckets: product attributable, customer (or enterprise or user) attributable, and externally attributable. Figure 14.3 (repeated from Figure 4.1) visualizes the traditional factorization of accountability by ingredient. Accountability for each ingredient maps across the three traditional categories as follows:

- Product suppliers are primarily accountable for the hardware and software they supply, and for the ability of that hardware and software to interwork with other systems via defined application protocol payloads.

- Customers are accountable primarily for the physical security, temperature, humidity, and other environmental characteristics of the facility where the hardware is installed, as well as for providing necessary electrical power and IP network connectivity. Customers are also responsible for their application, user, and configuration data, as well as for the operational policies and human staff.

- External: data centers are inherently vulnerable to force majeure and disaster events like earthquakes, tornadoes, fires, and so on. As these risks are not appropriately attributed to either the customer or the supplier, they are placed in this external attributability category.

Figure 14.3. Traditional Outage Attributability. (Product-attributable outages are primarily triggered by the system design, hardware, software, components, or other parts of the system, or by scheduled outages necessitated by the design of the system. Customer-attributable outages are primarily triggered by customer [service provider] procedural errors, or by office environment problems, for example power, grounding, temperature, humidity, or security; the customer is responsible for their own data. External-attributable outages are caused by natural disasters such as tornadoes or floods, and by third parties not associated with the [customer/service provider] or the [supplier].)

Cloud computing fundamentally changes the accountability model because there is no longer a monolithic customer who buys and operates equipment. Instead, there is a cloud service provider who owns and operates cloud computing facilities (analogous to a landlord), and a cloud consumer who leases those computing facilities (analogous to a rental tenant). Thus, the accountabilities that were solely the responsibility of the customer in the traditional model must now be split between the cloud consumer and the cloud service provider.
The exact division is determined by the particular cloud service model (i.e., infrastructure as a service [IaaS], platform as a service [PaaS], or software as a service [SaaS]); Figure 14.4 (repeated from Figure 10.7) gives a typical breakdown of accountabilities, summarized in the sketch following the figure.

- Cloud service providers are responsible for reliable operation of the compute, storage, and networking equipment (including load balancers and security appliances) in the data centers that host their cloud service offering. Along with responsibility for the hardware itself and the base software, the cloud service provider is responsible for providing electrical power and highly reliable IP networking, and for maintaining a physically secure data center with acceptable temperature, humidity, and other environmental conditions. The cloud service provider is also responsible for the human maintenance staff, contractors, and suppliers who support that data center, as well as for the operational policies followed by those people when designing, operating, and maintaining the data center.

- Software suppliers are responsible for delivering and supporting the software running on the virtualized platform. Note that there are often many software suppliers contributing different platform and application software components, and some of the software may even be supplied by the cloud service provider, the cloud consumer, or both.

- The cloud consumer is the enterprise (or individual) who pays for the cloud services. The cloud consumer is responsible for their own enterprise or application data (e.g., user records and inventory data) and service configuration, such as firewall settings and load balancer policies. The cloud consumer is also responsible for the staff that provisions their data (e.g., adding users and manipulating enterprise data) and operates the application. In addition, the cloud consumer is responsible for assuring that force majeure risks are adequately mitigated via service continuity and disaster recovery planning, as well as via georedundancy. While cloud service providers offer the services necessary to construct robust georedundancy configurations and disaster recovery plans, the cloud consumer has primary responsibility for business continuity planning of their service and data.

Figure 14.4. Sample Outage Accountability Model for Cloud Computing. (The cloud service provider is responsible for their data center, the virtualized hardware platforms, and preservation of cloud consumers' data; the software supplier is responsible for their software and its interworking with other elements via IP networking; the cloud consumer is responsible for their staff, operational policies, and the correctness of their data, and for arranging to mitigate force majeure or external events impacting the individual data centers hosting their application.)
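The split of accountabilities sketched in Figure 14.4 can be summarized as a simple mapping from each 8i + 2d ingredient to the party primarily accountable for it. The dictionary below is an illustrative sketch of the breakdown described above, assuming an IaaS-style model; it is not a normative table from the book, and the exact division shifts under PaaS and SaaS.

    # Illustrative mapping of 8i + 2d ingredients to the party primarily
    # accountable under an IaaS-style cloud model, per the breakdown above.
    # Not normative: the exact split varies by service model (IaaS/PaaS/SaaS).

    ACCOUNTABILITY = {
        "hardware":            "cloud service provider",
        "power":               "cloud service provider",
        "environment":         "cloud service provider",
        "ip_network":          "cloud service provider",
        "software":            "software supplier",
        "application_payload": "software supplier",
        "consumer_data":       "cloud consumer",
        "human_staff":         "cloud consumer and cloud service provider",
        "policy":              "cloud consumer and cloud service provider",
        "force_majeure":       "cloud consumer (mitigated via georedundancy "
                               "and disaster recovery planning)",
    }

    def accountable_party(ingredient):
        """Look up who is primarily accountable for a failed ingredient."""
        return ACCOUNTABILITY.get(ingredient, "unattributed; investigate")

    print(accountable_party("power"))          # cloud service provider
    print(accountable_party("force_majeure"))  # cloud consumer (mitigated ...)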
14.3 FACTORING SERVICE DOWNTIME

The outage accountability model of cloud computing can also be factored based on process areas, as shown in Figure 14.5 (repeated from Figure 4.5). The risks to software, hardware, and payload ingredients are primarily managed and mitigated via the design for reliability and quality processes of the equipment and application suppliers. The risks to power, environment, and IP networking infrastructure are primarily considered in the context of data center best practices and standards like [Uptime] and [TIA942]. The data, human, and policy ingredients are addressed via IT service management best practices and standards like ITIL, ISO/IEC 20000, and COBIT. And the risk of force majeure events is mitigated via business continuity and disaster recovery plans. While IT service management generally...
