When the Cloud is a Rockin': High Availability in Apache CloudStack

The Cloud Specialists When the Cloud is a Rockin': High Availability in Apache CloudStack shapeblue.com @ShapeBlue John Burwell @john_burwell VP of Software Engineering

Upload: john-burwell

Post on 16-Apr-2017



TRANSCRIPT

Page 1: When the Cloud is a Rockin: High Availability in Apache CloudStack

The Cloud Specialists

When the Cloud is a Rockin': High Availability in Apache CloudStack

shapeblue.com • @ShapeBlue

John Burwell • @john_burwell
VP of Software Engineering

Page 2: When the Cloud is a Rockin: High Availability in Apache CloudStack


About Me

• VP of Software Engineering @ ShapeBlue
• Member, Apache CloudStack PMC (June 2013)
• Ran operations and designed automated provisioning for analytic/virtualization clouds
• Led architectural design and server-side development of a SaaS physical security platform

Page 3: When the Cloud is a Rockin: High Availability in Apache CloudStack


There’s No “I” in Team

• Rohit Yadav
• Abhi Prateek
• Murali Reddy
• Boris Stoyanov

Page 4: When the Cloud is a Rockin: High Availability in Apache CloudStack


Motivation

Currently [sic] KVM HA works by monitoring an NFS based heartbeat file and it can often fail whenever this network share becomes slower, causing the hypervisors to reboot. … This is embarrassing. How can we fix it? Ideas, suggestions? How are other hypervisors doing it?

- Nux, 15 October 2015

CLOUDSTACK-8943

Page 5: When the Cloud is a Rockin: High Availability in Apache CloudStack


Limitations/Issues

• Limited to hosts and VMs using NFS storage
• Tight coupling between the Agent and HighAvailabilityManager
• False positives which interrupt the operation of healthy resources

Inconsistent behavior prevents operators from trusting KVM HA

Page 6: When the Cloud is a Rockin: High Availability in Apache CloudStack


Build vs. Buy

Pros
• Integration with the CloudStack control plane and abstractions
• Simpler configuration
• Integrated instrumentation and logging

Cons
• Complex mechanism to implement, test, and maintain
• Forgoing a proven, battle-tested implementation
• Less functionality initially

A robust infrastructure control plane must include the ability to recover and fence resources

Page 7: When the Cloud is a Rockin: High Availability in Apache CloudStack


HA Resource Management Service

HA Resource Management Service
• Manages per-resource FSM
• Persistence
• Concurrency/Back Pressure
• Common Business Logic

HA Provider (Plugin)
• Resource-specific Business Logic

Resource

Page 8: When the Cloud is a Rockin: High Availability in Apache CloudStack


Goals

• Loose coupling between resources and HA
• Consolidate orthogonal HA concerns
• Prove the correct operation of the HA Resource Management Service and HA Providers independently
• Leverage CloudStack abstractions
• Develop a model for architectural evolution

To create a trustworthy system, operational correctness must be the prevailing priority

Page 9: When the Cloud is a Rockin: High Availability in Apache CloudStack


Terms and Concepts

• Health Check: An idempotent check of a resource to directly verify its proper operation
• Activity Check: An idempotent check to observe the side-effects of a resource’s proper operation
• Eligibility: An idempotent determination of a resource’s eligibility for HA management
• Recovery: Take potentially destructive actions to bring a resource back to a healthy state
• Fence: Take potentially destructive actions to prevent an unrecoverable resource from impacting the health of its peers

Page 10: When the Cloud is a Rockin: High Availability in Apache CloudStack


States

• DISABLED: HA operations have been disabled for the resource itself or for the partition that contains it.
• INITIALIZING: The initial health and eligibility of the resource for HA management is currently being determined.
• AVAILABLE: The resource passed its most recent health check and its containing partition has an HA state of ACTIVE.
• INELIGIBLE: The resource's enclosing partition has an HA state of ACTIVE but its current state does not support HA check and/or recovery operations.
• SUSPECT: The resource is pending an activity check because it failed its most recent health check.
• CHECKING: An activity check is currently being performed on the resource.
• RECOVERING: Recovery operations are in progress to bring the resource back to a healthy state.
• DEGRADED: The resource cannot be managed by the control plane but passed its most recent activity check, indicating that the resource is still servicing end-user requests.
• FENCED: The resource is not operating normally and automated attempts to recover it failed. Manual operator intervention is required to recover the resource.

Page 11: When the Cloud is a Rockin: High Availability in Apache CloudStack


State Model
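The state diagram for this slide is not part of the transcript. As a rough illustration only, the sketch below encodes a handful of transitions that can be inferred from the state descriptions (for example, a failed health check moving a resource from AVAILABLE to SUSPECT). The class name `HaStateSketch`, the `canTransition` helper, and the transition table itself are assumptions made for this example, not the model presented in the talk.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: HA states plus a few transitions inferred from their descriptions. */
final class HaStateSketch {

    /** State names as listed on the States slide. */
    enum HaState {
        DISABLED, INITIALIZING, AVAILABLE, INELIGIBLE, SUSPECT,
        CHECKING, RECOVERING, DEGRADED, FENCED
    }

    // Assumed transition table; states without an entry have no outgoing edges here.
    private static final Map<HaState, Set<HaState>> ALLOWED = new EnumMap<>(HaState.class);

    static {
        // Initial health and eligibility have been determined.
        ALLOWED.put(HaState.INITIALIZING, EnumSet.of(HaState.AVAILABLE, HaState.INELIGIBLE));
        // The most recent health check failed.
        ALLOWED.put(HaState.AVAILABLE, EnumSet.of(HaState.SUSPECT));
        // The pending activity check starts.
        ALLOWED.put(HaState.SUSPECT, EnumSet.of(HaState.CHECKING));
        // Activity seen: still serving users. No activity: attempt recovery.
        ALLOWED.put(HaState.CHECKING, EnumSet.of(HaState.DEGRADED, HaState.RECOVERING));
        // Recovery succeeded (assumed to return to AVAILABLE) or failed.
        ALLOWED.put(HaState.RECOVERING, EnumSet.of(HaState.AVAILABLE, HaState.FENCED));
    }

    /** Returns true if the assumed table permits moving from one state to the other. */
    static boolean canTransition(HaState from, HaState to) {
        return ALLOWED.getOrDefault(from, EnumSet.noneOf(HaState.class)).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(HaState.AVAILABLE, HaState.SUSPECT));
        System.out.println(canTransition(HaState.AVAILABLE, HaState.FENCED));
    }
}
```

Keeping the table as data makes it easy to reject a jump straight from AVAILABLE to FENCED, which is the kind of false positive the redesign aims to avoid.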

Page 12: When the Cloud is a Rockin: High Availability in Apache CloudStack


HA Provider Interface

public interface HAProvider<R> extends Adapter {
    // Kind of resource this provider manages
    ResourceType resourceType();
    ResourceSubType resourceSubType();

    // Idempotent checks
    boolean isEligible(R r);
    boolean isHealthy(R r) throws HACheckerException;
    boolean hasActivity(R r) throws HACheckerException;

    // Potentially destructive actions
    boolean recover(R r) throws HARecoveryException;
    boolean fence(R r) throws HAFenceException;
}
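To make the division of labor concrete, here is a minimal, self-contained sketch of how a caller might drive a provider: idempotent checks first, potentially destructive actions only when those checks fail. `SimpleHaProvider` is a hypothetical, trimmed stand-in for `HAProvider` (no `Adapter`, resource types, or checked exceptions), and `Outcome`, `FakeHost`, `FAKE_PROVIDER`, and `evaluate` are likewise invented for this example. The ordering is an assumption based on the terms and states above, not the service's actual logic.

```java
/** Illustrative only: SimpleHaProvider is a hypothetical, trimmed stand-in for HAProvider. */
final class HaProviderSketch {

    /** Stand-in interface: drops Adapter, the resource type methods, and checked exceptions. */
    interface SimpleHaProvider<R> {
        boolean isEligible(R r);
        boolean isHealthy(R r);
        boolean hasActivity(R r);
        boolean recover(R r);
        boolean fence(R r);
    }

    /** Hypothetical result of one evaluation pass. */
    enum Outcome { INELIGIBLE, HEALTHY, DEGRADED, RECOVERED, FENCED, FENCE_FAILED }

    /** Assumed ordering: idempotent checks first, potentially destructive actions last. */
    static <R> Outcome evaluate(SimpleHaProvider<R> provider, R resource) {
        if (!provider.isEligible(resource)) {
            return Outcome.INELIGIBLE;
        }
        if (provider.isHealthy(resource)) {
            return Outcome.HEALTHY;
        }
        // Health check failed: look for side effects of normal operation before acting.
        if (provider.hasActivity(resource)) {
            return Outcome.DEGRADED;
        }
        if (provider.recover(resource)) {
            return Outcome.RECOVERED;
        }
        return provider.fence(resource) ? Outcome.FENCED : Outcome.FENCE_FAILED;
    }

    /** Fake host whose fields script the answers a real provider would have to probe for. */
    static final class FakeHost {
        final boolean agentResponds;
        final boolean storageActivity;
        final boolean recoverable;

        FakeHost(boolean agentResponds, boolean storageActivity, boolean recoverable) {
            this.agentResponds = agentResponds;
            this.storageActivity = storageActivity;
            this.recoverable = recoverable;
        }
    }

    /** Provider that simply reads the scripted answers from the fake host. */
    static final SimpleHaProvider<FakeHost> FAKE_PROVIDER = new SimpleHaProvider<FakeHost>() {
        @Override public boolean isEligible(FakeHost h) { return true; }
        @Override public boolean isHealthy(FakeHost h) { return h.agentResponds; }
        @Override public boolean hasActivity(FakeHost h) { return h.storageActivity; }
        @Override public boolean recover(FakeHost h) { return h.recoverable; }
        @Override public boolean fence(FakeHost h) { return true; }
    };

    public static void main(String[] args) {
        System.out.println(evaluate(FAKE_PROVIDER, new FakeHost(true, true, true)));
        System.out.println(evaluate(FAKE_PROVIDER, new FakeHost(false, true, true)));
        System.out.println(evaluate(FAKE_PROVIDER, new FakeHost(false, false, true)));
        System.out.println(evaluate(FAKE_PROVIDER, new FakeHost(false, false, false)));
    }
}
```

Because the service owns the sequencing and the provider only answers resource-specific questions, the two can be tested independently, which is one of the stated goals.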

Page 13: When the Cloud is a Rockin: High Availability in Apache CloudStack


KVM Host HA

KVM Host HA Provider
• Health check: KVM Agent
• Activity check: Storage Processor
• Recover/Fence: Host, using OOBM

Page 14: When the Cloud is a Rockin: High Availability in Apache CloudStack


Concurrency Model

• Producer/consumer model
• Size-bounded work queues
• Time-bounded operations
• Fixed-size thread pools
• Idempotent operations are ephemeral
• Non-idempotent operations are managed through AsyncJobManager using a new time-delayed dispatcher

HA operations cannot overwhelm the control plane
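The first four points can be sketched with standard `java.util.concurrent` classes: a fixed-size pool fed by a bounded queue, with every check capped by a timeout. This is only an illustration of the idea for idempotent checks. `BoundedHaExecutorSketch`, `Result`, and `runCheck` are hypothetical names, and the AsyncJobManager path for non-idempotent operations is not modeled here.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: fixed-size pool, size-bounded queue, time-bounded checks. */
final class BoundedHaExecutorSketch {

    /** REJECTED and TIMED_OUT are kept apart from FAILED so overload is not read as ill health. */
    enum Result { PASSED, FAILED, TIMED_OUT, REJECTED, ERROR }

    private final ThreadPoolExecutor pool;
    private final long timeoutMillis;

    BoundedHaExecutorSketch(int threads, int queueSize, long timeoutMillis) {
        // core == max gives a fixed-size pool; AbortPolicy rejects work once the queue is full.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize), new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs one idempotent check, waiting at most timeoutMillis for its answer. */
    Result runCheck(Callable<Boolean> check) {
        Future<Boolean> future;
        try {
            future = pool.submit(check);
        } catch (RejectedExecutionException e) {
            return Result.REJECTED; // back pressure: the caller can retry later
        }
        try {
            return Boolean.TRUE.equals(future.get(timeoutMillis, TimeUnit.MILLISECONDS))
                    ? Result.PASSED : Result.FAILED;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so the thread is freed
            return Result.TIMED_OUT;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Result.ERROR;
        } catch (ExecutionException e) {
            return Result.ERROR; // the check itself threw
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        BoundedHaExecutorSketch executor = new BoundedHaExecutorSketch(2, 4, 200L);
        try {
            System.out.println(executor.runCheck(() -> true));
            System.out.println(executor.runCheck(() -> {
                Thread.sleep(5_000L); // simulates a hung check
                return true;
            }));
        } finally {
            executor.shutdown();
        }
    }
}
```

Reporting a rejected or timed-out check separately from a failed one matters here: treating a slow check as an unhealthy resource is the false-positive pattern described in the motivation.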

Page 15: When the Cloud is a Rockin: High Availability in Apache CloudStack


Status

• Focused on KVM host HA
• Initial implementation started, validating the design
• Draft specification: functional spec will be published in the next 1-2 weeks
• Robust unit and integration test model to verify both the service and KVM host HA provider
• Delivery of the first version in July 2016 for inclusion in 4.10 (August 2016)

Page 16: When the Cloud is a Rockin: High Availability in Apache CloudStack


What’s Next

• Support Nested HA Resources
• Instrumentation
• Migrate VM HA to the HA Resource Management Service

Page 17: When the Cloud is a Rockin: High Availability in Apache CloudStack


Questions? Comments?

#cloudstackworks