

An Autonomic and Scalable Management System for Private Clouds

Matthieu Simonin∗, Eugen Feller∗, Anne-Cécile Orgerie†, Yvon Jégou∗ and Christine Morin∗
∗INRIA, Myriads team, IRISA, Rennes - France
{matthieu.simonin, eugen.feller, yvon.jegou, christine.morin}@inria.fr
†CNRS, Myriads team, IRISA, Rennes - France
[email protected]

Abstract—Snooze is an open-source, scalable, autonomic, and energy-efficient virtual machine (VM) management framework for private clouds. It allows users to build compute infrastructures from virtualized resources. In particular, once installed and configured, it allows its users to submit and control the life-cycle of a large number of VMs. For scalability, the system relies on a self-organizing hierarchical architecture. Moreover, it implements self-healing mechanisms in case of failure to enable high availability. It also performs energy-efficient distributed VM management through consolidation and power management techniques. This poster focuses on the experimental validation of two main properties of Snooze: scalability and fault-tolerance.

Keywords-autonomic computing; cloud computing; resource management; scalability; fault tolerance

I. INTRODUCTION

The ever-growing appetite of new applications for Cloud resources leads to managing resources at large scale while providing performance isolation and efficient use of the underlying hardware [1]. Resource management is challenging and requires autonomic mechanisms to deal with scalability and fault-tolerance [2]. Autonomic systems are self-managing, which is a highly desired feature for large-scale distributed systems such as Clouds.

In this poster, we present an extended validation of Snooze [3], [4], an autonomic and energy-efficient management system for private clouds. Built on a self-configuring hierarchical architecture, Snooze provides self-healing mechanisms that rebuild the hierarchy on the fly in the event of failures.

II. SNOOZE ARCHITECTURE

Previous work has shown that hierarchical architectures can greatly contribute to system scalability [5], [1]. Following this principle of splitting knowledge among independent managers, Snooze is based on the autonomic hierarchical architecture shown in Figure 1, where servers (i.e., physical nodes) are represented by light gray rectangles. Each computing resource is managed by a Local Controller (LC), which monitors the VMs of the node and enforces the VM life-cycle and node management commands coming from higher levels of the hierarchy. A Group Manager (GM) is in charge of managing a subset of LCs; it integrates energy-management mechanisms such as overload/underload mitigation and VM consolidation policies. A Group Leader (GL) manages the GMs, assigns the LCs to GMs, and deals with clients' VM submission requests.

Figure 1. Snooze architecture

To support self-configuration and self-healing, Snooze integrates bi-directional heartbeat protocols at all levels of the hierarchy. When a system service is started on a node, it is statically configured to become either an LC or a GM. In the example depicted in Figure 1, we chose to have three GMs, but this number is defined by the administrator depending on the desired level of redundancy (for fault-tolerance and performance). When the services boot, the first step in hierarchy construction is the election of a GL among the GMs. After the GL election, the other GMs register with it. For an LC to join the hierarchy, it first needs to discover the current GL and get a GM assigned by contacting the GL. Once it is assigned to a GM, it can register with it. Self-healing is performed at all levels of the hierarchy. It involves the detection of missing heartbeats and the recovery from LC, GM, and GL failures.
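
As a concrete illustration of this join sequence, the following minimal Java sketch shows an LC contacting the GL for a GM assignment, registering with that GM, and starting its periodic heartbeat. All type and method names (GroupLeaderClient, assignGroupManager, and so on) are hypothetical stand-ins, not Snooze's actual API, and the 3-second heartbeat period is an arbitrary example value.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch of the LC join sequence and heartbeat loop described
// above; none of these names are taken from the actual Snooze code base.
public final class LocalControllerJoin {

    interface GroupLeaderClient {
        // Asks the current GL which GM should manage this LC.
        String assignGroupManager(String localControllerId);
    }

    interface GroupManagerClient {
        void register(String localControllerId);
        void heartbeat(String localControllerId);
    }

    public static void join(GroupLeaderClient groupLeader,
                            Function<String, GroupManagerClient> connect,
                            String lcId) {
        // Step 1: contact the current GL to get a GM assigned.
        String gmAddress = groupLeader.assignGroupManager(lcId);

        // Step 2: register with the assigned GM.
        GroupManagerClient gm = connect.apply(gmAddress);
        gm.register(lcId);

        // Step 3: start the periodic heartbeat on which missing-heartbeat
        // failure detection relies; 3 s is an arbitrary example period.
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> gm.heartbeat(lcId),
                0, 3, TimeUnit.SECONDS);
    }
}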

VM and LC resource utilization changes over time. In order to support VM management decisions such as VM dispatching, placement, underload/overload mitigation, and VM consolidation, VM and LC monitoring is performed at all layers of the system. At the computing layer, VMs are monitored and VM resource utilization is periodically transferred by the LCs to their assigned GM. At the management layer, to keep monitoring light and scalable, GMs periodically send summary information to the GL, including the aggregated resource utilization of all the LCs they manage.
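
The following minimal Java sketch illustrates the kind of GM-side aggregation described above; the record fields and units are assumptions chosen for illustration, not Snooze's actual data model.

import java.util.Collection;

// Illustrative sketch of a GM-side monitoring summary: the utilization
// reported by each LC is folded into a single aggregate for the GL.
public final class MonitoringSummary {

    record Utilization(double cpuCores, double memoryMb,
                       double rxKbps, double txKbps) {}

    static Utilization aggregate(Collection<Utilization> lcReports) {
        double cpu = 0, mem = 0, rx = 0, tx = 0;
        for (Utilization u : lcReports) {
            cpu += u.cpuCores();
            mem += u.memoryMb();
            rx  += u.rxKbps();
            tx  += u.txKbps();
        }
        // One compact message per GM, whatever the number of LCs it
        // manages, is what keeps GL-bound monitoring traffic light.
        return new Utilization(cpu, mem, rx, tx);
    }
}

Because each GM forwards a single aggregate of this kind, GL-bound monitoring traffic grows with the number of GMs rather than with the number of LCs or VMs, which is the behavior measured in Section III.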

Snooze is written in Java and currently comprises 15,000 lines of highly modular code. It is distributed as open source under the GPL v2 license [6]. It relies on the libvirt library, which provides a uniform interface to most modern hypervisors [7].
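
As an illustration of this dependency, the sketch below queries a running VM through the org.libvirt Java bindings, which is the kind of per-VM probe an LC relies on. The connection URI and the VM name "vm1" are placeholders; the poster does not show the exact calls Snooze makes.

import org.libvirt.Connect;
import org.libvirt.Domain;
import org.libvirt.DomainInfo;
import org.libvirt.LibvirtException;

// Minimal sketch of probing one VM via libvirt's Java bindings.
// URI and VM name are illustrative placeholders.
public final class LibvirtProbe {
    public static void main(String[] args) throws LibvirtException {
        Connect conn = new Connect("qemu:///system"); // local hypervisor
        Domain domain = conn.domainLookupByName("vm1");
        DomainInfo info = domain.getInfo();
        System.out.printf("state=%s vcpus=%d memKiB=%d cpuTimeNs=%d%n",
                info.state, info.nrVirtCpu, info.memory, info.cpuTime);
        conn.close();
    }
}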

III. EXPERIMENTAL EVALUATION

Snooze has been validated on Grid'5000, the French experimental testbed supporting experiment-driven research in all areas of computer science related to parallel, large-scale, or distributed computing [8]. Grid'5000 comprises 6,500 cores geographically distributed over 10 sites linked by a dedicated gigabit network. The platform supports in-advance reservations, and users are able to deploy their own environment (i.e., operating system) on the reserved nodes.

In contrast to [4], which only showed the impact of Snooze's distributed VM management, the experiments in this poster evaluate the scalability and fault tolerance of the Snooze hierarchy. First, to test scalability in the number of hosted VMs, we launched an experiment with 1,000 VMs. For this experiment, we used 126 physical nodes from 4 different Grid'5000 sites: 1 GL, 8 GMs, and 117 LCs (hosting the 1,000 VMs). It took about 15 minutes to deploy the nodes in Grid'5000 and completely build the hierarchy from scratch: deployment of a Debian environment on the managing nodes, NFS and SSH configuration, installation of the Snooze code, and self-configuration of the hierarchy (GL election, GM allocation for each LC). It then took about 15 minutes to run the 1,000 VMs: dispatching of the VMs onto the LCs, deployment of the VM images, and starting of the 1,000 VMs.

We then performed a second experiment to test the scalability of the hierarchy itself. We progressively launched 50 GMs (one by one) and measured the network traffic incurred on the GL by the self-configuration mechanisms (GL election, and heartbeat mechanism initialization among the GMs) and the self-healing mechanisms (heartbeat messages). Figure 2 shows these measurements. As we can see, the amount of traffic is proportional to the number of GMs but remains light (about 25 kB/s for 50 GMs, i.e., roughly 0.5 kB/s per GM) because GMs only send aggregated data for all their LCs to the GL.

Similar experiments were performed to evaluate the scalability of the other levels of the hierarchy (for example, one GM with 350 LCs). We also switched off nodes to cause failures in the hierarchy and measured the time to rebuild it depending on the failure case (crash of a GM, the GL, or an LC, or crash of several of them). In all cases (except, of course, when all the hierarchy nodes are killed), Snooze was able to rebuild the hierarchy in less than one minute, even for large-scale experiments. These experiments are not described here in detail due to lack of space.

Figure 2. Network traffic caused on the GL by the progressive start of 50 Group Managers

IV. CONCLUSION

Snooze [3], [4] is a virtual machine management framework based on a hierarchical approach. It is open source [6] and can thus be used by any research team dealing with Cloud computing, energy efficiency, and autonomic resource management in large-scale distributed systems. This poster presents an experimental validation of two key properties required by Cloud resource managers: scalability and fault-tolerance.

REFERENCES

[1] A. Gulati, G. Shanmuganathan, A. Holler, and I. Ahmad, “Cloud-scale resource management: challenges and techniques,” in USENIX Conference on Hot Topics in Cloud Computing (HotCloud), 2011.

[2] R. Buyya, R. Calheiros, and X. Li, “Autonomic Cloud Computing: Open Challenges and Architectural Elements,” in International Conference on Emerging Applications of Information Technology (EAIT), 2012, pp. 3–10.

[3] E. Feller, C. Rohr, D. Margery, and C. Morin, “Energy Management in IaaS Clouds: A Holistic Approach,” in CLOUD: IEEE International Conference on Cloud Computing, 2012.

[4] E. Feller, L. Rilling, and C. Morin, “Snooze: A Scalable and Autonomic Virtual Machine Management Framework for Private Clouds,” in CCGrid: IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 2012.

[5] S. Perera and D. Gannon, “Enforcing user-defined management logic in large scale systems,” in IEEE Congress on Services, 2009, pp. 243–250.

[6] Snooze, http://snooze.inria.fr, GPL v2 license, 2012.

[7] Red Hat, “libvirt: The virtualization API,” http://libvirt.org/, 2012.

[8] R. Bolze et al., “Grid’5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed,” International Journal of High Performance Computing Applications, vol. 20, no. 4, pp. 481–494, 2006.
