DISCO: Unified Provisioning of Distributed Computing Platforms in the Cloud

P. Harsh 1 and T. M. Bohnert 1
1 InIT, Zurich University of Applied Sciences (ZHAW), Winterthur, Zurich, Switzerland

Abstract: Big data is ubiquitous. The recent modus operandi has been to collect all possible data first and process it later. Advances in distributed systems, fault tolerance, and resiliency, together with the computational shift to cloud computing, have enabled cheap storage and fast processing of huge amounts of data. Cluster computing has existed for decades; clouds add the benefits of on-demand provisioning, pay-as-you-go pricing, multi-tenancy, and rapid elasticity to cluster/grid computing. There is an explosion of high-quality open source distributed processing platforms (Spark, Hadoop, Storm, etc.). Furthermore, there is a plethora of add-ons for these platforms that address the specialized data processing needs of the scientific and business communities. A unified, semi-automated provisioning mechanism is necessary to reduce the complexity of using such platforms. In this paper we present our architecture and rationale for such a provisioning system, which takes into account the data processing requirements of the researcher and provides a unified experience for on-demand creation of a specialized data processing platform over IaaS clouds. We further show how our solution incorporates all major tenets of the cloud.

Keywords: cloud computing, provisioning, big data

1. Introduction

Big data [1] is defined as the increase in the volume, variety, and velocity of data to such an extent that it becomes increasingly difficult to process with common statistical methods and to analyze through traditional databases. The world is becoming increasingly connected, with large technology firms making efforts to bring basic connectivity to every nook and corner of the planet [2] [3]. With cyber-physical systems [4], the Internet of Things (IoT), connected cars [5], and home automation, billions of devices will be connected to the Internet. There will be an explosion of data, which will present huge opportunities as well as challenges to the entities hoping to make use of it. These big-data sources will present challenges of collection, storage, privacy, security, etc., but despite these they will prove invaluable for future smart city planning, better targeted advertisements, timely health care regimens, and more. The potential of such a scenario is limitless.

Distributed systems and processes, in particular cloud computing, have provided a reasonable technology platform for the storage and processing of big data. The scientific community has traditionally relied on specialized supercomputers and huge government-funded data clusters and grids for processing massive data sets. In addition to scientific datasets, modern-day consumers are constantly generating big data through their social networking activities and online behavior. Naturally, businesses are increasingly interested in using the power of data for better marketing of their services. Clouds provide businesses with a cheap alternative (to scientific grids) for processing big data.

The last few years have seen an explosion of mature open source toolkits and platforms that allow the scientific and business communities to perform distributed computing efficiently. There are numerous plugins [6] available for every popular platform, supporting a variety of use cases. Every platform comes with its own challenges, configuration optimizations, and provisioning specificities, and additionally there is the challenge of making the platform operate seamlessly in a cloud environment.

Our work focuses on reducing the challenges of orchestrating and provisioning a desired distributed computing solution in the cloud from an end-user perspective. We outline the architecture of our distributed computing provisioning framework, which allows the user to specify the nature of the processing task, capacity requirements, and other relevant parameters; the provisioning system then takes over, creates the cluster computing environment over the cloud, and returns the necessary endpoints to the researcher / scientist for data uploads, job submissions, and result collection. We also show how the platform can hook into cloud billing solutions so that metered consumption can be facilitated.

The rest of the paper is organized as follows: candidate data processing platforms and tools are analyzed in Section 2, the architecture of our framework, DISCO, is presented in Section 3, and a work-flow analysis of DISCO is carried out in Section 4. We present our analysis of a few selected related projects in Section 5, and then conclude with a summary and plans for the future course of our work.

2. Popular Platforms

The distributed computing world has used the map-reduce programming paradigm for decades, but it was brought into the mainstream by the Google MapReduce [7]


paper, which first came out in 2004. Since then, rapid progress was made in the maturity of the paradigm after Yahoo developed Hadoop to optimize its search engine processes, which in part were also inspired by the Google File System [8] work^1. Hadoop has gained popularity ever since, with an impressive ecosystem of tools and enhancements around it. In recent times the focus of data analysis has broadened, to include not only batch processing tasks over stored data objects, but also non-map-reduce jobs and streaming data processing for more time-sensitive (real-time) analytics. Stream processing is gaining significance as businesses want more real-time trends from daily or even hourly data. Yahoo! uses S4 [9] for analyzing users' query submissions and click-through rates, and Twitter uses Storm for real-time classification and organization of tweets [10]. In some incident management systems, streaming data processing forms the core of the offering. Although there are numerous plugins available that allow most of the popular data processing platforms to support all kinds of computation (batch, stream, graph, query, etc.), we will initially look into the following candidate platforms for support within our distributed computing provisioning framework, DISCO. The DISCO framework will support the full Berkeley Data Analytics Stack (BDAS)^2 in later releases.

^1 Part of the Hadoop history is based on the Gigaom article https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/ [retrieved: 11.03.2015]
^2 BDAS: https://amplab.cs.berkeley.edu/software/ [accessed: 18-03-2015]

2.1 Apache Hadoop

Apache Hadoop is a mature open source map-reduce platform with a vibrant ecosystem of tools around it to support a myriad of data processing needs. The Hadoop map-reduce ecosystem consists of projects such as Pig, Mahout (for machine learning), Hive, etc. From version 2 onwards the Hadoop distribution uses YARN [11] as its cluster manager, which is a monolithic (single-stage) scheduler; in previous versions, application and resource scheduling were built into the Hadoop core. Datasets are hosted in a distributed file system (HDFS) [12] that ensures recoverability in the face of node failures. Due to the separation of the scheduler logic from version 2 onwards, Apache Hadoop supports not only map-reduce tasks but also other tasks, such as matrix, graph, and machine learning computations, in a more efficient manner. In Hadoop, when a client submits a task, the YARN resource manager schedules and manages it.
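To make the map-reduce programming model concrete, the following is a minimal word-count sketch written in the mapper/reducer style used by Hadoop Streaming (which pipes data through stdin/stdout); the sample input and the in-process "shuffle" via sorting are illustrative assumptions and not part of DISCO or the Hadoop distribution itself.

    # Minimal word count in the map-reduce style. In a real Hadoop Streaming
    # job, mapper() and reducer() would live in two separate executables that
    # Hadoop pipes data through via stdin/stdout.

    def mapper(stream):
        """Emit (word, 1) for every word in the input stream."""
        for line in stream:
            for word in line.strip().split():
                yield word, 1

    def reducer(pairs):
        """Sum counts per word; Hadoop delivers pairs grouped/sorted by key."""
        current, total = None, 0
        for word, count in sorted(pairs):   # sorting stands in for the shuffle
            if word != current and current is not None:
                yield current, total
                total = 0
            current = word
            total += count
        if current is not None:
            yield current, total

    if __name__ == "__main__":
        text = ["the quick brown fox", "the lazy dog"]
        for word, count in reducer(mapper(text)):
            print(word, count)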

2.2 Apache Spark

Apache Spark [13] is a distributed data processing platform from UC Berkeley's AMPLab^3 and is the key piece of the Berkeley Data Analytics Stack (BDAS) [14]. Spark runs over a distributed memory-centric storage system called Tachyon [15], but also supports HDFS, Amazon S3, and GlusterFS. The Spark execution engine supports not only map-reduce batch tasks but also stream processing, graph computation, machine learning, query processing, etc. Being largely based on in-memory processing, Spark achieves manifold speedups compared to Apache Hadoop, especially for repetitive tasks, through intelligent use of in-memory caches. Datasets are distributed as read-only resilient distributed datasets (RDDs). Spark runs over Mesos [16], which provides two-level scheduling and enables a variety of applications to manage in-application task scheduling in a fine-grained manner.

^3 AMPLab: https://amplab.cs.berkeley.edu/
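As a hedged illustration of the RDD abstraction and in-memory caching described above (this is not code from the DISCO prototype; the input path is a placeholder), a PySpark word count might look as follows.

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-sketch")

    # Build an RDD from a text file; the path is a placeholder.
    lines = sc.textFile("hdfs:///data/sample.txt")

    # Classic word count expressed as RDD transformations.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # cache() keeps the RDD in memory, so repeated actions
    # (typical of iterative workloads) avoid recomputation.
    counts.cache()
    print(counts.count())     # number of distinct words
    print(counts.take(5))     # a few (word, count) pairs

    sc.stop()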

2.3 Apache Storm

Apache Storm^4 is an open source, real-time data processing platform originally released by Twitter after acquiring BackType [17]. It allows processing of infinite streams of data using a tuple-at-a-time computational model. It uses Nimbus as the master node, Apache Zookeeper^5 for cluster coordination, and a number of Supervisors, which are the worker nodes. A Storm job topology consists of Spouts and Bolts arranged in a directed acyclic graph (DAG). Spouts are simply sources of streaming data, and Bolts process small batches of tuples from the stream. The Storm platform guarantees data processing through the network of spouts and bolts, and supports at-least-once delivery and transactional topologies [18].

^4 Apache Storm: https://storm.apache.org/
^5 Apache Zookeeper: https://zookeeper.apache.org/
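The tuple-at-a-time model can be pictured with a small, plain-Python sketch. Note that this is a conceptual illustration only; Storm's actual topology API is Java-based and is not reproduced here, and the sample sentences and wiring are invented for the example.

    import random
    import time

    def sentence_spout():
        """A spout: an unbounded source that emits one tuple at a time."""
        samples = ["the cow jumped over the moon", "an apple a day"]
        while True:
            yield random.choice(samples)
            time.sleep(0.1)

    def split_bolt(sentence):
        """A bolt: transforms one incoming tuple into zero or more tuples."""
        for word in sentence.split():
            yield word

    def count_bolt(counts, word):
        """A terminal bolt: keeps running word counts as tuples stream in."""
        counts[word] = counts.get(word, 0) + 1

    # Wiring the DAG by hand: spout -> split bolt -> count bolt.
    counts = {}
    for i, sentence in enumerate(sentence_spout()):
        for word in split_bolt(sentence):
            count_bolt(counts, word)
        if i == 20:    # stop the otherwise infinite stream for the demo
            break
    print(counts)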

3. DISCO Architecture

The DISCO architecture is influenced by the requirements of properly deploying and managing the candidate technologies listed in the previous section, based on the user requirements. The high-level architecture is shown in Figure 1. It shows the user-facing elements, the core platform components, and, for completeness, the relevant external cloud components. The architecture provides an integration point towards the cloud rating-charging-billing platform Cyclops [19] [20] [21]. It supports both a web-based interface (DISCO UI) and a command line interface (DISCO CLI) towards end users. These communicate with the DISCO provisioning platform via RESTful [22] HTTP calls; an illustrative request is sketched below. The REST API Server is the only publicly exposed service. Every REST call is properly authenticated and authorized, although the authentication and authorization services are not shown in the architecture for brevity^6.

^6 The OAuth protocol can be used to secure the internal services that make up the core of the DISCO platform. Again, for brevity, such inter-service interactions have been left out.
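As a purely hypothetical illustration of such a RESTful provisioning call (the endpoint path, payload fields, and token handling are invented for this sketch; the paper does not define the REST API), a client could submit its platform requirements as JSON.

    import requests  # third-party HTTP client

    # Hypothetical DISCO endpoint and request payload.
    DISCO_API = "https://disco.example.org/api/v1"

    request_body = {
        "computation": "batch",          # batch | stream | graph | ...
        "data_source": "hdfs",
        "data_size_gb": 500,
        "max_nodes": 16,
        "billing_account": "project-42",
    }

    resp = requests.post(
        f"{DISCO_API}/clusters",
        headers={"X-Auth-Token": "<token>"},
        json=request_body,
        timeout=30,
    )
    resp.raise_for_status()

    # The provisioning system would eventually return cluster endpoints
    # (IPs, job submission URL, data upload URL) for the created VCP.
    print(resp.json())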


Fig. 1: DISCO Architecture

3.1 Core Components - Essential

The core of the platform is made up of the following components:

Requirement Analysis Module
    As the name suggests, the task of this module is to analyze the end user's platform requirements, type of computation, resource requirements, nature of the data source, etc., to determine which of the supported distributed computing technologies is best suited for their needs. The Requirement Analysis module provides the technology selection, along with other parameters (capacity hints, placement requirements, etc.), to the Deployment Planner module. A minimal sketch of such a selection mapping is given at the end of this subsection.

Deployment Planner Module
    The Deployment Planner creates the proper cloud deployment strategy. It uses the technology suggestion together with the resource requirements to query the Platform Template Store, the Curated VM Image Index, and the Configuration Store, and generates the deployment template for the Orchestrator. This module also initializes the baseline configuration of the soon-to-be-deployed distributed compute cluster in the Configuration Management Server.

Orchestrator
    The Orchestrator is responsible for the deployment of the selected distributed computing platform over the cloud. It uses the Cloud Driver to interface with the Infrastructure (IaaS) Cloud Manager (popular open source choices being OpenStack^7, CloudStack^8, OpenNebula^9, etc.) and sends the correct sequence of commands to create the collection of virtual machines (VMs), in a predefined order, with the correct software packages and configurations to make the data-processing service usable by the requesting person. Once the VMs are properly deployed, it returns the deployed virtual compute platform (VCP) entry points (IPs and other access details) to the calling environment; this way the access details are eventually passed on to the requesting user.

Configuration Management Server
    The Configuration Management Server keeps the current and past (check-pointed) configuration values for every VCP currently provisioned and maintained within the DISCO environment. The VMs are pre-configured to periodically contact this service and fetch any updates to the configuration parameters. The rationale behind this is runtime configuration optimization for long-running services. The updated configuration can be applied in the VMs if there are no active jobs within the VCP, subject to an acceptable configuration management policy agreed upon or specified by the user^10.

Configuration Store
    The baseline configurations for the various distributed data processing technologies are stored in this database.

Curated VM Image Index
    The DISCO platform keeps track of the prepared VM images for each of the supported technologies in this index. This comes in handy during the deployment planning stage. The entries depend greatly on the underlying IaaS cloud platform.

Platform Template Store
    This is a collection of deployment templates for each of the supported data processing platforms. Depending on the user requirements, a template is filled with the missing details before being sent to the Orchestrator module.

Platform Catalog Store
    This store keeps track of the state of the currently active virtual compute clusters managed through the DISCO framework. It stores the access details, relevant endpoints, health parameters, incidents / events, etc.

^7 OpenStack: http://www.openstack.org/
^8 Apache CloudStack: http://cloudstack.apache.org/
^9 OpenNebula: http://opennebula.org/
^10 This functionality is very similar to features offered by popular configuration management tools such as Puppet, Chef, Ansible, etc.


These are the bare minimum modules required to implement a usable provisioning and management framework for various distributed compute solutions.
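A minimal sketch of the requirement-to-platform mapping performed by the Requirement Analysis module follows; the rules, thresholds, and field names below are invented for illustration, since the paper does not specify the actual selection algorithm.

    from dataclasses import dataclass

    @dataclass
    class Requirements:
        computation: str       # "batch", "stream", "iterative", ...
        data_size_gb: int
        latency_sensitive: bool

    def select_platform(req: Requirements) -> str:
        """Toy decision logic mapping user requirements to a platform."""
        if req.computation == "stream" or req.latency_sensitive:
            return "storm"
        if req.computation == "iterative":
            return "spark"     # in-memory caching suits repeated passes
        return "hadoop"        # default for large batch workloads

    print(select_platform(Requirements("stream", 100, True)))    # storm
    print(select_platform(Requirements("batch", 5000, False)))   # hadoop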

3.2 Core Components - Optional

The core of the platform also contains the following (optional) value-addition components:

Platform Adapters / Drivers
    These driver programs enable the DISCO framework to periodically query the VCPs, which for the initial prototype would be Hadoop, Spark, or Storm clusters. Using these drivers, the provisioning system can track cluster statistics, job status, etc., and can enable the REST API server to fetch on-demand status and statistics to be returned to the user. This would enable us to design and implement a more functional CLI and web UI for end users. A sketch of such an adapter interface is given after this list.

Configuration Optimization
    The Configuration Optimization service periodically checks log server entries from the various VCPs and analyzes them; if errors or warnings are found, it looks into the applied configuration parameters and into past check-pointed configurations (if available). This module consults a knowledge-base (not shown in the figure) to locate safe and optimal configuration parameters from previous runs. If none are found, the problem is propagated to the operators for manual intervention. The idea behind this module borrows heavily from the Incident Management [23] [24] domain and aligns with the widely accepted MAPE-K [25] loop reference model. The Change Planner plans the correct order of the configuration updates to be applied among the various VMs, and the services in them, which comprise the concerned VCP. The Actuator simply updates the Configuration Management service, which in turn checkpoints the previous values and prepares them for storage in the knowledge-base.
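A hedged sketch of what a Platform Adapter interface could look like is given below; the class and method names are invented for illustration and are not taken from the DISCO implementation.

    from abc import ABC, abstractmethod

    class PlatformAdapter(ABC):
        """Common interface the DISCO core could use to query any VCP."""

        def __init__(self, endpoint: str):
            self.endpoint = endpoint

        @abstractmethod
        def cluster_stats(self) -> dict:
            """Return node counts, resource usage, health flags, etc."""

        @abstractmethod
        def job_status(self, job_id: str) -> str:
            """Return the state of a submitted job (RUNNING, DONE, FAILED, ...)."""

    class HadoopAdapter(PlatformAdapter):
        def cluster_stats(self) -> dict:
            # A real adapter would query the cluster manager's API here.
            return {"nodes": 0, "apps_running": 0}

        def job_status(self, job_id: str) -> str:
            return "UNKNOWN"

    adapter = HadoopAdapter("http://rm.example.org:8088")
    print(adapter.cluster_stats())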

3.3 External Components

For the sake of completeness, we briefly describe here the external components / services on which the DISCO framework has either required or optional dependencies.

Log Server
    All the VMs are pre-configured to send logs to a remote log server in addition to storing them locally in the VM's scratch space. Remote logging can be enabled on request if the user wishes to enable run-time optimizations through the Configuration Optimization module in the DISCO core.

Cloud Manager
    The Cloud Manager is the external component that not only enables life-cycle management of the VMs running in the cloud, but also the creation of the IaaS cloud from the physical servers in the data-center. In our prototypical implementation this is OpenStack; other popular cloud managers are Eucalyptus, CloudStack, OpenNebula, etc. DISCO interacts with various IaaS clouds through the appropriate Cloud Driver, which enables the Orchestrator to communicate with the cloud managers through their APIs.

VM Image Store
    All customized VM images that allow DISCO to provision and manage the requested (supported) distributed data-processing platforms are kept in the VM Image Store. Most cloud management software, including OpenStack, comes with a preferred image store; in our case this is OpenStack Glance^11.

^11 OpenStack Glance: http://glance.openstack.org/

NFS Service
    Sometimes it becomes necessary to let end users easily upload files for processing in the VCP. In some cases (especially Apache Spark), the files must reside at the same path on all the worker nodes in the (virtual) cluster. This can easily be achieved if all VMs mount a common remote network file system (NFS) [26] share at the same absolute path in their local file systems.

Cyclops
    Cyclops [19] [20] is an open source framework that enables custom rating-charging-billing solutions for cloud based services. It exposes a REST API through which external (non-IaaS) services can send in metered data. Using a combination or variety of metered data, various billing strategies can be implemented. DISCO will integrate with Cyclops and send any non-standard, framework-specific metered data into it. This enables the measured, pay-as-you-go service utilization cloud principle.
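As a hedged sketch of such an integration, the snippet below posts a framework-specific usage record to Cyclops; the collector URL and the record fields are assumptions made for this illustration, and the actual Cyclops API should be consulted for the real message format.

    import time
    import requests  # third-party HTTP client

    # Assumed Cyclops collector endpoint; the real path may differ.
    CYCLOPS_URL = "http://cyclops.example.org:4567/usage"

    usage_record = {
        "resource": "disco-vcp-42",          # which virtual compute platform
        "metric": "spark_job_core_hours",    # framework-specific meter
        "value": 12.5,
        "unit": "core-hours",
        "timestamp": int(time.time()),
    }

    resp = requests.post(CYCLOPS_URL, json=usage_record, timeout=10)
    print(resp.status_code)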

4. Workflows & Platform Analysis

The DISCO framework's unique features are:
• unified provisioning
• runtime configuration optimization
• virtual cluster lifecycle management and enabling measured service

The sequence diagrams and algorithms shown here analyze how these features are operationalized within the architectural framework described earlier.

4.1 Unified Provisioning

The sequence diagram in Figure 2 presents a highly simplified work-flow for the unified provisioning of


any selected distributed computing platform on the target IaaS cloud. It shows the basic steps from the user filling in a request form / questionnaire in the web UI and clicking submit, until the selected compute cluster is provisioned over the cloud. Upon successful provisioning, the access details for the cluster are returned to the user. Various other usual steps, such as data preparation and on-boarding, authentication and authorization, etc., are not shown for simplicity.

Fig. 2: Unified Provisioning Work-flow

4.2 Runtime Optimization

The DISCO platform periodically looks for log entries from all active virtual compute platforms (VCPs) and attempts to identify suboptimal states from them. The runtime optimization component searches the knowledge-base for historical occurrences of similar incidents and prepares a new configuration to help the VCP reach an efficient state (if possible). If a discovered incident cannot be resolved by the module independently, an alert is raised to the operator to bring a human into the loop. A new entry is then created by the operator in the knowledge-base so that similar incidents can be handled automatically in the future. The pseudo-code for a possible platform runtime optimization approach is shown in Algorithm 1.

Upon receipt of the alert, the sysadmin would manually intervene, investigate the incident, and apply the fix to the virtual cluster. In this process the sysadmin updates the resolution notes in the placeholder knowledge-base entry, enabling similar incidents to be handled automatically in the future.

Data: Log entries from the log server
Result: Optimal VCP entity configuration to resolve any identified adverse incident / event

initialization;
filter log messages to type WARN or ERROR from log files;
foreach entry in WARN or ERROR list do
    identify the VCP id for this log;
    collect all WARN or ERROR level messages for this VCP instance;
    look in the knowledge-base for a possible fix;
    if fix located then
        create a patch configuration based on knowledge-base entries / notes;
        create the patch application plan based on the number of VMs in the VCP;
        write the patched configuration and application plan into the configuration management server;
        update the knowledge-base;
    else
        // no knowledge-base entry was located
        create a ticket for the system admin;
        create a placeholder knowledge-base entry;
        send an alert with the ticket-id to the administrator;
        // unresolved events are tracked inside the platform catalog
        create a pending resolution entry for this event in the platform catalog store;
        // duplicate entries (if they exist) are merged
    end
end

Algorithm 1: Runtime Optimization Process
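A minimal Python sketch of this loop follows; it mirrors Algorithm 1 rather than DISCO's actual implementation, and the data structures (log entry dictionaries, a dictionary knowledge-base keyed by an error signature) are simplifying assumptions made for the illustration.

    def optimize_runtime(log_entries, knowledge_base):
        """One pass of the MAPE-K style loop from Algorithm 1.

        log_entries: list of dicts like {"vcp": "vcp-1", "level": "ERROR", "msg": "..."}
        knowledge_base: dict mapping an error signature to a known-good config patch
        Returns a list of actions (config patches to apply or tickets to raise).
        """
        actions = []

        # Filter to WARN/ERROR and group messages by the VCP they came from.
        by_vcp = {}
        for entry in log_entries:
            if entry["level"] in ("WARN", "ERROR"):
                by_vcp.setdefault(entry["vcp"], []).append(entry["msg"])

        for vcp, messages in by_vcp.items():
            signature = messages[0]            # naive: key on the first message
            patch = knowledge_base.get(signature)
            if patch is not None:
                # Known incident: plan a configuration update for this VCP.
                actions.append({"type": "apply_patch", "vcp": vcp, "patch": patch})
            else:
                # Unknown incident: raise a ticket and track it as unresolved.
                actions.append({"type": "raise_ticket", "vcp": vcp, "messages": messages})
        return actions

    # Tiny demonstration with made-up data.
    kb = {"heap space exhausted": {"executor_memory": "4g"}}
    logs = [{"vcp": "vcp-1", "level": "ERROR", "msg": "heap space exhausted"},
            {"vcp": "vcp-2", "level": "WARN", "msg": "slow shuffle detected"}]
    print(optimize_runtime(logs, kb))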

4.3 Cluster Life-cycle Management & Cloud Principles Enablement

The DISCO framework allows users to provision a distributed compute cluster through a web interface or a CLI client tool. This enables on-demand, self-service provisioning of any supported platform. The cluster is made up of a number of virtual machines running over an IaaS cloud, so the usage measurement taken at the IaaS level covers the major part of DISCO metering. Integration with Cyclops allows DISCO to send non-IaaS metered data into the charging and billing platform; Cyclops thus enables measured use of the DISCO framework. On-demand elasticity is supported by virtue of the cloud use, and dynamic scaling of the compute cluster is generally supported by all major open source distributed computing platforms.

The compute cluster life-cycle, from cluster deployment, to configuration management, runtime optimizations, and cluster disposal when no longer needed, is handled by the DISCO framework. Most of the processes and work-flows are intuitive given the architectural elements described earlier, and are facilitated by the cluster state and access data maintained in the Platform Catalog Store. The DISCO framework uses simple primitives and work-flows, one example being the way it handles an unresolved incident / event (as reported by the Configuration Optimization module) in the Platform Catalog Store. This is shown in Algorithm 2.


Data: Platform state data from the Platform Catalog Store
Result: Update of the platform state in the Platform Catalog Store

if VCP event flag is set then
    if VCP event is in unresolved state then
        resend alert to the sysadmin;
        // this allows alerts to be sent periodically until the event is resolved
    else
        // the sysadmin has manually intervened and fixed the errors in the VCP,
        // after which he has likely updated the knowledge-base with resolution
        // notes for this class of events; the manual update of the knowledge-base
        // already marks the event state as resolved
        reset the VCP event flag;
        if configurations in the Configuration Management Server are not updated then
            update the VCP configurations in the Configuration Management Server;
            // this makes sure that older configurations are not accidentally
            // applied during future configuration syncs; ideally the new
            // configuration for the VCP is immediately updated in the
            // Configuration Management Server
        end
    end
end

Algorithm 2: VCP Unresolved State Resolution Process

5. Related Projects and Service Offerings

Almost all major cloud providers support on-demand provisioning of big-data analytics platforms, and there are a few open source projects that provide a similar service. A small subset of prominent alternatives is discussed next.

5.1 OpenStack Sahara

Sahara [27] is an integrated, community-driven, open source project in the OpenStack ecosystem whose principal goal is to provide Apache Hadoop as a service to users. The road-map includes support for the Apache Spark platform in the near future, and an early prototype towards that goal was demonstrated by Eurecom [28]. The project goals are similar to those of the DISCO architecture, but Sahara is intimately tied to the OpenStack cloud platform and its various internal services. The project allows the user to choose certain characteristics of their Hadoop cluster (cluster size, heap size, etc.). For data analytics, it supports a number of Hadoop plugins such as Pig [29] and Hive [30], and allows the upload of custom jar files. Data files are assumed to reside in the Swift^12 object store. DISCO additionally plans to support continuous cluster optimization wherever possible, in addition to supporting multiple platforms over possibly many cloud environments.

5.2 Apache Ambari

The architecture of Apache Ambari [31] is similar to the one proposed for DISCO. Ambari allows provisioning, managing, and monitoring of Hadoop clusters. It tracks the health of worker nodes and sends out alerts to sysadmins if a node is unreachable; it also handles configuration management. Similar to the proposed DISCO features, Ambari provides a REST interface, a web UI for interacting with the core service, and various command line clients. Conceptually, it is similar to bare-metal provisioning systems like Foreman^13 and Juju^14, but designed specifically for Hadoop provisioning. The DISCO architecture is not tied to any specific data processing platform, although it will support Hadoop, Spark, and Storm to start with. The runtime configuration optimization and the domain-specific encoding of the platform selection algorithm are novelties in DISCO.

5.3 Pivotal Big Data Suite

Pivotal's Big Data Suite was recently made open source, and development of this effort is governed within the ambit of the Open Data Platform (ODP)^15. The principal goal of the ODP initiative is to prevent fragmentation of the Apache Hadoop ecosystem. Pivotal's Open Data Platform provides Hadoop-based big data processing solutions, governed by the same goals targeted by DISCO. The DISCO framework tries to go a bit further by aiming to encode the scientific and research communities' data platform selection logic into a decision making module to automate this process; this would enable faster convergence towards the optimal platform choice for DISCO's clients. Furthermore, continuous configuration optimization in the light of differentiated use cases is a unique feature of DISCO. The core goal of ODP and Pivotal's Big Data Suite is faster monetization of big-data solutions and supporting enterprise business logic through the promotion of data driven applications; DISCO is geared towards the scientific and research community, but in the process it would facilitate data driven applications in enterprises as well.

5.4 Azure HDInsightMicrosoft’s popular cloud offering Azure recently started

supporting map-reduce data analytics called HDInsight16.Their platform leverages Hortonworks Data Platform (HDP)

^12 OpenStack Swift: http://docs.openstack.org/developer/swift/
^13 Foreman: http://theforeman.org/
^14 Juju: https://jujucharms.com/
^15 ODP: http://opendataplatform.org/
^16 HDInsight: http://azure.microsoft.com/en-us/solutions/big-data/ [accessed: 16-03-2015]


The platform leverages the Hortonworks Data Platform (HDP) Hadoop distribution and provides support for the Apache Hadoop, HBase, and Storm platforms. Additionally, Microsoft supports Ambari^17 for provisioning, monitoring, and management of Hadoop clusters, as well as Hive, Mahout^18 for machine learning, Pig, and other projects from the Hadoop ecosystem. HDFS [12] is the standard file system in HDInsight. It allows users to export results to Excel, among other Microsoft-supported formats.

^17 Ambari: http://ambari.apache.org/
^18 Apache Mahout: http://mahout.apache.org/

5.5 Amazon EMR

Amazon supports data analytics through its Amazon Elastic MapReduce (EMR) service. It enables data stored in Amazon S3 and DynamoDB to be processed in EMR clusters and supports many of the popular Apache Hadoop ecosystem modules and plug-ins, including Pig, Hive, and Spark, among several others. Amazon EMR is well integrated with other Amazon EC2 solutions, including virtual private clouds, and it allows resizing of a running EMR cluster^19. DISCO, in comparison, is an open source solution that aims to achieve most of this functionality using a modular, cloud-platform-agnostic approach. Our aim is to support not only map-reduce jobs but also other statistical tasks, using a semi-automated platform selection process that is optimized for the task at hand.

^19 Amazon EMR product details: http://aws.amazon.com/elasticmapreduce/details/ [accessed: 16.03.2015]

Brief Analysis

DISCO aims to bring distributed computing in the cloud to the scientific and business communities through a highly intuitive, unified, and semi-automated provisioning framework. The proposed framework will encode scientific domain expertise in the implemented algorithms for platform selection and in the configuration management subsystems, and will have an element of learning built in to continually update the optimal configuration knowledge-base based on past and present run experiences. This is in stark contrast to other open source initiatives, including OpenStack Sahara.

6. Conclusion

In this paper we have presented and defended the architectural choices of our unified cloud provisioning framework for big-data and distributed computation in the cloud, called DISCO. The design is generic and independent of any specific cloud platform; integration with the various clouds is achieved through specific drivers, leaving the general mechanism unchanged. In the initial prototypical implementation of the platform we have chosen the Apache Hadoop, Spark, and Storm open source solutions, supported over an OpenStack cloud, but this does not prevent us from integrating and supporting other platforms in the future.


One feature of our platform is the manifestation of all the key cloud principles: self-service, on-demand, elastic, measured pay-as-you-go, etc. This is similar to commercial services such as Amazon EMR, but the DISCO framework is open source in principle and will provide support for a wide variety of distributed computing paradigms, not just map-reduce.

We realize that many scientific groups have their own in-house developed and supported data analysis solutions; hence our focus will also be on enabling such teams to bring their custom distributed computing applications onto popular IaaS clouds through our DISCO framework. In [32], the authors propose a cost-optimized configuration management and deployment strategy for data analytics workloads in the cloud. Their work attempts to find the optimal configuration that satisfies the Service Level Objectives (SLOs) of the workload while minimizing the cost of deployment over a public cloud. This could form the basis for SLO- and SLA-aware deployment and configuration management in DISCO.

Acknowledgment

This work is inspired in part by the MCN orchestration framework and is supported by the European Community Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 318109.

References

[1] I. A. T. Hashem, I. Yaqoob, N. B. Anuar, S. Mokhtar, A. Gani, and S. U. Khan, "The rise of big data on cloud computing: Review and open research issues," Information Systems, vol. 47, pp. 98-115, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0306437914001288

[2] "Loon for All - Project Loon - Google," http://www.google.com/loon/, accessed: 2015-03-11.

[3] "internet.org by Facebook," http://internet.org/, accessed: 2015-03-11.

[4] P. Bogdan and R. Marculescu, "Towards a science of cyber-physical systems design," in Proceedings of the 2011 IEEE/ACM Second International Conference on Cyber-Physical Systems, ser. ICCPS '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 99-108. [Online]. Available: http://dx.doi.org/10.1109/ICCPS.2011.14

[5] "Open Automotive Alliance," http://www.openautoalliance.net/, accessed: 2015-03-12.

[6] A. Mostosi, "The Big-Data Ecosystem Table," http://bigdata.andreamostosi.name/, accessed: 2015-03-11.

[7] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492

[8] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, ser. SOSP '03. New York, NY, USA: ACM, 2003, pp. 29-43. [Online]. Available: http://doi.acm.org/10.1145/945445.945450

[9] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed stream computing platform," in Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, Dec 2010, pp. 170-177.

[10] E. Chen, "Improving Twitter Search with Real-Time Human Computation," http://blog.echen.me/2013/01/08/improving-twitter-search-with-real-time-human-computation/, accessed: 2015-03-19.

[11] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, "Apache Hadoop YARN: Yet another resource negotiator," in Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC '13. New York, NY, USA: ACM, 2013, pp. 5:1-5:16. [Online]. Available: http://doi.acm.org/10.1145/2523616.2523633

[12] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, May 2010, pp. 1-10.

[13] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud'10. Berkeley, CA, USA: USENIX Association, 2010, pp. 10-10. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863103.1863113

[14] M. Franklin, "The Berkeley data analytics stack: Present and future," in Big Data, 2013 IEEE International Conference on, Oct 2013, pp. 2-3.

[15] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, "Tachyon: Reliable, memory speed storage for cluster computing frameworks," in Proceedings of the ACM Symposium on Cloud Computing, ser. SOCC '14. New York, NY, USA: ACM, 2014, pp. 6:1-6:15. [Online]. Available: http://doi.acm.org/10.1145/2670979.2670985

[16] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI'11. Berkeley, CA, USA: USENIX Association, 2011, pp. 295-308. [Online]. Available: http://dl.acm.org/citation.cfm?id=1972457.1972488

[17] Wikipedia, "Storm (event processor)," http://en.wikipedia.org/wiki/Storm_%28event_processor%29, accessed: 2015-03-18.

[18] Apache Storm Community, "Transactional Topologies," https://storm.apache.org/documentation/Transactional-topologies.html, accessed: 2015-03-19.

[19] "CYCLOPS - Rating, Charging, Billing solution for Cloud Providers," http://icclab.github.io/cyclops/, accessed: 2015-03-11.

[20] P. Harsh, K. Benz, I. Trajkovska, A. Edmonds, P. M. Comi, and T. M. Bohnert, "A highly available generic billing architecture for heterogenous mobile cloud services," in Proceedings of the 2014 International Conference on Grid & Cloud Computing & Applications, ser. GCA '14. CSREA, 2014, pp. 29-38. [Online]. Available: http://worldcomp-proceedings.com/proc/proc2014/gca.html

[21] S. Patanjali, B. Truninger, P. Harsh, and T. M. Bohnert, "CYCLOPS: a micro service based approach for dynamic rating, charging & billing for cloud," in The 13th International Conference on Telecommunications (ConTEL 2015), Graz, Austria, July 2015.

[22] R. T. Fielding, "Architectural styles and the design of network-based software architectures," Ph.D. dissertation, University of California, Irvine, 2000.

[23] W. Guo and Y. Wang, "An incident management model for SaaS application in the IT organization," in Research Challenges in Computer Science, 2009. ICRCCS '09. International Conference on, Dec 2009, pp. 137-140.

[24] J. Cusick and G. Ma, "Creating an ITIL inspired incident management approach: Roots, response, and results," in Network Operations and Management Symposium Workshops (NOMS Wksps), 2010 IEEE/IFIP, April 2010, pp. 142-148.

[25] A. Keller, "Towards autonomic networking middleware," May 2005. [Online]. Available: http://www.research.ibm.com/people/a/akeller/Data/ngnm2005_slides.pdf

[26] B. Callaghan, B. Pawlowski, and P. Staubach, "NFS Version 3 Protocol Specification," RFC 1813, June 1995. [Online]. Available: http://tools.ietf.org/html/rfc1813

[27] OpenStack Sahara Community, "Sahara - OpenStack," https://wiki.openstack.org/wiki/Sahara, accessed: 2015-03-16.

[28] OpenStack Sahara Community, "Sahara/SparkPlugin - OpenStack," https://wiki.openstack.org/wiki/Sahara/SparkPlugin, accessed: 2015-03-16.

[29] Apache Pig Community, "Welcome to Apache Pig!" http://pig.apache.org/, accessed: 2015-03-16.

[30] Apache Hive Community, "Apache Hive TM," https://hive.apache.org/, accessed: 2015-03-16.

[31] Apache Ambari Community, "Ambari Design," https://issues.apache.org/jira/secure/attachment/12559939/Ambari_Architecture.pdf, accessed: 2015-03-18.

[32] R. Mian, P. Martin, and J. L. Vazquez-Poletti, "Provisioning data analytic workloads in a cloud," Future Generation Computer Systems, vol. 29, no. 6, pp. 1452-1458, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X12000209
