
SciCumulus: A Lightweight Cloud Middleware to Explore Many Task Computing Paradigm in Scientific Workflows

Daniel de Oliveira, Eduardo Ogasawara, Marta Mattoso - COPPE/UFRJ
Fernanda Baião - NP2Tec/UNIRIO

Abstract - Most of the large-scale scientific experiments modeled as scientific workflows produce a large amount of data and require workflow parallelism to reduce workflow execution time. Some of the existing Scientific Workflow Management Systems (SWfMS) explore parallelism techniques, such as parameter sweep and data fragmentation. In those systems, several computing resources are used to accomplish many computational tasks in homogeneous environments, such as multiprocessor machines or cluster systems. Cloud computing has become a popular high performance computing model in which (virtualized) resources are provided as services over the Web. Some scientists are starting to adopt the cloud model in scientific domains and are moving their scientific workflows (programs and data) from local environments to the cloud. Nevertheless, it is still difficult for scientists to express a parallel computing paradigm for the workflow on the cloud. Capturing distributed provenance data in the cloud is also an issue. Existing approaches for executing scientific workflows using parallel processing are mainly focused on homogeneous environments, whereas, in the cloud, the scientist has to manage new aspects such as the initialization of virtualized instances, scheduling over different cloud environments, the impact of data transfer, and the management of instance images. In this paper we propose SciCumulus, a cloud middleware that explores parameter sweep and data fragmentation parallelism in scientific workflow activities (with provenance support). It works between the SWfMS and the cloud and is designed considering cloud specificities. We have evaluated our approach by executing simulated experiments to analyze the overhead imposed by clouds on the workflow execution time.

Keywords: Scientific Workflows, Middleware, Cloud Computing, Many Task Computing (MTC)

I. INTRODUCTION

Cloud Computing [1] has emerged as an alternative computing model in which Web-based services allow different kinds of users to obtain a large variety of resources, such as software and hardware. Cloud computing has demonstrated applicability to a wide range of problems in several domains, including scientific ones. In fact, some scientists are starting to adopt this computing model and to move their experiments (programs and data) from local environments to the cloud [2-4]. An important advantage provided by clouds is that scientists are not required to assemble an expensive computational infrastructure to execute their experiments or even to configure many pieces of software. An average scientist is able to run experiments by allocating the necessary resources in a cloud [2].

In their experiments, scientists may use different programs to perform a specific activity of the experiment. Data produced by one activity normally needs to be passed as input to another activity, and shimming activities [5] may need to be performed during the execution. This chain of programs that composes a scientific experiment is usually modeled as a scientific workflow, which is an abstraction that allows the structured composition of programs as a flow of activities aiming at a desired result [6]. This data flow can be managed in an ad-hoc way, but it is more adequately handled by complex engines called Scientific Workflow Management Systems (SWfMS) [6], which offer computational support for scientists to define, execute, and monitor scientific workflows either locally or in remote environments. Scientific experiments that are modeled as scientific workflows should follow a specific scientific method and are characterized by the composition (and execution) of several variations of workflows within the same experiment [7]. These variations include changing input data and parameters, for example, when a parameter sweep is required [8]. This situation occurs in data mining workflows, when the same workflow is exhaustively executed with different parameter configurations until the exploration finishes. In addition, in many important scenarios, workflow variations cannot be processed in a feasible time by a single computer or a small cluster. Due to this characteristic, scientists need some parallelism in the workflow to reduce the total execution time.

These experiments that require parameter sweep may involve the exploration of thousands of parameters, in many cases using repetition structures (loops) over dozens of complex workflow activities [8]. This type of computational model needs some level of parallelism to be accomplished in a reasonable time. Although this kind of model is a natural candidate for parallel processing, the total number of (independent or dependent) tasks and available processors (cores) makes it complex to manage in an ad-hoc manner. Recently, the Many Task Computing (MTC) paradigm was proposed [9] to address the problem of executing multiple parallel tasks on multiple processors. This paradigm consists of using several computing resources over short periods of time to accomplish many computational tasks. Some existing SWfMS already explore workflow parallelism using MTC in homogeneous computing environments, such as computational clusters [10,11]. This type of environment relies on centralized control of resources, which eases parallelism, and exploits high-speed communication networks, which brings high performance to a scientific experiment.

Due to the move of scientific experiments to clouds and the increasing demand of those experiments for parallelization, executing parallel scientific workflows in clouds is still a challenge. Although clouds provide elastic resources that can be used to execute parallel instances of a specific scientific workflow, it is still difficult for scientists to express a parallel computing paradigm for the workflow on the cloud. It may be feasible for scientists to do this task in an ad-hoc way. However, ad-hoc approaches are not scalable; they may be exhausting for scientists and error-prone, and they may become a barrier to executing the entire experiment. In this case, a specialized computational infrastructure is needed to simplify this execution, structuring the parallel execution and isolating scientists from its complexity. Existing approaches for executing scientific workflows using parallel processing are mainly focused on homogeneous environments [8,12], whereas on the cloud the scientist has to manage new important aspects such as the initialization of virtualized instances, the scheduling of workflow activities over different cloud environments, the impact of data transfer, and the management of instance images.

Another important aspect to be considered is the data related to scientific workflows. The information about the experiment definition and the results produced by its executions are fundamental to the scientific experimentation process. A scientific experiment is considered "scientific" only if it can be reproduced. To reproduce a scientific experiment, scientists have to analyze data from previous executions. This data is called provenance [13]. When scientists execute workflows in distributed environments, it may become a complex task to capture and analyze provenance data. In addition, capturing distributed provenance data from the cloud is also an open, yet important, issue that should be considered.

Additionally, some existing environments present architectural limitations that can also represent a barrier to the conduction of the scientific experiment. For example, some cloud environments limit the maximum number of virtualized instances to a (small) fixed number, such as Amazon EC2 [14]*. Scientists are only able to parallelize scientific workflows using this limited amount of instances, and in many cases they need much more than this to achieve results in a reasonable time. In many large-scale experiments, high performance requirements lead scientists to run experiments in more than one cloud environment or using different accounts in the same cloud provider (thus using more virtualized instances), and they need a computational infrastructure that supports this.

This computational infrastructure should be able to isolate scientists from the complexity of managing parallel workflow executions in clouds. Therefore, the infrastructure itself should be in charge of starting, monitoring, and transparently controlling the distributed execution. This way, scientists do not have to manage infrastructure components or worry about configurations, allowing them to focus only on the experimentation process itself. There are solutions, such as Map-Reduce [15], that aim to facilitate the distribution and control of parallel activities across computing nodes. Although it facilitates distribution, the concept of Map-Reduce is disconnected from the paradigm of scientific workflows and scientific experiments and imposes the burden of programming on scientists.

* Amazon EC2 constantly updates its services. The features discussed in this paper are based on a snapshot from March 2010.

It is preferable that scientists have a transparent way to control and monitor distributed executions of activities within a workflow. One possible solution is to use a middleware to promote the transparent execution of scientific workflows in clouds using MTC.

In this paper, we introduce SciCumulus, a cloud middleware that supports the MTC paradigm in scientific workflows with provenance support. SciCumulus promotes workflow parallelism in clouds following the MTC paradigm, isolating the scientists from the complexity of the cloud environment and collecting distributed provenance data. SciCumulus works as a complement to current SWfMS. It was designed considering cloud architectural characteristics and specificities. We evaluated SciCumulus by executing simulated experiments in a simulation environment developed on top of CloudSim [16]. The goal is to analyze the overhead imposed by clouds.

This paper is organized as follows. Section 2 describes the conceptual architecture and the types of parallelism that are supported by SciCumulus. Section 3 briefly describes the simulator, an extension of CloudSim to support workflow simulation on clouds with parallel techniques. Section 4 shows experimental results. Section 5 presents related work, and we conclude in Section 6.

II. SCICUMULUS

SciCumulus is a lightweight middleware designed to distribute, control, and monitor the parallel execution of scientific workflow activities (or even entire scientific workflows) from a SWfMS in a cloud environment, such as Amazon EC2. SciCumulus orchestrates the execution of workflow activities on a distributed set of virtualized machines. In addition, SciCumulus is itself distributed. It is organized in a three-layer architecture: (i) Desktop, which dispatches workflow activities to be executed in the cloud environment using a local SWfMS such as VisTrails [17]; (ii) Distribution, which manages the execution of activities in cloud environments; and (iii) Execution, which executes the workflow programs.

To isolate scientists from the complexity of distributing workflow activities (or entire workflows) using the MTC paradigm in cloud environments, SciCumulus promotes the use of control workflow components. SciCumulus provides two different types of parallelism: parameter sweep [8] and data parallelism [18]. Controlling those types of parallelism is difficult when done in an ad-hoc way in cloud environments due to the many tasks to be managed.

This way, SciCumulus provides a computational infrastructure to support workflow parallelism with provenance gathering in the cloud environment. The SciCumulus architecture is simple and may be deployed and linked to any existing SWfMS, reducing the effort required from scientists. The following sub-sections detail the concept of a cloud activity, the types of parallelism supported by SciCumulus, and its components.

A. Cloud Activity

One of the most important definitions in the SciCumulus architecture is the notion of a Cloud Activity, which is not a consensus considering all possible parallel execution strategies. A cloud activity that is executed in SciCumulus contains the program to be executed, its parallelism strategy, parameter values, and the data to be consumed. Cloud activities are quite different from workflow activities, since workflow activities are not concerned with the parallelism strategies chosen by scientists. As a parallelism strategy we consider, for example, the functions used to fragment and aggregate data in complex workflows (such as the map and reduce functions in the Map-Reduce strategy [19]).

Cloud activities require that scientists specify parallelism metadata, such as the type of parallelism, the number of fragments to be produced, the parameters and the interval of parameter values to be explored, performance requirements, and time restrictions. One example of a cloud activity may include the bioinformatics program BLAST [3], its parameters, and the data sets to be consumed, all encapsulated in one unit of work to be run in a cloud environment.

B. Cloud Activity Finite State Machine

SciCumulus cloud activities are ruled by a finite state machine with four states: initialized, activated, queued, and finished.

When SciCumulus starts a cloud activity, it receives the initialized state. If, for some reason, a cloud activity cannot immediately start executing, it receives the queued state. The cloud activity proceeds to the activated state when it can be executed normally. Once the activity execution finishes, it receives the finished state. When the cloud activity has a problem, such as a corrupted program or corrupted data, and cannot be executed, it goes directly from the initialized (or queued) state to finished.
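
For illustration, this state machine can be written as a small Java enum with an explicit transition table, as sketched below. The class and method names are our own assumptions for presentation purposes and do not come from the SciCumulus implementation.

    import java.util.EnumSet;
    import java.util.Set;

    // Possible states of a SciCumulus cloud activity (Section II.B).
    enum CloudActivityState {
        INITIALIZED, QUEUED, ACTIVATED, FINISHED;

        // Legal successor states for each state.
        Set<CloudActivityState> successors() {
            switch (this) {
                case INITIALIZED: return EnumSet.of(QUEUED, ACTIVATED, FINISHED);
                case QUEUED:      return EnumSet.of(ACTIVATED, FINISHED);
                case ACTIVATED:   return EnumSet.of(FINISHED);
                default:          return EnumSet.noneOf(CloudActivityState.class);
            }
        }
    }

    class CloudActivityLifecycle {
        private CloudActivityState state = CloudActivityState.INITIALIZED;

        // Moves the activity to a new state, rejecting transitions the FSM does not allow.
        void transitionTo(CloudActivityState next) {
            if (!state.successors().contains(next)) {
                throw new IllegalStateException(state + " -> " + next + " is not allowed");
            }
            state = next;
        }

        CloudActivityState current() { return state; }

        public static void main(String[] args) {
            CloudActivityLifecycle activity = new CloudActivityLifecycle();
            activity.transitionTo(CloudActivityState.QUEUED);
            activity.transitionTo(CloudActivityState.ACTIVATED);
            activity.transitionTo(CloudActivityState.FINISHED);
            System.out.println("final state: " + activity.current());
        }
    }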

C. Types of Parallelism in SciCumulus

SciCumulus aims at providing a middleware that allows scientists to employ two types of parallel functionality. These two types of parallelism are very common in scientific experimentation, and we have chosen to initially support them in SciCumulus. This sub-section briefly describes each of them: parameter sweep [8] and data parallelism [18]. Our parallelism formalism is adapted from [18].

A given scientific workflow W is composed of a set of activities chained in a coherent flow to represent one scientific model. A workflow W is represented by a quadruple (A, Pt, I, O), where A is the set {a1, a2, ..., an} of activities that compose W; Pt is the set {pt1, pt2, ..., ptr} of parameters of W; I is the set {i1, i2, ..., im} of input data to be consumed; and O is the set {o1, o2, ..., on} of output data.

Data parallelism can be characterized by the simultaneous execution of one cloud activity ai, where each execution of ai consumes a specific subset of the input data. It can be achieved by generating many instances of ai in the cloud environment, where each instance stores a specific subset of the entire input data I = {i1, i2, ..., im}.

Parameter sweep parallelism may be defined as the simultaneous execution of one cloud activity ai, where each execution of ai consumes a specific subset of the possible values of the parameters Pt. It can be achieved by replicating the cloud activity ai on many instances of the cloud environment, where each instance stores a specific subset of the possible parameter values.
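
To make the two strategies concrete, the following sketch (our own illustration, not SciCumulus code) generates the work units of each strategy: data parallelism partitions the input data I into subsets, while parameter sweep takes the Cartesian product of the value lists of the parameters in Pt.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative generation of work units for the two parallelism types.
    public class ParallelismSketch {

        // Data parallelism: split the input data I into 'fragments' subsets,
        // one cloud activity per subset.
        static <T> List<List<T>> fragment(List<T> inputData, int fragments) {
            List<List<T>> result = new ArrayList<>();
            int size = (int) Math.ceil((double) inputData.size() / fragments);
            for (int start = 0; start < inputData.size(); start += size) {
                result.add(inputData.subList(start, Math.min(start + size, inputData.size())));
            }
            return result;
        }

        // Parameter sweep: Cartesian product of the value lists of Pt,
        // one cloud activity per combination.
        static List<List<String>> sweep(List<List<String>> parameterValues) {
            List<List<String>> combinations = new ArrayList<>();
            combinations.add(new ArrayList<>());
            for (List<String> values : parameterValues) {
                List<List<String>> next = new ArrayList<>();
                for (List<String> partial : combinations) {
                    for (String v : values) {
                        List<String> extended = new ArrayList<>(partial);
                        extended.add(v);
                        next.add(extended);
                    }
                }
                combinations = next;
            }
            return combinations;
        }
    }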

D. Architecture

Figure 1 presents the SciCumulus conceptual architecture in three main layers. The first layer comprises desktop components that are placed on the scientist's machine. The second layer is composed of components that handle distributed activities over the cloud, and the third layer is composed of the execution instances where cloud activities are actually executed. The numbers alongside the arrows represent the execution order of the components.

1) SciCumulus Desktop Layer

The Desktop Layer provides components that allow scientists to compose and execute their workflows in a parallel and transparent way, diminishing the complexity of the modeling phase. The idea is to replace components in the local SWfMS by SciCumulus components, which communicate with the distribution components to promote data distribution and parallelization. This layer has five main components: Desktop Setup, Uploader, Dispatcher, Downloader, and Provenance Capturer.

The last four components have to be inserted by the scientist into the original sequential scientific workflow (also modeled by the scientist), replacing each activity that can be parallelized. If the scientist plans to parallelize activity ai, it should be replaced by these four components to promote workflow parallelization.

The Desktop Setup configures the parallel execution of the workflow. This component offers a setup mode for scientists to define the parallel strategy (data parallelism or parameter sweep), the input data that needs to be transferred to the cloud environment, the programs that have to be distributed over the cloud, and additional minor information. The setup is then sent to the distribution layer to configure the whole cloud environment.
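
The paper does not prescribe a concrete format for this setup. Purely as an illustration, the information collected by the Desktop Setup could be represented by a structure such as the following, where all field names are hypothetical:

    import java.util.List;

    // Hypothetical representation of the information collected by the Desktop Setup
    // and shipped to the distribution layer.
    public class ParallelSetup {
        enum Strategy { DATA_PARALLELISM, PARAMETER_SWEEP }

        final Strategy strategy;            // parallel strategy chosen by the scientist
        final List<String> inputFiles;      // data to be uploaded to the cloud
        final List<String> programs;        // programs to be distributed over the cloud
        final int desiredInstances;         // additional minor information

        ParallelSetup(Strategy strategy, List<String> inputFiles,
                      List<String> programs, int desiredInstances) {
            this.strategy = strategy;
            this.inputFiles = inputFiles;
            this.programs = programs;
            this.desiredInstances = desiredInstances;
        }
    }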

The Uploader moves data (files, parameters, and so on) to the cloud, while the Downloader gathers the final results from the cloud instances. The Downloader checks the desktop setup to determine in which cloud instances the results are stored and to where they should be transferred. The Uploader and Downloader components are preferably invoked when the scientific experiment uses small input data and produces small output data, due to the overhead of transferring data over the Internet to the cloud environment; it may be unfeasible to transfer huge amounts of data. For example, if scientists intend to run a parameter sweep using a 256 KB input file, it is feasible to transfer this amount of data, which may not be the case when scientists deal with 1 GB of input data.

When large amounts of data are involved, the data should already be placed in the cloud environment to avoid the transfer. Transfers from and to the cloud environment are initially designed to use secure connections (SSH). However, we expect to improve security using mechanisms such as the ones presented by Hasan et al. [20].
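
A minimal sketch of the kind of staging the Uploader performs is shown below, assuming that scp over SSH is available on the desktop machine; the host name and paths are placeholders.

    import java.io.IOException;

    // Illustrative upload step: stage a local input file to a cloud instance
    // over SSH using scp.
    public class UploaderSketch {
        public static void main(String[] args) throws IOException, InterruptedException {
            String localFile = "input/params.dat";                   // data selected in the Desktop Setup
            String remoteTarget = "scientist@cloud-host:/scicumulus/data/";

            Process scp = new ProcessBuilder("scp", localFile, remoteTarget)
                    .inheritIO()   // show scp progress and errors on the console
                    .start();

            if (scp.waitFor() != 0) {
                throw new IOException("Upload failed for " + localFile);
            }
        }
    }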

The Dispatcher launches the execution of cloud activities in the cloud and communicates directly with the Execution Broker in the distribution layer. It also provides monitoring mechanisms so that scientists can follow the status of the running cloud activities. Once the distributed execution is finished, the Dispatcher returns control to the local SWfMS.

The Provenance Capturer collects provenance data from the cloud. It is enacted when the distributed parallel execution finishes. It collects the distributed provenance data stored in the cloud instances and transfers it to a local repository. This way, scientists can write provenance queries such as "In which virtualized instances are the results of the workflow W1 that was executed on 03/03/2010 at 3:03 pm stored?" By analyzing provenance data, computer engineers may also evaluate performance to improve some components. The repository aims to be compliant with the Open Provenance Model (OPM) [21]. Nevertheless, since OPM does not yet represent distributed execution provenance data, we are still working on associating our provenance schema with the OPM model. The provenance repository was modeled to link prospective provenance and retrospective distributed provenance [13] collected during the scientific experiment life cycle [7].
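
Assuming a relational provenance repository (the paper does not publish its schema), the natural-language query above could be expressed roughly as follows; the JDBC URL and all table and column names are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    // Illustrative provenance query against a hypothetical repository schema.
    public class ProvenanceQuerySketch {
        public static void main(String[] args) throws Exception {
            String sql =
                "SELECT e.instance_id, e.result_location " +
                "FROM workflow_execution w " +
                "JOIN cloud_activity_execution e ON e.workflow_exec_id = w.id " +
                "WHERE w.workflow_tag = ? AND w.started_at = ?";

            // Placeholder connection string; a real deployment supplies credentials.
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/provenance");
                 PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, "W1");
                stmt.setTimestamp(2, Timestamp.valueOf("2010-03-03 15:03:00"));
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("instance_id") + " -> "
                                + rs.getString("result_location"));
                    }
                }
            }
        }
    }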

2) SciCumulus Distribution Layer

This layer manages the parallel execution of cloud activities in the distributed environment. All components of this layer are hosted in the cloud. The layer has seven components: Execution Broker, Data Fragmenter, Parameter Sweeper, Encapsulator, Scheduler, Distribution Controller, and Data Summarizer.

The Execution Broker interfaces between the desktop layer and the distribution layer. It starts the parallel execution flow by querying the configuration repository. Depending on the parallelism strategy chosen by the scientists, this component behaves differently: it invokes either the Parameter Sweeper or the Data Fragmenter component. The Execution Broker is also responsible for sending synchronization messages to the client Dispatcher.

The Parameter Sweeper handles the combinations of parameters received from the client workflow for the specific cloud activity that is being parallelized. Workflow parameters may be directly or indirectly dependent on the input datasets; however, possible restrictions derived from this dependency are not currently considered in our proposal. The Data Fragmenter fragments the original data (stored in the Data Repository) and generates subsets of it to be distributed over the execution virtualized instances. It is important to highlight that the Data Fragmenter component is problem-dependent. In other words, it has to change its behavior depending on the type/format of the input data to be fragmented. In the case of FASTA files [3], for instance, there is a specific method to fragment them based on the distribution of DNA sequences within the file. Thus, the Data Fragmenter needs to be adapted to fragment each specific type of input data.

Figure 1. SciCumulus conceptual architecture, showing the Desktop, Distribution, and Execution layers, their components, and the Configuration, Data, and Provenance repositories; the numbers alongside the arrows indicate the execution order of the components.

However, it is not simple to deploy problem-independent components. We applied software reuse techniques [22] by designing extensible components through cartridges. A cartridge is a component unit that can be dynamically changed; our cartridge is changed depending on the input data. This way, the fragmenter invokes a program that encapsulates the fragmentation rules, and these rules change from one problem to another. The idea of the fragmenter is similar to the idea of a map function in a Map-Reduce strategy [19]. The Data Fragmenter is only invoked if the scientist sends the input data to be fragmented by the distribution layer. In some cases, the entire data set (or even the fragments) is already placed on the execution layer and the distribution layer is not responsible for fragmenting it. In this case, the component only creates a fragmentation plan (describing which part of the data set is related to each fragment to be consumed) and sends it to the Encapsulator.
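
The sketch below illustrates the cartridge idea with a hypothetical fragmentation interface and a FASTA cartridge that keeps each sequence (records starting with '>') intact and distributes them round-robin across the fragments; it is our own example, not the SciCumulus code.

    import java.util.ArrayList;
    import java.util.List;

    // A pluggable unit that encapsulates problem-specific fragmentation rules.
    interface FragmentationCartridge {
        List<String> fragment(List<String> lines, int fragments);
    }

    // Example cartridge for FASTA files: each sequence record stays whole.
    class FastaCartridge implements FragmentationCartridge {
        @Override
        public List<String> fragment(List<String> lines, int fragments) {
            List<StringBuilder> parts = new ArrayList<>();
            for (int i = 0; i < fragments; i++) parts.add(new StringBuilder());

            int record = -1;
            for (String line : lines) {
                if (line.startsWith(">")) record++;      // start of a new sequence
                if (record < 0) continue;                // skip anything before the first header
                parts.get(record % fragments).append(line).append('\n');
            }

            List<String> result = new ArrayList<>();
            for (StringBuilder sb : parts) result.add(sb.toString());
            return result;
        }
    }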

Both the Data Fragmenter and Parameter Sweeper components are also deployed in each cloud instance. This was an important design decision since, when the data to be consumed and the parameter files are already in the cloud, it is necessary to execute those activities in each instance to avoid unnecessary data transfer.

After configuring the parallel strategy, SciCumulus starts to pack programs, (fragmented) data, and (subsets of) parameters into units of work (cloud activities). The Encapsulator generates all cloud activities to be deployed to the virtualized instances.

Once the entire environment is set, the Scheduler starts working, defining which virtualized instances receive cloud activities to execute. Note that the virtualized instances may be provided by any cloud provider. The Scheduler takes into account the available instances, the permissions to access them, and the computational power of each one; this information may also be queried by scientists. After generating a scheduling plan, this component sends it to the Distribution Controller.
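
The paper does not specify the scheduling policy. As one simple possibility, the sketch below builds a greedy plan that always assigns the next cloud activity to the instance with the smallest projected load relative to its computational power; the names and the load metric are our assumptions.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative greedy scheduling plan.
    public class SchedulerSketch {
        static Map<String, List<String>> plan(List<String> activities,
                                              Map<String, Double> instancePower) {
            Map<String, List<String>> assignment = new HashMap<>();
            Map<String, Double> load = new HashMap<>();
            for (String instance : instancePower.keySet()) {
                assignment.put(instance, new ArrayList<>());
                load.put(instance, 0.0);
            }
            for (String activity : activities) {
                String best = null;
                for (String instance : instancePower.keySet()) {
                    double projected = (load.get(instance) + 1.0) / instancePower.get(instance);
                    if (best == null
                            || projected < (load.get(best) + 1.0) / instancePower.get(best)) {
                        best = instance;
                    }
                }
                assignment.get(best).add(activity);
                load.put(best, load.get(best) + 1.0);
            }
            return assignment;
        }
    }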

The Distribution Controller transfers cloud activities to the available instances and monitors their execution. If the number of cloud activities is greater than the number of available instances, the Distribution Controller keeps some cloud activities in the queued state until there is an available instance to deploy them to. Once all cloud activities have been executed, the Distribution Controller collects the provenance stored in the instances and the results, and invokes the Data Summarizer. Existing cloud environments usually limit the maximum number of virtualized instances to a (small) fixed number. Using SciCumulus, however, we may acquire more instances, since the Distribution Controller can send cloud activities to more than one cloud provider or to different accounts of the same provider, allowing scientists to acquire a more significant amount of instances. The only requirement is that all components of the SciCumulus execution layer are deployed on the virtualized instances, regardless of the cloud provider used. The Distribution Controller is also responsible for an important task: the initialization and management of instance images. Before sending a cloud activity for execution, the Distribution Controller queries the global schema to discover the specific instance image to deploy and then initializes it.

The Data Summarizer aggregates partial results, generating the final result of the distributed parallel execution. Similar to the Data Fragmenter, the Data Summarizer is also problem-dependent, and cartridges must be implemented depending on the problem. These cartridges have some basic pre-defined routines. In the case of data fragmentation parallelism in particular, the cartridge encompasses an equivalent of the reduce function in Map-Reduce. After finishing its execution, it returns control to the desktop layer to transfer results and provenance to the scientists' desktop.

3) SciCumulus Execution Layer

The execution layer manages the execution of cloud activities in the virtualized instances. It has five main components: Instance Controller, Data Fragmenter, Parameter Sweeper, Configurator, and Executor.

The Instance Controller interfaces between the distribution layer and the execution layer. It is also responsible for controlling the execution flow in the instance. The Data Fragmenter and Parameter Sweeper components have the same behavior as when placed in the distribution layer; the main difference is that they are executed in the instances to avoid data transfer. When data fragmentation or parameter sweep components have already executed in the distribution layer, they are not invoked by the Instance Controller.

The Configurator sets up the entire environment in the instance. It unpacks the cloud activity, creates replicas of the program at a specific location, creates directories to store parameter files if needed, and stores the data to be consumed. It creates different workspaces for different executions. Once the whole environment is set, the Executor is invoked.

The Executor invokes the specific program, passing parameters and generating the command lines to execute. It also stores provenance data in a local repository. If an error occurs, this component is responsible for sending a message to the Instance Controller, which forwards it to the Distribution Controller. The following section explains how the SciCumulus architecture was modeled and implemented in a simulation environment.
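
As an illustration of this step, the sketch below builds a command line for a wrapped program, runs it inside the activity workspace, and prints a minimal retrospective provenance record; the program path and flags are placeholders, not the actual SciCumulus configuration.

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative Executor step: assemble the command line and record provenance.
    public class ExecutorSketch {
        public static void main(String[] args) throws IOException, InterruptedException {
            String program = "/scicumulus/bin/some_program";        // placeholder program path
            List<String> command = new ArrayList<>();
            command.add(program);
            command.add("--input");
            command.add("fragment_01.dat");                          // data from the cloud activity
            command.add("--output");
            command.add("result_01.out");

            long start = System.currentTimeMillis();
            Process p = new ProcessBuilder(command)
                    .directory(new File("/scicumulus/workspace/activity_01"))
                    .inheritIO()
                    .start();
            int exitCode = p.waitFor();
            long end = System.currentTimeMillis();

            // Minimal retrospective provenance record (would go to the local repository).
            System.out.printf("program=%s exit=%d start=%d end=%d%n",
                    program, exitCode, start, end);
        }
    }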

III. SCICUMULUS SIMULATION ENVIRONMENT

The simulation environment aims to analyze how the proposed architecture handles scientific workflow parallelism and distribution over cloud environments. We evaluate SciCumulus performance as the number of virtualized instances involved in the workflow execution grows.

The simulation environment was carefully designed and implemented to reflect important characteristics of a real cloud environment, as well as to enable the evaluation of important issues embedded in the SciCumulus proposal. The simulator was developed on top of CloudSim [16]. Some components were extended and many others were implemented from scratch. The simulator is focused on the distribution layer. Originally, CloudSim provides a generic broker modeled as a class named DataCenterBroker in the CloudSim conceptual model. We changed this class into our ExecutionBroker to model the behavior of this component and its particular placement policies.


Part of the work also focused on the class Cloudlet, which comprises the programs to be executed in cloud instances. We extended this class to incorporate the concept of cloud activities: Cloudlet was modified to encapsulate not only the program to be executed, but the data and parameters as well.
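
A conceptual sketch of this extension is given below. It deliberately avoids the real CloudSim constructor signatures (which vary across versions), so the base class here is only a stand-in for org.cloudbus.cloudsim.Cloudlet.

    import java.util.List;
    import java.util.Map;

    // Stand-in for CloudSim's Cloudlet; the actual simulator extends the real class.
    class Cloudlet { }

    // A cloud activity carries not only the program, but also the data fragment
    // and the parameter subset it should consume.
    class CloudActivityCloudlet extends Cloudlet {
        private final String program;                  // program wrapped by the activity
        private final List<String> inputFragment;      // subset of the input data I
        private final Map<String, String> parameters;  // subset of the parameter values Pt

        CloudActivityCloudlet(String program,
                              List<String> inputFragment,
                              Map<String, String> parameters) {
            this.program = program;
            this.inputFragment = inputFragment;
            this.parameters = parameters;
        }

        String program() { return program; }
        List<String> inputFragment() { return inputFragment; }
        Map<String, String> parameters() { return parameters; }
    }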

The simulator has three independent variables [23]: (i) the number of cloud activities in the simulation, (ii) the cloud activity data size (in megabytes) to be transferred to and from the cloud environment, and (iii) the size of the image to be initialized. We have also modeled execution errors following a Poisson distribution [23], meaning that a cloud activity may occasionally fail to execute. On the occurrence of an error, the Distribution Controller has to send the cloud activity for re-execution.
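
One way to realize this error model (our illustration; the paper does not show the simulator's sampling code) is to draw the number of failed cloud activities per round from a Poisson distribution, using Knuth's sampling method, and re-queue them:

    import java.util.Random;

    // Illustrative Poisson-distributed failure injection with re-execution.
    public class FailureModelSketch {
        private static final Random RNG = new Random(42);

        // Knuth's method for sampling a Poisson random variate with mean lambda.
        static int poisson(double lambda) {
            double limit = Math.exp(-lambda);
            double p = 1.0;
            int k = 0;
            do {
                k++;
                p *= RNG.nextDouble();
            } while (p > limit);
            return k - 1;
        }

        public static void main(String[] args) {
            int pending = 100;           // cloud activities in this round
            double meanFailures = 2.0;   // assumed mean number of failures per round
            while (pending > 0) {
                int failed = Math.min(pending, poisson(meanFailures));
                int succeeded = pending - failed;
                System.out.println(succeeded + " finished, " + failed + " re-queued");
                pending = failed;        // failed activities are sent for re-execution
            }
        }
    }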

All combinations of the independent variables generate 4 simulation configurations, each evaluated twice (for data fragmentation and for parameter sweep). The dependent variables [23], i.e., the values we assess, are the workflow wall time, the data transfer time of each activity executed in the cloud environment, and the initialization time of each instance. The simulated execution time, data size, and transfer time are based on a real scientific experiment of computational fluid dynamics [12].

IV. EXPERIMENTAL RESULTS

To evaluate and analyze the overhead imposed by clouds in a simulated environment based on a real scientific workflow execution, we conducted a series of experiments using the SciCumulus architecture. Our experimental environment is based on a cloud pool consisting of 8 hosts (distributed in 4 simulated datacenters), each with a 3.0 GHz processor, 16 GB of RAM, and a gigabit Ethernet connection. We used a Pentium Dual-Core PC with 4 GB of RAM and a 500 GB hard disk to run the simulation.

We are specifically interested in determining system performance for the different types of parallelism (data fragmentation and parameter sweep) associated with the cloud activities. These simulations are expected to indicate the performance gains of the real deployment. All virtualized machines associated with the hosts have 4 dual-core CPUs with a 2.4 GHz clock and 1 GB of RAM.

The first experiment executed was a data fragmentation one. In this experiment we partitioned a 10 MB input data file into 100 (Figure 2) and 128 (Figure 3) fragments. Each fragment is associated with a cloud activity that produces 100 KB of data as output. In Figure 2 and Figure 3 we present the execution time of the cloud activities with and without considering data transfer.

The execution time of a workflow using SciCumulus, including the whole process of distributing the workflow execution, is presented in the "Execution + Data Transfer" line, in which we can observe the overhead of data transfer. This line represents the total execution time of the distributed workflow, thus considering the effective execution of the distributed activities and the upload and download times to and from the cloud. The "Execution" line represents only the execution time of the distributed activities, without taking into account the transfer times (upload and download of input and output files). As mentioned before, the upload and download have a fixed time, since the input (10 MB) and output (100 KB) data sizes are fixed.

We can observe in Figure 2 and Figure 3 that, as the number of virtualized instances increases, the total execution time of the distributed activities decreases, as expected. However, the transfer time remains the same, thus increasing the impact of the transfer time on the total execution time as the number of instances increases. We can also observe that the performance benefits of parallelizing are clear, but additional studies must be conducted to verify the maximum input size for which transfers remain advantageous compared to the execution time alone. Acquiring more virtualized instances for execution may not bring much benefit, particularly when the number of cloud activities becomes close to the number of virtualized instances. In some cases, the input data is already placed in the cloud environment, and this transfer time can be avoided. Note the performance decrease in Figure 2 when we use 128 virtualized instances, which does not happen in the scenario presented in Figure 3. This is due to the fact that in Figure 2 we have 100 cloud activities and 128 instances, so 28 instances remain idle. Nevertheless, these instances have to be initialized even when they are not used, and this initialization has a cost that, in the case of Figure 2, has a negative impact on the overall performance.

Figure 2. Data fragmentation with 100 fragments: workflow execution time (sec) vs. number of virtualized instances, for the Execution, Execution + Data Transfer, and Linear series.

Figure 3. Data fragmentation with 128 fragments: same axes and series as Figure 2.

The second experiment executed was a parameter sweep. We created 100 (Figure 4) and 128 (Figure 5) parameter combinations to be evaluated. Each combination is associated with a cloud activity that consumes a 10 MB input file and produces 100 KB of output data. In Figure 4 and Figure 5 we present both the execution time of the cloud activities and the execution time considering data transfer.

Similarly to Figure 2 and Figure 3, the “Execution + Data Transfer” line represents the total execution time of the distributed workflow while the “Execution” line represents only the execution time of the distributed activities without taking into account the transfer times.

Figure 4. Parameter sweep with 100 combinations: workflow execution time (sec) vs. number of virtualized instances, for the Execution, Execution + Data Transfer, and Linear series.

Figure 5 shows that the distributed activity processing scales well in the cloud environment; however, the impact of data transfer on the total elapsed time is higher than in the data fragmentation experiment. This occurs because we have to transfer the 10 MB of input data to each one of the instances, instead of transferring a 100 KB fragment as in the first experiment. We may intuitively infer that, in cases of parameter sweep with many instances involved and one or more input files larger than a few kilobytes, the input data should already be placed in the cloud environment.

In Figure 4 there is a performance loss with 128 virtualized instances which does not happen in the scenario presented in Figure 5. This is analogous to the comparison between Figure 2 and Figure 3, in which the initialization of idle instances produces a negative impact on the overall performance.

V. RELATED WORK

There are some approaches in the literature that propose mechanisms to provide MTC capabilities to scientific applications. Nimrod/K [24] proposes a set of components and a runtime machine for the Kepler SWfMS. Nimrod/K is based on an architecture that uses tagged dataflow concepts for highly parallel machines. These concepts are implemented as a Kepler director that orchestrates the execution on clusters, grids, and clouds using MTC. Although Nimrod/K is a step forward, it is only focused on parameter exploration, without addressing data fragmentation or provenance capture.

Hydra [12] is a set of components to be included in the workflow specification of any SWfMS to control the parallelism of activities as MTC. Using Hydra, the MTC parallelism strategy can be registered and reused, and provenance may be gathered. Although Hydra provides data fragmentation and parameter sweep mechanisms, it is based on a homogeneous cluster environment and relies on a centralized scheduler, which is not the reality in clouds.

Hoffa et al. [4] propose the use of virtualized clusters with the Pegasus SWfMS [10] to evaluate the tradeoffs between running scientific workflows in a local environment and running them in a virtualized environment. Their goal was not to analyze detailed performance metrics, but to show that domain scientists can use different environments to perform their experiments. Although they coupled a virtualized environment to a SWfMS, this approach is specific to Pegasus and appears to work only with virtualized clusters in the Nimbus science cloud [2].

Figure 5. Parameter sweep with 128 combinations: same axes and series as Figure 4.

The approach proposed in [19] introduces a Map-Reduce-based scientific workflow composition framework composed of a dataflow-based scientific workflow model and a set of dataflow constructs, including Map, Reduce, Loop, and Conditional constructs, to enable a Map-Reduce approach to scientific workflows. The framework was implemented and evaluated. However, differently from SciCumulus, this approach does not capture provenance, nor does it provide a parameter sweep capability.

VI. CONCLUSIONS AND FINAL REMARKS

Large scientific experiments have a long duration, in which many executions using different parameters and input data are necessary to reach a final conclusion and confirm or refute a hypothesis [23]. These executions take a long time, and scientists have to use parallel techniques to improve performance. However, it is far from trivial to manage parallel executions of large scientific workflows [7], particularly in cloud environments.

Our previous work [12] addressed workflow execution following the MTC paradigm in homogeneous cluster environments. That solution cannot be applied to heterogeneous environments such as clouds due to architectural differences between those environments, such as the management of instance images, instance initialization, and so on.

To address the parallel execution of scientific workflows in clouds, we proposed SciCumulus, a lightweight middleware that supports workflow execution in clouds following the MTC paradigm. The main goal of SciCumulus is to work with a SWfMS to provide transparent parallelism. To achieve high performance while keeping track of the workflow provenance, SciCumulus has specific components that run in the high performance environment and capture execution information to be analyzed. SciCumulus provides two types of parallelism that can benefit from MTC: (i) parameter sweep and (ii) data parallelism.

Our simulated experimental results showed performance gains for both types of parallelism. To evaluate the SciCumulus overhead using different numbers of virtualized instances, we ran the simulated experiment using 100 and 128 fragments and 100 and 128 parameter value combinations, leading to 100 and 128 small cloud activities in each experiment. The execution time of the workflow was significantly reduced as the number of virtualized instances involved in the distributed cloud activities increased. The overhead of transferring data from the local SWfMS to the cloud environment and vice-versa, and the overhead of initializing instances, were acceptable when compared to the degree of parallelism obtained with parameter sweep and data fragmentation.

Although these are simulated results, we believe they show the viability of the SciCumulus architecture; real experiments need to be performed to fully evaluate the approach and the two types of parallelism presented in this paper. The implementation of SciCumulus components is ongoing work, and many components are already implemented on top of the VisTrails SWfMS [17]. Future work includes the study of the monetary impact of workflow executions on commercial cloud environments, and dealing with restrictions imposed by the combinations of workflow parameters and their dependency on input datasets.

Acknowledgements. The authors thank CNPq and CAPES for partially sponsoring our research, and Diego Pereira for the implementation of the simulation environment.

REFERENCES

[1] D. Oliveira, F. Baião, and M. Mattoso, 2010, "Towards a Taxonomy for Cloud Computing from an e-Science Perspective", Cloud Computing: Principles, Systems and Applications (to be published), Heidelberg: Springer-Verlag.

[2] L. Wang, J. Tao, M. Kunze, A.C. Castellanos, D. Kramer, and W. Karl, 2008, Scientific Cloud Computing: Early Definition and Experience, In: 10th IEEE HPCC, p. 825-830, Los Alamitos, CA, USA.

[3] A. Matsunaga, M. Tsugawa, and J. Fortes, 2008, CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications, IEEE eScience 2008, p. 222-229.

[4] C. Hoffa, G. Mehta, T. Freeman, E. Deelman, K. Keahey, B. Berriman, and J. Good, 2008, On the use of cloud computing for scientific workflows, In: IEEE Fourth International Conference on eScience (eScience 2008), Indianapolis, USA, p. 7–12

[5] C. Lin, S. Lu, X. Fei, D. Pai, and J. Hua, 2009, A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows, In: Proc. Services 2009, p. 284-291

[6] I.J. Taylor, E. Deelman, D.B. Gannon, and M. Shields (Eds.), 2007, Workflows for e-Science: Scientific Workflows for Grids. 1st ed. Springer.

[7] M. Mattoso, C. Werner, G.H. Travassos, V. Braganholo, L. Murta, E. Ogasawara, D. Oliveira, S.M.S.D. Cruz, and W. Martinho, 2010, Towards Supporting the Life Cycle of Large Scale Scientific Experiments, to be published in Int. J. Business Process Integration and Management, Special Issue on Scientific Workflows.

[8] E. Walker and C. Guiang, 2007, Challenges in executing large parameter sweep studies across widely distributed computing environments, In: Workshop on Challenges of large applications in distributed environments, p. 11-18, Monterey, California, USA.

[9] I. Raicu, I. Foster, and Yong Zhao, 2008, Many-task computing for grids and supercomputers, In: Workshop on Many-Task Computing on Grids and Supercomputers, p. 1-11

[10] E. Deelman, G. Mehta, G. Singh, M. Su, and K. Vahi, 2007, "Pegasus: Mapping Large-Scale Workflows to Distributed Resources", Workflows for e-Science, Springer, p. 376-394.

[11] Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, 2007, Swift: Fast, Reliable, Loosely Coupled Parallel Computation, In: Services 2007, p. 199-206.

[12] E. Ogasawara, D. Oliveira, F. Chirigati, C.E. Barbosa, R. Elias, V. Braganholo, A. Coutinho, and M. Mattoso, 2009, Exploring many task computing in scientific workflows, In: MTAGS 09, p. 1-10, Portland, Oregon.

[13] J. Freire, D. Koop, E. Santos, and C.T. Silva, 2008, Provenance for Computational Tasks: A Survey, Computing in Science and Engineering, v.10, n. 3, p. 11-21.

[14] Amazon EC2, 2010. Amazon Elastic Compute Cloud (Amazon EC2). Available at: http://aws.amazon.com/ec2/. Accessed: 5 Mar 2010.

[15] J. Dean and S. Ghemawat, 2008, MapReduce: simplified data processing on large clusters, Commun. ACM, v. 51, n. 1, p. 107-113.

[16] R. Buyya, R. Ranjan, and R. Calheiros, 2009, Modeling and Simulation of Scalable Cloud Computing Environments and the CloudSim Toolkit: Challenges, In: Proc. of HPCS 2009, Leipzig, Germany.

[17] S.P. Callahan, J. Freire, E. Santos, C.E. Scheidegger, C.T. Silva, and H.T. Vo, 2006, VisTrails: visualization meets data management, In: Proceedings of the 2006 ACM SIGMOD, p. 745-747, Chicago, IL, USA.

[18] L. Meyer, D. Scheftner, J. Vöckler, M. Mattoso, M. Wilde, and I. Foster, 2007, "An Opportunistic Algorithm for Scheduling Workflows on Grids", High Performance Computing for Computational Science - VECPAR 2006, p. 1-12.

[19] X. Fei, S. Lu, and C. Lin, 2009, A MapReduce-Enabled Scientific Workflow Composition Framework, In: ICWS, p. 663-670, Los Alamitos, CA, USA.

[20] R. Hasan, R. Sion, and M. Winslett, 2007, Introducing secure provenance: problems and challenges, In: Proceedings of the 2007 ACM workshop on Storage security and survivability, p. 13-18, Alexandria, Virginia, USA.

[21] L. Moreau, J. Freire, J. Futrelle, R. McGrath, J. Myers, and P. Paulson, 2008, "The Open Provenance Model: An Overview", Provenance and Annotation of Data and Processes, p. 323-326.

[22] C. Szyperski, 1997, Component Software: Beyond Object-Oriented Programming. Addison-Wesley Professional.

[23] N. Juristo and A.M. Moreno, 2001, Basics of Software Engineering Experimentation. 1 ed. Springer.

[24] D. Abramson, C. Enticott, and I. Altinas, 2008, Nimrod/K: towards massively parallel dynamic grid workflows, In: SC 08, p. 1-11, Austin, Texas.
