
Performance-Oriented Deployment of Streaming Applications on Cloud

Xunyun Liu and Rajkumar Buyya, Fellow, IEEE

Abstract—The performance of streaming applications is significantly impacted by the deployment decisions made at the infrastructure level, i.e., the number and configuration of resources allocated for each functional unit of the application. Current deployment practices are mostly platform-oriented, meaning that the deployment configuration is tuned to a static resource-set environment and is thus inflexible to use in clouds with an on-demand resource pool. In this paper, we propose P-Deployer, a deployment framework that enables streaming applications to run on IaaS clouds with satisfactory performance and minimal resource consumption. It achieves performance-oriented, cost-efficient and automated deployment by holistically optimizing the decisions of operator parallelization, resource provisioning, and task mapping. Using a Monitor-Analyze-Plan-Execute (MAPE) architecture, P-Deployer iteratively builds the connection between performance outcome and resource consumption through task profiling, and models the deployment problem as a bin-packing variant. Extensive experiments using both synthetic and real-world streaming applications have shown the correctness and scalability of our approach, and demonstrated its superiority over platform-oriented methods in terms of resource cost.

Index Terms—Stream processing, data stream management systems, performance optimization, resource management


1 INTRODUCTION

DRIVEN by the exponential explosion of unstructured machine-generated data and the accompanying thirst for its timely processing, a new paradigm of in-memory processing, stream processing, has emerged, bringing to market a plethora of streaming applications ranging from log monitoring to financial transaction analysis. Contrary to the traditional batch paradigm, where data are statically preserved for ensuing manipulation, stream processing introduces a "process-once-arrival" philosophy: the processing system dynamically accepts possibly endless data streams as input, and then immediately processes them using continuous queries that are issued once but run continuously over the current portion of the incoming data. Query results are constantly updated while the input data generated from logs, sensors, networks, and connected devices actively traverse the processing system.

For better programmability and manageability, streaming applications are developed on top of a specialized middleware, the Data Stream Management System (DSMS), to exploit the abstraction of processing primitives and simplify the use of distributed computing resources. The imperative programming language provided by the DSMS hides the low-level implementation complexity from application developers, allowing them to express a continuous query as a directed graph of inter-connected operators (called a topology hereinafter). In this way, the development burden of complex application logic, e.g., matrix multiplication and iterative data analytics, is significantly relieved. Moreover, the unified data stream management model gives developers the ability to gracefully partition and route streams between operators, enabling streaming applications to scale out in a large-scale distributed computing environment in the presence of a higher volume of processing load.

Though the use of a DSMS has greatly facilitated the development of streaming logic, the deployment of such a layered architecture (streaming logic, DSMS, and underlying hardware) is not transparent to developers. They are responsible for deciding how the streaming logic is carried out in a distributed environment to meet the specific processing requirement, including deciding the number and types of computing resources required by different stages of the streaming application.

Such deployment decisions depend on the type of the target environment. In the past, it was common to run streaming applications in a cluster where a combination of computing nodes was pre-configured [1], [2], [3], [4], [5]. As developers assumed a static environment and tuned the application configuration only towards higher utilization of on-premise resources, the deployment process in such an environment is platform-oriented. With the emergence of cloud computing, an increasing number of streaming applications are being migrated to the cloud to exploit the merits of virtualization, such as on-demand self-service, elastic resource pooling and the "pay-as-you-go" billing scheme. Since the cost of using the cloud is based on actual resource usage, a performance-oriented deployment model that customizes the resource provisioning with regard to the actual demands, i.e., one that enables the streaming application to reach a specific performance target (see footnote 1) while minimizing resource consumption, is long overdue.

The authors are with the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, Department of Computing and Information Systems, University of Melbourne, Parkville, Vic 3010, Australia. E-mail: [email protected], [email protected].

Manuscript received 16 Oct. 2016; revised 29 May 2017; accepted 10 June 2017. Date of publication 28 June 2017; date of current version 7 Mar. 2019. (Corresponding author: Xunyun Liu.) Recommended for acceptance by B. He, Y. Chen, and J. Zhou. Digital Object Identifier no. 10.1109/TBDATA.2017.2720622

The overall performance of a streaming application is actually determined by the interplay between a variety of contributing factors, including the implementation of the streaming logic, the characteristics of the workload, the parameters of the application and the DSMS, and the sufficiency of resource provisioning. During the deployment phase, since the logic implementation is already given and the workload characteristics are not controlled by the processing system, properly mapping the streaming logic to the underlying resources is the key for the application to accomplish a pre-defined performance target. Specifically, there are three optimization problems involved: (1) operator parallelization, which decides the number of running instances (called tasks hereinafter) for each operator to partition the input load and realise parallel and asynchronous execution; (2) resource provisioning, which estimates the resource usage and provisions the right scale of computing power to constitute the processing system; and (3) task mapping, which implements the scheduling interface in the DSMS, producing a task-to-machine mapping that distributes the computations and transformations derived from the operator logic.

Among the three, operator parallelization and task mapping have been separately investigated in the existing literature for various optimization targets [1], [3], [6], [7], [8], but resource provisioning is largely ignored in platform-oriented deployment practice. Manually deciding the required resources makes the deployment plan neither efficient nor swift enough to apply in a cloud environment.

We overcome this limitation by proposing an automated, performance-oriented deployment framework that tackles these three optimization problems holistically. The deployment framework, called P-Deployer, has the following desirable features: (1) based on fine-grained profiling information at the task level, it provisions a right-scale execution platform on the cloud, and decides the operator parallelism and task mapping configurations to guarantee high resource utilization and the desired performance outcome; (2) it uses an automatic and iterative approach to deploy the streaming application, reducing the effort required from developers to address identified performance bottlenecks; and (3) it is transparent to the application logic, i.e., existing application code can be deployed without any changes.

Our main contributions are summarized as follows:

• By profiling a real-world streaming application, we illustrate some important observations drawn from experimenting with different deployment plans, which lay down the foundation for P-Deployer.

• We present the design of P-Deployer that, to the best of our knowledge, is the first system to automatically deploy a streaming application on the cloud with a pre-defined performance target.

• We model the resource provisioning problem as a variant of the bin-packing problem, with tasks being items and machines being bins, where packing two communicating tasks together may make them occupy less volume than the sum of their individual sizes. Our heuristic reduces inter-node traffic while ensuring no computing node is overloaded.

• We implement a prototype on top of Apache Storm, and conduct a series of deployment experiments using both synthetic and real-world streaming applications to validate the resulting performance and the scalability of our approach. The results confirm that our framework significantly outperforms the state-of-the-art approaches in terms of resource cost.

2 BACKGROUND

A streaming application needs to be deployed in a particular execution environment before it can accept and process continuous data streams. In this section, we give a brief introduction to the layered structure of streaming applications and discuss the optimization problems involved in the deployment process.

As shown in Fig. 1, there are three tiers in the streaming application structure. The topology, lying in the topmost logical tier, is a directed acyclic graph that defines the streaming logic to be applied to the input data streams. Each vertex is an operator that encapsulates the semantics of a specific operation, such as filtering, join or aggregation, whereas each edge represents the direction of data transfer between upstream and downstream operators.

The DSMS tier is a middleware system that manages distributed resources and organises continuous streams to support the upper-level streaming logic defined in the topology. It is the key to realizing the "develop once, use many" concept. Since a streaming application may need to run in different environments to deal with different volumes of input, the DSMS makes the deployment plan adjustable, allowing the parallel execution of each operator to be customized with regard to the specific situation, such as the workload characteristics, the performance target, and the capacity of the underlying infrastructure.

Fig. 1. Layered structure of an example streaming application.

1. While the context of performance may vary under different types of Quality of Service (QoS), in this paper we refer to it as the ability to steadily handle an input stream of throughput T within an acceptable processing latency L.

The first deployment choice incurred in this tier is (1) operator parallelization. Specifically, the DSMS treats each operator as a divisible logical entity and parallelizes its execution using a number of asynchronous tasks. Each task is logically equivalent, as it handles a subset of the operator input and performs the same type of operation. During actual execution, the DSMS guarantees the correctness of task coordination and makes sure that the subdivided inner streams follow the pre-defined partition scheme. Therefore, developers are only required to specify the degree of parallelism for each operator so that it secures just enough tasks to keep up with its inbound load. We illustrate this process in Fig. 1 using a dashed arrow labelled "operator parallelization", which shows that Operator B is parallelized into two tasks: Task4 and Task5.

The underlying infrastructure tier is the place where the parallel execution of tasks actually happens. The construction of this tier involves the other two deployment decisions: (2) resource provisioning, where developers select the number and types of resources from the elastic resource pool in the cloud (the streaming application in Fig. 1, as an example, has a three-node virtual cluster provisioned); and (3) task mapping, which assigns the asynchronous tasks to the provisioned machines in an effort to achieve a particular deployment target or optimization goal, e.g., reaching a pre-defined throughput target, maximizing distributed resource utilization, or minimizing the overall data processing latency. In Fig. 1, Task6 of Operator C is assigned to the third node as indicated by the dashed arrow.

These three deployment decisions are highly correlated in nature. Our motivation is to iteratively optimize them to achieve automatic, performance-oriented, and cost-efficient streaming application deployment.

3 PRELIMINARIES

We performed a series of deployment experiments on an IaaS cloud to investigate how different deployment decisions affect the performance outcome and resource consumption of a streaming application.

Our proof-of-concept experiments were conducted on the Nectar Cloud (https://nectar.org.au/research-cloud/) using 4 "m2.medium" instances in the NCI availability zone (each equipped with 2 VCPUs, 6 GB RAM and a 30 GB root disk). On this virtual cluster, we deployed a word count streaming application built on top of the well-known DSMS Apache Storm (http://storm.apache.org/), version 0.10.0, using various deployment plans.

The topology of word count, depicted in Fig. 2, consists of four operators: the first operator, Kestrel Spout, pulls data from a message queue server and generates a continuous stream of tweets as its output. The second operator, JSON Parser, parses the stream and extracts the main message body. Sentence Splitter divides the main body of text into a collection of separate words, and finally, Word Counter is responsible for the final occurrence counting.

Fig. 2. The topology of the word count streaming application, where the solid arrows represent data streams.

In order to evaluate the efficiency of different deployment plans, we set up a profiling environment that feeds the streaming application with a controllable size of input stream and constantly monitors the application behaviour under that given pressure. For the sake of result stability, all measurements of performance outcomes and resource usage are averaged over 5 consecutive readings of the corresponding value, all collected after the application has stabilized. The observations and corresponding analysis are summarized as follows.

Observation 1: the resource consumption of a task is positively correlated with its stream load and its ratio of inter-node communications. We consider resource consumption in two dimensions: memory and CPU. Unlike memory consumption, which can be measured in megabytes, the definition of CPU consumption is puzzlingly vague due to the diversity of operating systems and CPU architectures. Following the method used by the resource-aware scheduler in Storm [9], we adopt a point-based system to describe the amount of CPU resources available on a node or demanded by a particular task. Typically, an available CPU core gets 100 points and a multi-core machine gets (number of cores × 100) resource points in total, whereas a task that occupies x% CPU usage as reported by the monitoring system requires x points accordingly. Besides, we define the stream load of an operator (or a task) as the amount of stream data that passes through this component during a unit period of time. Due to the DAG organization of the topology, the stream load can be calculated by summing up the throughputs of all incoming streams.
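To make this accounting concrete, the following toy sketch restates the point-based CPU convention and the stream-load definition above; the class and method names are ours, not from P-Deployer's code.

```java
import java.util.Map;

/** Toy restatement of the point-based CPU accounting and the stream-load definition. */
class PointSystem {
    /** One available core is worth 100 points, so an m-core machine offers m * 100 points. */
    static int capacityPoints(int numCores) { return numCores * 100; }

    /** A task reported at x% CPU usage by the monitoring system demands x points. */
    static double demandPoints(double cpuUsagePercent) { return cpuUsagePercent; }

    /** Stream load of an operator (or task): sum of the throughputs of all incoming streams. */
    static double streamLoad(Map<String, Double> inboundThroughput) {
        double sum = 0;
        for (double t : inboundThroughput.values()) sum += t;  // tuples per second
        return sum;
    }
}
```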

Our first experiment tracks the resource consumption of a task as it processes different sizes of stream load. We deployed the word count application using a spread-out strategy, where each operator has only one task and every task is mapped to a separate node. The results showed that, for all the examined tasks, resource consumption increases almost linearly with the size of the stream load. In particular, the CPU-bound tasks reach their maximum processing capability when they have fully occupied the available CPU resources on a single core.

The second experiment investigates how the resource consumption of a task is influenced by its ratio of inter-node communication. For each examined task, we keep the stream load fixed at 500 tuples/s (a tuple being a single datum in the stream) and vary the inter-node communication ratio by parallelizing the upstream and downstream operators and placing the spawned tasks on different machines appropriately. For example, suppose we would like to probe the resource consumption of a Sentence Splitter task TaskS1 with an inter-node communication ratio of 50 percent. Fig. 3 illustrates how we can deploy the word count application to achieve this goal: we parallelize JSON Parser and Word Counter into two tasks each, (TaskJ1, TaskJ2) and (TaskW1, TaskW2), but put only TaskJ2 and TaskW1 on the same node as TaskS1, collocated with the examined task. If we assume that the load is balanced between the sibling tasks that constitute the same operator, TaskS1 has half of its data communication transferred through the inter-node network. Once again, the results show a linear increase in resource consumption with increasing inter-node communication ratio, indicating that minimizing inter-node traffic is also quantitatively profitable from the perspective of resource savings.

Observation 2: overly parallelizing an operator on a single machine greatly hurts its performance. We conducted this experiment to break the myth that higher operator parallelism always results in better performance, i.e., that fine-grained parallel execution is always encouraged so that an operator incorporates as many tasks as possible to partition its stream load. In this experiment, we chose JSON Parser as the examined operator and tested three parallelization configurations: the first sets up 2 tasks and puts them on a separate node to avoid interference from other operators; analogously, the second initializes 8 tasks and the third creates 16 for comparison. To observe the performance differences resulting from these configurations, we provide the streaming application with a sufficiently large input stream and make sure that other operators will not be the bottleneck of the topology.

From the results we observe that the first configuration yields the best performance among the three (35 percent higher throughput than the second and 69 percent higher than the third setting). This is because spawning two tasks for JSON Parser is already adequate to make full use of the two-core machine, while having 8 or even 16 tasks on this node only imposes the additional resource costs of thread scheduling and context switching. For the third setting in particular, half of the CPU time is spent running the kernel rather than user-space processes, which greatly impairs the throughput of JSON Parser.

3.1 Assumptions

In addition to the observations made from the experiments, we make the following assumptions to build an automatic deployment framework:

(1) Tasks of the same operator fairly process the same amount of workload. In other words, the stream load of an operator can be equally partitioned across all its constituent tasks.
(2) The inter-node communication cost for each task is calculated from its bidirectional bandwidth usage, which is in line with the literature convention [1], [6], [9], [10].
(3) The same type of machine is selected on the IaaS cloud to form a homogeneous infrastructure.

4 P-DEPLOYER OVERVIEW

P-Deployer's design follows a typical Monitor-Analyze-Plan-Execute (MAPE) architecture, and it works in a profiling environment to find the desirable deployment plan. The construction of the profiling environment serves two major purposes: (1) information collection: it allows P-Deployer to probe the runtime characteristics of both the streaming application and the underlying cloud platform, thus collecting the necessary information to model the performance behaviour as well as its corresponding resource consumption; and (2) deployment verification: it verifies the proposed deployment plan against the pre-defined performance target by actual execution, i.e., by examining how the deployed streaming application performs under a profiling stream that mimics the situation of processing the maximum throughput.

The profiling environment, as shown in the bottom half of Fig. 4, consists of a message generator, a message queue, and the streaming application to be run in the IaaS cloud environment. The message generator produces a data stream at a speed specified by P-Deployer, where all the data used in this stream are collected from the production phase to simulate the real workload. The profiling input is then directed to the message queue, which works as a message buffer to avoid overwhelming the streaming application in case its processing capability cannot catch up with the performance requirement. The streaming application ingests the profiling stream from the message queue, with its layered structure shown in the rightmost frame. Alongside the streaming application, there are also some platform-dependent modules that connect P-Deployer to the profiling environment. These include the Metric Reporter, the Resource Manager and the Application Submitter, which are respectively responsible for collecting the current performance metrics, provisioning/relinquishing cloud resources, and submitting the streaming application to the DSMS according to the specific deployment plan.

Fig. 3. Partial deployment sketch showing how to deliberately set the inter-node communication ratio of TaskS1 to 50 percent. Solid arrows represent data streams while dashed arrows denote the "operator-task" containment relationship.

Fig. 4. The Monitor-Analyze-Plan-Execute (MAPE) architecture of P-Deployer and its working environment.


On top of the profiling environment, P-Deployer runs an iterative workflow that implements the MAPE loop, with each iteration proposing a concrete deployment plan and then validating its capability by profiling the deployed application. This iterative process continues until the desirable deployment plan is found, or the cost of resource provisioning exceeds the user budget. Specifically, by retrieving the output of the Metric Reporter, the Monitoring Module measures how the streaming application performs under the profiling load. The Performance Analyser conducts boundary checks on the set of metrics and determines whether the performance target has been met. If further adjustment is required, it hands the collected runtime information to the Deployment Planner and indicates the possible cause of the performance issue. The Deployment Planner, as shown in the grey box, then updates the inaccurate model inputs and makes a new deployment plan in an attempt to remedy the performance bottleneck. The executors of P-Deployer are embedded within the cloud platform, following the deployment plan's instructions to control the speed of data generation, manage the cloud resources, and re-submit the streaming application for the next round of evaluation.

The essence of P-Deployer lies in the planning phase of the MAPE loop, which proposes a holistic deployment plan based on the profiled runtime information. We briefly summarise this phase in three steps:

(1) Building task profiles: modelling the characteristics of each operator by profiling one of its tasks. The task profile essentially depicts the relationship between the desired performance behaviour and the estimated resource consumption at the task level, which is the most fine-grained level of the DSMS. Note that all tasks of a single operator share the same task profile due to their homogeneity.

(2) Operator parallelization: deciding the parallelism degree for each operator based on its stream load and the profile of its tasks. The result of this step is a set of tasks along with their requested resource consumption.

(3) Resource estimation and task mapping: formulating the deployment optimization problem as a bin-packing variant, the solution to which estimates the overall resource demands and produces the mapping result at the same time. The goal is to minimize the number of used machines while satisfying the resource needs of collocated tasks.
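The control flow described above can be summarised in a few lines. The sketch below is a minimal skeleton of the MAPE iteration; the four interfaces are hypothetical stand-ins for the modules in Fig. 4, not the paper's actual API.

```java
/** Minimal skeleton of P-Deployer's MAPE iteration (illustrative names throughout). */
interface Monitor   { Metrics collect(); }
interface Analyzer  { boolean targetMet(Metrics m); }
interface Planner   { DeploymentPlan plan(Metrics m); }
interface Executor  { double cost(DeploymentPlan p); void apply(DeploymentPlan p); }

class Metrics {}          // performance metrics gathered under the profiling load
class DeploymentPlan {}   // parallelism degrees, machine count, task-to-machine map

class MapeLoop {
    /** Iterate until the target is met or the plan's provisioning cost exceeds the budget. */
    static DeploymentPlan run(Monitor mon, Analyzer an, Planner pl, Executor ex, double budget) {
        DeploymentPlan current = null;
        while (true) {
            Metrics m = mon.collect();                  // Monitor: profile the deployed app
            if (an.targetMet(m)) return current;        // Analyze: boundary checks pass
            DeploymentPlan next = pl.plan(m);           // Plan: profile -> parallelize -> pack
            if (ex.cost(next) > budget) return current; // stop when over the user budget
            ex.apply(next);                             // Execute: re-provision and re-submit
            current = next;
        }
    }
}
```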

5 DEPLOYMENT PLAN GENERATION

In this section, we discuss in detail how the planning steps are carried out to holistically decide operator parallelization, resource provisioning and task mapping.

5.1 Building Task Profile

As introduced in Section 3, we consider two types of resources in this work, memory and CPU, which are measured in megabytes and in a point-based approach, respectively. A deployment plan is considered feasible only if the aggregated resource demands of the collocated tasks on each node can be satisfied in both dimensions. Therefore, based on the analysis of collected information, we build a profile for each task to reflect the relationship between its performance behaviour and the corresponding resource consumption. It allows us to estimate how many resources this task would require to reach a specific performance target, i.e., to process a certain size of stream load and to send/receive a certain amount of messages remotely.

The intuition behind the task profile is illustrated in Fig. 5. Performance-oriented deployment has a prerequisite: properly translating the performance demand of a streaming application into its predicted resource consumption. However, there is no theoretical model nor empirical work capable of directly achieving this goal, mainly because the layered application structure greatly complicates the relationship between performance and resource usage. To fill this gap, we establish the translation by building a profile for each task, leading to the following working process: through operator parallelization, we break down the operators in the topology into a set of tasks; meanwhile, the application-level performance demand is also decomposed into a performance requirement for each task according to the stream-balancing assumption we have made. As the task profile models the resource consumption individually for each task, we estimate the overall resource consumption of the entire application resulting from a task mapping by aggregating the resource needs of collocated tasks.

Fig. 5. Translating the performance demand of a streaming application into the estimated resource consumption through its task profiles.

Table 1 enumerates the attributes of a task profile. Sojourn time t_soj is the period of time that a tuple stays within the task while being processed; since we model each task as a single-threaded entity, the sojourn times of different tuples cannot overlap. The CPU consumption of handling a tuple consists of two parts: the processing cost c_p that is spent on executing the streaming semantics, and the transmission cost c_t that is consumed by network-related activities, such as serialization/deserialization, message buffering and performing the actual send/receive operations. Note that we only count inter-node communication in the calculation of transmission cost. This is because intra-node communication normally happens within shared memory and is backed by high-performance thread-messaging libraries (e.g., the LMAX Disruptor), thus incurring negligible overhead compared to network-based communication.

TABLE 1: The Attributes of a Task Profile
Symbol | Description
t_soj | Sojourn time of a tuple being processed
c_p | CPU consumption of processing a tuple
c_t | CPU consumption of transmitting a tuple
m_pt | Memory consumption of processing and transmitting a tuple
n_i, n_o | Number of input (output) streams
S_k | Size (throughput) of input stream k, k = 1, ..., n_i
S'_k | Size (throughput) of output stream k, k = 1, ..., n_o

On the other hand, m_pt denotes the memory footprint of handling a tuple. It is not necessary to differentiate whether the memory is consumed by data processing or by transmission, because little memory is allocated for internal message buffers in order to avoid high queuing latency. Only memory-intensive computations, such as large windowed joins or cache-based analytic algorithms, can result in a considerable amount of memory usage that might define the machine characteristics of the provisioned resources.

5.1.1 Modelling Task Resource Consumption

For a task t that has $n_i$ input streams with sizes denoted as $\{S_1, S_2, \ldots, S_{n_i}\}$ and $n_o$ output streams of sizes $\{S'_1, S'_2, \ldots, S'_{n_o}\}$, its CPU consumption $C_t$ can be modelled as

$$C_t = \sum_{k=1}^{n_i} S_k c_p + \left( \sum_{k \in Q} S_k + \sum_{k \in Q} S'_k \right) c_t$$
$$C_{t,min} = \sum_{k=1}^{n_i} S_k c_p \qquad C_{t,max} = \sum_{k=1}^{n_i} S_k c_p + \left( \sum_{k=1}^{n_i} S_k + \sum_{k=1}^{n_o} S'_k \right) c_t, \quad (1)$$

where Q indicates the set of inter-node communications, i.e., $k \in Q$ means that the input stream k is received from (or the output stream k is sent to) a task located on another node. Depending on the final location of task t, its CPU consumption varies from the minimum value $C_{t,min}$, when Q is empty, to the maximum value $C_{t,max}$, when all its communication happens across the network.

Analogously, we model the memory consumption of this task as

$$M_t = \left( \sum_{k=1}^{n_i} S_k + \sum_{k=1}^{n_o} S'_k \right) m_{pt}. \quad (2)$$
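As an illustration, the following sketch evaluates Eqs. (1) and (2) for one task, with the sets of inter-node streams playing the role of Q; the TaskProfile fields mirror Table 1, while the class and method names are ours.

```java
import java.util.Set;

/** Task profile attributes from Table 1 (field names are illustrative). */
class TaskProfile {
    double cp;   // CPU points to process one tuple
    double ct;   // CPU points to transmit one tuple across the network
    double mpt;  // memory (MB) to process and transmit one tuple
}

class ResourceModel {
    /** Eq. (1): CPU demand C_t; remoteIn/remoteOut are the inter-node streams (set Q). */
    static double cpuDemand(TaskProfile p, double[] in, double[] out,
                            Set<Integer> remoteIn, Set<Integer> remoteOut) {
        double processing = 0, remoteLoad = 0;
        for (int k = 0; k < in.length; k++) {
            processing += in[k] * p.cp;                     // every input tuple is processed
            if (remoteIn.contains(k)) remoteLoad += in[k];  // only inter-node streams pay c_t
        }
        for (int k = 0; k < out.length; k++)
            if (remoteOut.contains(k)) remoteLoad += out[k];
        return processing + remoteLoad * p.ct;              // equals C_{t,min} when Q is empty
    }

    /** Eq. (2): memory demand M_t grows with the total stream load through the task. */
    static double memDemand(TaskProfile p, double[] in, double[] out) {
        double total = 0;
        for (double s : in) total += s;
        for (double s : out) total += s;
        return total * p.mpt;
    }
}
```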

5.2 Operator Parallelization

In light of the observation that over-parallelization hurts operator performance (see Section 3), we adopt a minimal parallelism strategy: each operator spawns only the least number of tasks needed to keep up with its performance requirement. Therefore, the parallelism degree of an operator is determined by two factors: the stream load of this operator under the profiling input, which can be seen as a performance requirement obtained from the monitoring module; and the maximum processing capacity of its constituent tasks, which can be calculated using the task profile established in the previous step.

Specifically, the maximum processing capacity of a task refers to the largest stream load it can sustain at runtime, which is determined by the implementation of the stream logic as well as the capability of the execution environment. Having modelled the resource consumption on a per-tuple basis, the task profile reveals the confining resource that prohibits the entity from achieving higher throughput on this particular platform. In this sense, we provide a classification of tasks based on the type of confining resource, and then discuss how to parallelize an operator op whose tasks fall into each of these categories to achieve the minimal parallelism objective. The symbols used in this section are summarised in Table 2.

TABLE 2: Symbols Used for Operator Parallelization
Symbol | Description
S_op | Monitored stream load of operator op
P_op | Parallelism degree (number of tasks) of op
{c_p, c_t, m_pt, t_soj} | Task profile attributes shared by all tasks constituting operator op
α | Upper limit on the CPU usage of a single I/O-bound task for performing data transmission
β | Upper limit on the memory usage of a single memory-bound task

• CPU-bound: a task of this kind consumes a considerable amount of CPU resources to process a single tuple, such as multiplying small matrices for machine learning algorithms or performing iterative calculations for optimization purposes. A CPU-bound task can utilize at most 100 points of CPU resources for covering the processing cost c_p due to its single-threaded nature, meaning that its maximum processing capacity is reached once it has fully occupied a CPU core for processing streams. Therefore, for an operator op whose tasks are CPU-bound and which has a stream load of S_op, its parallelism degree is decided as follows:

$$P_{op}(\text{CPU-bound}) = \left\lceil \frac{S_{op}}{100/c_p} \right\rceil.$$

• I/O-bound: contrary to CPU-bound, I/O-bound tasks spend more time waiting for I/O operations than performing actual computation. Their distinctive identifying characteristic is that c_t outweighs c_p significantly in the task profile. Event log processing, for example, incorporates typical I/O-bound tasks that ingest a large log stream while applying only filtering or other trivial transformations to it. Previous best practice in platform-oriented deployment has shown that a CPU core should host several I/O-bound tasks to make efficient use of the network connection (see http://www.slideshare.net/ptgoetz/scaling-apache-storm-strata-hadoopworld-2014); therefore, we model the maximum processing capability of this kind of task by limiting the CPU resources it can get for data transmission. If the threshold, denoted as α, is set to 0.1, then at most 10 I/O-bound tasks are permitted to occupy a CPU core for data transmission. Accordingly, we calculate the parallelism degree for I/O-bound operators as follows:

$$P_{op}(\text{I/O-bound}) = \left\lceil \frac{S_{op}}{\alpha/c_t} \right\rceil.$$

• Sojourn time-bound: some tasks need to invoke an external service to complete a tuple transaction, such as connecting to a remote database or calling a third-party API. These operations often require little CPU and memory consumption but may result in a substantial sojourn time for each tuple being processed. In this case, the maximum processing capacity of each task is bound by wall-clock time, and thus operator parallelization is conducted as follows:

$$P_{op}(\text{Sojourn time-bound}) = \left\lceil S_{op} \cdot t_{soj} \right\rceil.$$

• Memory-bound: as the stream load increases, some tasks may use a significant amount of memory to maintain a large window cache or store enormous intermediate results. Similar to the I/O-bound case, we limit the maximum memory usage of a single task (denoted as β) and in turn deduce how many tasks are needed for a memory-bound operator to sustain a stream load of S_op:

$$P_{op}(\text{Memory-bound}) = \left\lceil \frac{S_{op}}{\beta/m_{pt}} \right\rceil.$$
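The four rules above reduce to simple ceiling computations, as in the sketch below. The enum and constant names are ours; ALPHA is expressed in the same CPU points as c_t (0.1 of a core = 10 points in the paper's setting), and the BETA value is purely illustrative, not one the paper specifies.

```java
/** The four parallelization rules as ceiling computations (illustrative names/values). */
enum Bound { CPU, IO, SOJOURN_TIME, MEMORY }

class Parallelizer {
    static final double ALPHA_POINTS = 10.0;  // CPU cap per I/O-bound task (0.1 core)
    static final double BETA_MB = 512.0;      // memory cap per memory-bound task (assumed)

    /** Returns P_op for an operator with stream load sOp (tuples/s) and the given profile. */
    static int parallelism(Bound bound, double sOp,
                           double cp, double ct, double mpt, double tSojSeconds) {
        switch (bound) {
            case CPU:          return ceil(sOp / (100.0 / cp));     // one core = 100 points
            case IO:           return ceil(sOp / (ALPHA_POINTS / ct));
            case SOJOURN_TIME: return ceil(sOp * tSojSeconds);      // bound by wall-clock time
            case MEMORY:       return ceil(sOp / (BETA_MB / mpt));
            default:           throw new IllegalArgumentException("unknown bound");
        }
    }

    private static int ceil(double v) { return (int) Math.ceil(v); }
}
```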

Apart from deciding the number of tasks, operator parallelization also models the resource consumption of each spawned task by taking into account its share of the stream load as well as the associated task profile. However, the CPU usage is only a range estimate, as the actual value varies depending on the task location due to different inter-node communication costs.

5.3 Resource Estimation and Task Mapping

The problem we are trying to solve is to assign tasks to machines such that (1) the aggregated resource consumption of collocated tasks is satisfied, and (2) the cost of resource provisioning is minimized. Specifically, each task can be considered as an item of multi-dimensional volume characterised by its resource requirements, while each machine is a bin of the same size, as per Assumption 3 made in Section 3.1. The optimization target is to minimize the number of used machines. Therefore, this is a variant of the bin-packing problem that can be formalized in linear programming form.

Table 3 summarizes the newly introduced symbols in this section.

5.3.1 Problem Definition

Given a positive integer number of machines (bins) with CPU capacity $W_c$ and memory capacity $W_m$, and a list of $n$ tasks (items) $t_1, t_2, \ldots, t_n$ with CPU and memory demands denoted as $C_{t_i}$ and $M_{t_i}$ ($i \in 1, 2, \ldots, n$), respectively, the problem is formulated as follows:

$$\begin{aligned}
\text{minimize} \quad & \sum_{k=1}^{K} y_k \\
\text{subject to} \quad & \sum_{k=1}^{K} x_{i,k} = 1, \quad i = 1, \ldots, n, \\
& \sum_{i=1}^{n} C_{t_i} x_{i,k} \le W_c y_k, \quad k = 1, \ldots, K, \\
& \sum_{i=1}^{n} M_{t_i} x_{i,k} \le W_m y_k, \quad k = 1, \ldots, K,
\end{aligned}$$

where $K$ is an upper bound on the number of machines needed, and the variables $y_k$ and $x_{i,k}$ are

$$y_k = \begin{cases} 1 & \text{if machine } k \text{ is used,} \\ 0 & \text{otherwise,} \end{cases} \qquad x_{i,k} = \begin{cases} 1 & \text{if task } i \text{ is assigned to machine } k, \\ 0 & \text{otherwise.} \end{cases}$$

The uniqueness of this problem lies in the fact that packing two communicating tasks together on the same machine results in less resource consumption than the sum of their individual demands. This characteristic is reflected in the calculation of $C_{t_i}$ shown in Eq. (3), which is derived from Eq. (1) but makes use of the monitored stream loads and the result of operator parallelization to deduce the sizes of the input/output streams of task $t_i$:

$$C_{t_i} = \sum_{k=1}^{n_i} \frac{S_k}{P_{op}} c_p + \left( \sum_{k=1}^{n_i} \frac{S_k}{P_{op_k} \cdot P_{op}} \left(P_{op_k} - Adj_k\right) + \sum_{k=1}^{n_o} \frac{S'_k}{P_{op'_k} \cdot P_{op}} \left(P_{op'_k} - Adj'_k\right) \right) c_t. \quad (3)$$

The variables used in Eq. (3) are illustrated in Fig. 6: $t_i$ is the examined task, affiliated with operator $op$. According to the application topology, $op$ has $n_i$ upstream operators $\{op_1, \ldots, op_{n_i}\}$ and $n_o$ downstream operators $\{op'_1, \ldots, op'_{n_o}\}$. The size of input stream $k$ between operator $op_k$ and $op$ is denoted as $S_k$, and it can be equally partitioned into communications between the constituent tasks of $op_k$ and $op$ based on the stream-balancing assumption. The parallelism degree of $op_k$ is denoted as $P_{op_k}$, and $Adj_k$ is the number of its spawned tasks that are collocated with $t_i$ on the same node. Similar notations also apply to the downstream operators for convenience of presentation.

Fig. 6. The illustration of variables used in the calculation of resource consumption for task $t_i$, where tasks in grey indicate that they are collocated on the same machine.

TABLE 3: Symbols Used for Resource Estimation and Task Mapping
Symbol | Description
$W_c$ | CPU capacity of a provisioned machine
$W_m$ | Memory capacity of a provisioned machine
$n$ | Number of tasks
$t_i$ | Tasks to be assigned, $i = 1, \ldots, n$
$K$ | Upper bound on the number of machines used
Symbols used in Eqs. (3), (4) and shown in Fig. 6:
$op_k$ | Upstream operator $k$ of $op$ ($k = 1, \ldots, n_i$)
$op'_k$ | Downstream operator $k$ of $op$ ($k = 1, \ldots, n_o$)
$S_k$ | Size of input stream $k$ between $op_k$ and $op$ ($k = 1, \ldots, n_i$)
$S'_k$ | Size of output stream $k$ between $op$ and $op'_k$ ($k = 1, \ldots, n_o$)
$Adj_k$ | Number of tasks of $op_k$ collocated with task $t_i$
$Adj'_k$ | Number of tasks of $op'_k$ collocated with task $t_i$

Using the notations illustrated in Fig. 6, we also present the formulation of the memory consumption of task $t_i$:

$$M_{t_i} = \left( \sum_{k=1}^{n_i} S_k + \sum_{k=1}^{n_o} S'_k \right) \frac{m_{pt}}{P_{op}}. \quad (4)$$

In order to minimize $C_{t_i}$, putting successive tasks on the same node is always encouraged as long as the target node still has sufficient capacity to accommodate their aggregated resource demands. Therefore, by modelling the problem as a special case of two-dimensional vector bin-packing, we have already considered the common requirement of task placement: reducing inter-node communication whenever possible. The solution to this problem should be a trade-off between consolidating tasks and preventing resource contention on each node.

As for the problem complexity, classical bin-packing is already NP-hard, as reduced from the PARTITION problem [11]. However, with overlapping tasks whose resource consumption depends on the packing result, our task mapping problem is an even more complicated variant that must rely on approximation algorithms to remain tractable.

5.3.2 Heuristics for Solving Bin-Packing Problem

We opt for heuristic methods mainly for efficiency reasons. The scale of the problem increases with the application performance requirement, which may involve thousands of tasks to be assigned. With such a huge solution space, it is computationally infeasible to search for the optimal result using exact algorithms such as bin completion (BC) [12] and branch-and-price (BCP) [13]. Additionally, packing speed is of crucial importance for the fast deployment of streaming applications: P-Deployer may need to invoke the bin-packing process multiple times, with each result being verified through profiling, as satisfying the model constraints does not necessarily guarantee that the performance requirement is satisfied. Therefore, we prioritise execution efficiency in the heuristic design, and present a lightweight implementation that makes it a good fit for the real-time streaming context.

The proposed solution is analogous to First Fit Decreasing (FFD), one of the most natural heuristics for bin-packing, which is known to be effective in the 1-dimensional case as it is guaranteed to produce a result using no more than (11/9) OPT + 1 bins (OPT being the number of bins in the optimal solution). FFD is essentially a greedy algorithm that sorts the items in a particular order (normally by descending size) and then sequentially places each item in the first bin that has sufficient capacity. However, in order to cope with the multi-dimensional and overlapping nature of our problem, this process has to be generalized in three aspects, as shown in Algorithm 1.

As CPU and memory are measured in different metrics, Algorithm 1 first normalizes the task resource demands $C_t$ and $M_t$ with regard to the machine resource availabilities $W_c$ and $W_m$, for ease of comparison of resource scarcity. After this step, each machine can be considered as a bin of unit size and each task is denoted by a 2-d decimal vector.

Second, two functions are introduced to make different tasks comparable in terms of their packing priority: (1) a resource saving function $s(t,m)$ that calculates the amount of resource savings if task $t$ is placed on machine $m$; and (2) a priority function $p(t,m)$ that assigns task $t$ a scalar considering not only the intuition of putting the "largest" item first, but also the inclination to maximize the potential resource savings.

Algorithm 1. The Task Mapping and Resource Estimation Algorithm
Input: A task set t⃗ = {t_1, t_2, ..., t_n} to be assigned
Output: A machine set m⃗ = {m_1, m_2, ..., m_{n_m}}, with each machine hosting a disjoint subset of t⃗, where n_m is the number of used machines
1:  Normalize resource demands for t⃗;
2:  n_m ← 0;
3:  while there are tasks remaining in t⃗ to be placed do
4:    Start a new machine m;
5:    Increase n_m by 1;
6:    while there are tasks that fit in machine m do
7:      foreach t ∈ t⃗ do
8:        Calculate s(t,m) (Eq. (5));
9:        Calculate p(t,m) (Eq. (6));
10:     end
11:     Place the task with the largest p(t,m) into machine m;
12:     Remove that task from t⃗;
13:     Update the remaining capacity of machine m;
14:   end
15: end
16: return m⃗;

Lastly, since the calculation of s(t,m) depends on what has already been put into machine m, we adopt a different view to implement this FFD variant. Contrary to the classical "item-centric" FFD implementations, which require all tasks to be sorted beforehand and then packed strictly in the pre-calculated order, we implement a different perspective of FFD, called "bin-centric", which allows the packing priority of the remaining tasks to be dynamically updated after each task assignment in order to take the current system status into consideration. The task with the largest priority at that moment is packed next, and a different definition of priority would lead to a different FFD heuristic.

Implemented this way, there is only one machine open at any time, and the algorithm keeps filling it with suitable tasks until there is not enough capacity for any remaining task. The rest of this section explains in detail how the algorithm decides the priority order among those "suitable tasks".
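A compact sketch of this bin-centric variant follows. Task, Machine and the Scorer callback are placeholder structures (demands assumed already normalized to the unit-size bin), and the Eq. (3) demand-shrinking step is only indicated by a comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Bin-centric FFD sketch of Algorithm 1 (illustrative structures throughout). */
class Task { double cpu, mem; }

class Machine {
    double cpuLeft = 1.0, memLeft = 1.0;           // unit-size bin after normalization
    final List<Task> packed = new ArrayList<>();
    boolean fits(Task t) { return t.cpu <= cpuLeft && t.mem <= memLeft; }
    void place(Task t)   { packed.add(t); cpuLeft -= t.cpu; memLeft -= t.mem; }
}

class BinCentricFFD {
    interface Scorer { double priority(Task t, Machine m); }  // p(t,m), Eq. (6)

    static List<Machine> pack(List<Task> tasks, Scorer scorer) {
        List<Machine> machines = new ArrayList<>();
        while (!tasks.isEmpty()) {
            Machine m = new Machine();             // exactly one machine is open at a time
            machines.add(m);
            while (true) {
                Task best = null;
                double bestP = Double.NEGATIVE_INFINITY;
                for (Task t : tasks) {             // re-rank remaining tasks after each placement
                    if (!m.fits(t)) continue;
                    double p = scorer.priority(t, m);
                    if (p > bestP) { bestP = p; best = t; }
                }
                if (best == null) break;           // no remaining task fits this machine
                m.place(best);
                tasks.remove(best);
                // A full implementation would now shrink the CPU demands of tasks adjacent
                // to 'best' per Eq. (3); that overlap is what FFD is generalized to handle.
            }
            if (m.packed.isEmpty())
                throw new IllegalStateException("a task is larger than a whole machine");
        }
        return machines;
    }
}
```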

Calculating s(t,m). When task $t$ is to be placed on machine $m$, any previously packed task on this machine that is adjacent to $t$ benefits from the assignment by adding 1 to the corresponding position of its $\vec{Adj}$ list, which causes its CPU consumption to decrease according to Eq. (3). Note that $\vec{Adj}$ is initialized to all zeros for every task, implying that all unpacked tasks are considered to be external by default when calculating Eq. (3). Analogously, the CPU demand of $t$ itself is also reduced by the allocation. Therefore, $s(t,m)$ can be formulated as follows, where $t_k \in \wp(m)$ represents the tasks that have already been packed in machine $m$:

$$s(t,m) = C_{t,max} - C_t + \sum_{t_k \in \wp(m)} \left( C_{t_k}^{\vec{Adj}} - C_{t_k}^{\vec{Adj}+1} \right). \quad (5)$$

Calculating p(t,m). The task priority function is designed based on the following considerations:

(1) Resource saving is prioritised, as it encourages communicating tasks to be packed tightly together.
(2) If the resource savings are similar, or there is no saving from placing any of the remaining tasks, the heuristic picks the task that best fills the remaining capacity of the open machine.
(3) Each resource dimension is properly weighted to reflect its relative scarcity, so that when the mapping problem is in essence confined by one type of resource, the heuristic is dominated by this dimension and reduces to 1-d FFD when necessary. This can be achieved by assigning a larger coefficient to the scarce dimension.

Therefore, p(t,m) is formulated as follows:

$$p(t,m) = a_r \cdot s(t,m) - a_c \left(C_t - r_m^{cpu}\right)^2 - a_m \left(M_t - r_m^{mem}\right)^2, \quad (6)$$

where $r_m^{cpu}$ and $r_m^{mem}$ represent the remaining capacities of machine $m$, and $a_r, a_c, a_m$ are weight coefficients that adjust the importance of each term. While $a_r$ is chosen as a user parameter, $a_c$ and $a_m$ can be calculated as the average task demand in each dimension: $a_c = \frac{1}{n} \sum_{k=1}^{n} C_{t_k}$, $a_m = \frac{1}{n} \sum_{k=1}^{n} M_{t_k}$.
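For illustration, the priority function can be coded directly from Eq. (6). The resource-saving term must be computed per Eq. (5) and is passed in here; the constructor's weight initialization follows the averaging rule just described, and all names are ours.

```java
/** Sketch of the priority function p(t,m) in Eq. (6) (illustrative names). */
class PriorityFunction {
    final double ar;       // user-chosen weight on the resource-saving term
    final double ac, am;   // average task demands, one per resource dimension

    PriorityFunction(double ar, double[] cpuDemands, double[] memDemands) {
        this.ar = ar;
        double c = 0, m = 0;
        for (double v : cpuDemands) c += v;
        for (double v : memDemands) m += v;
        this.ac = c / cpuDemands.length;   // a_c = (1/n) * sum of C_{t_k}
        this.am = m / memDemands.length;   // a_m = (1/n) * sum of M_{t_k}
    }

    /** Larger is better: reward savings, penalize a poor fit to the remaining capacity. */
    double p(double saving, double cpuDemand, double memDemand,
             double cpuLeft, double memLeft) {
        return ar * saving
             - ac * Math.pow(cpuDemand - cpuLeft, 2)
             - am * Math.pow(memDemand - memLeft, 2);
    }
}
```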

6 IMPLEMENTATION

The setup of the profiling environment was briefly introduced in Section 4. More specifically, the Message Generator in Fig. 4 is a Java program that reads a workload file on demand to emit a profiling stream of a particular size, while the Message Queue is a distributed queueing system implemented on Twitter Kestrel (https://github.com/twitter-archive/kestrel) that enables controllable message buffering. The Thrift interface of Kestrel allows P-Deployer to easily retrieve the length of the message queue and thus determine whether the streaming application has been overwhelmed by the profiling data.

Following the MAPE working process, Fig. 7 describes how P-Deployer is integrated with Apache Storm in our prototype. Apache Storm is selected as our target DSMS not only because it is widely adopted in both academia and industry, but also because it offers a built-in metric system and Flux, an external configuration reader that greatly facilitates the application of versatile deployment schemes.

Monitor. The Metric Reporter in Fig. 4 is actually implemented by two components that collect system-level and application-level metrics, respectively. The collected information is stored in MongoDB (https://www.mongodb.com/) and processed in order to determine the task profile attributes listed in Table 1. The Low Level Metric Reporter running in each Worker Node consists of an external statistics collection daemon, collectd (https://collectd.org/), and several extended Storm modules. Specifically, collectd runs alongside the Supervisor daemon to probe the CPU and memory utilization of the Worker Process with a resolution of 10 seconds. In storm-core, we implement the Task Wrapper, which encapsulates the task execution with the logic of sampling and reporting resource consumption at the task level: the CPU consumption of an operator's execute method (c_p) is probed using the ThreadMXBean class, while the memory consumption (m_pt) is obtained by retrieving the size of the operator state. Recalling that we classify the overall CPU consumption obtained at the process level into processing cost and transmission cost, the CPU consumption of tuple transmission (c_t) is a derived metric that requires comparing the task-level statistics with the process-level statistics. To avoid excessive profiling overhead, P-Deployer sets the default sampling rate to 0.05, i.e., selecting 1 out of 20 tuples to collect and report metrics. We also provide a Topology Adapter in storm-core that seamlessly hides the adaptation effort required at the application level to leverage this profiling framework.
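As a rough illustration of how such task-level probing can work, the sketch below samples roughly one in twenty invocations and measures the executor thread's CPU time around the execute call using ThreadMXBean; it is a simplification of the Task Wrapper, not its actual code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Simplified stand-in for the Task Wrapper's CPU sampling. */
class CpuProbe {
    private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();
    private static final double SAMPLE_RATE = 0.05;  // P-Deployer's default: 1 in 20 tuples

    /** Returns the sampled CPU nanoseconds, or -1 when this invocation is not sampled. */
    static long timedExecute(Runnable executeMethod) {
        if (!BEAN.isCurrentThreadCpuTimeSupported() || Math.random() >= SAMPLE_RATE) {
            executeMethod.run();
            return -1;
        }
        long before = BEAN.getCurrentThreadCpuTime();  // CPU time of this thread, in ns
        executeMethod.run();
        return BEAN.getCurrentThreadCpuTime() - before;
    }
}
```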

The High Level Metric Reporter, on the other hand, collects metrics on the application performance. Some of them are directly obtained from the UI daemon on Nimbus, such as the stream loads between different operators (S_k, S'_k), the sojourn time of the task profile (t_soj), and the application complete latency (the average time taken for a tuple and all its offspring to be completely processed by the topology). Some metrics, however, need specific post-processing. For example, there is no default definition of throughput; the reporter therefore calculates the overall throughput of a streaming application based on its monitoring interval and the observed cumulative number of acknowledgements or emitted data, depending on whether the application adopts reliable message processing or not.

Fig. 7. The integration of P-Deployer into Apache Storm.

Analyse and Plan. P-Deployer, implemented as a single Java program, comprises both the analytical and planning functionalities. It queries MongoDB for the latest metrics and applies a set of boundary-check rules to determine the current application state and update the task profile accordingly. A newly updated task profile triggers the planning phase to be performed again, resulting in a deployment plan that specifies the number of machines used and the assignment location of each task.
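As a sketch of what such a boundary-check rule might look like, the snippet below inflates a task's profiled unit cost when its measured CPU share exceeds a bound. The TaskProfile type, the 0.8 threshold and the 1.2 inflation factor are illustrative assumptions, not the actual rule set.

    public class BoundaryChecker {
        static class TaskProfile {
            private double unitCpuCost;
            TaskProfile(double c) { unitCpuCost = c; }
            double getUnitCpuCost() { return unitCpuCost; }
            void setUnitCpuCost(double c) { unitCpuCost = c; }
        }

        private static final double CPU_UPPER = 0.8; // assumed utilization bound

        // If a task's measured CPU share exceeds the bound, enlarge its
        // profiled unit cost so that replanning provisions more capacity.
        public void check(TaskProfile profile, double measuredCpuShare) {
            if (measuredCpuShare > CPU_UPPER) {
                profile.setUnitCpuCost(profile.getUnitCpuCost() * 1.2);
            }
        }
    }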

Execute. Changes to the virtual infrastructure are made through Apache jclouds (http://jclouds.apache.org/), where any newly provisioned machine is initialized from an image that already has Storm pre-configured. For the actual application deployment, P-Deployer uses an external YAML file to specify the parallelism degrees of operators and submits the streaming application to Nimbus by invoking the Storm CLI. The extended Meta-based scheduler guarantees that each task is assigned to its designated Supervisor and Worker Node.
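A minimal jclouds sketch of the provisioning step is shown below. The endpoint, credentials, flavour and image identifiers are placeholders (OpenStack providers typically expect region-qualified ids), and the actual P-Deployer code may differ.

    import java.util.Set;
    import org.jclouds.ContextBuilder;
    import org.jclouds.compute.ComputeService;
    import org.jclouds.compute.ComputeServiceContext;
    import org.jclouds.compute.domain.NodeMetadata;
    import org.jclouds.compute.domain.Template;

    public class WorkerProvisioner {
        public static void main(String[] args) throws Exception {
            // Build a compute context against an OpenStack Nova endpoint
            // (endpoint, tenant and credentials below are placeholders).
            ComputeServiceContext ctx = ContextBuilder.newBuilder("openstack-nova")
                    .endpoint("http://openstack.example.org:5000/v2.0/")
                    .credentials("tenant:user", "password")
                    .buildView(ComputeServiceContext.class);
            ComputeService compute = ctx.getComputeService();

            // Request one more m1.medium worker from a pre-baked Storm image.
            Template template = compute.templateBuilder()
                    .hardwareId("m1.medium")        // assumed flavour id
                    .imageId("storm-worker-image")  // image with Storm pre-configured
                    .build();
            Set<? extends NodeMetadata> nodes =
                    compute.createNodesInGroup("storm-workers", 1, template);
            System.out.println("Provisioned: " + nodes.iterator().next().getHostname());
            ctx.close();
        }
    }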

7 PERFORMANCE EVALUATION

To validate the correctness and efficiency of the proposed prototype, we have conducted three different sets of experiments:

(1) The applicability evaluation validates whether P-Deployer is capable of deploying a variety of streaming applications towards their pre-defined performance targets. The test applications incorporate different topology structures in order to verify the applicability and robustness of P-Deployer.

(2) The scalability evaluation shows the runtime behaviour of P-Deployer using relatively large test cases, where the application to be deployed has a more complicated topology structure or a higher performance target. The runtime overhead of P-Deployer is also assessed in this experiment.

(3) The cost efficiency evaluation compares P-Deployer with the state-of-the-art platform-oriented method in terms of resource usage, as they endeavour to achieve the same performance target in deployment.

7.1 Experiment Setup

The experiment environment is set up on a private cloud supported by OpenStack, which is located in the CLOUDS lab at the University of Melbourne. The environment consists of three IBM X3500 M4 machines, each equipped with two Intel Xeon E5-2620 processors (6 cores @ 2.0 GHz), 64 GB RAM and 2.1 TB HDD. The virtual cluster deployed on this physical environment is composed of a Nimbus node, a ZooKeeper node and several on-demand Worker Nodes. All machines are provisioned from the same "m1.medium" template (2 VCPUs and 4 GB RAM).

The streaming applications used (a.k.a. topologies) and the evaluation methodology are discussed below.

7.1.1 Testing Topologies

There are five testing topologies: three synthetic and two drawn from real-world streaming scenarios.

Micro-Benchmark. The synthetic topologies, collectively called the micro-benchmark, evaluate how P-Deployer generalizes to different topology structures. As shown in Fig. 8, the micro-benchmark includes three common structures: Linear, Diamond, and Star, covering the cases where an operator has (1) one input and one output, (2) multiple outputs or multiple inputs, and (3) multiple inputs and multiple outputs, respectively.

The execute method of each operator is implemented in one of four patterns in order to reflect different operator types. Specifically, CPU-bound operators invoke the random number generation method Math.random() 30,000 times per tuple to generate a significant amount of CPU consumption, while I/O-bound operators only apply a JSON parse operation to the incoming tuple for fast processing; sojourn time-bound operators sleep for 10 milliseconds upon any tuple receipt, and memory-bound operators temporarily store any message received, maintaining a sliding window of 300 seconds length with a 60-second sliding interval.
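For concreteness, a CPU-bound operator of this kind can be written as a Storm bolt along the following lines. This is a sketch against the org.apache.storm 1.x API; the class and field names are our own, not the benchmark's actual source.

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class CpuBoundBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // Burn CPU as in the micro-benchmark: 30,000 random numbers per tuple.
            double sink = 0;
            for (int i = 0; i < 30000; i++) {
                sink += Math.random();
            }
            collector.emit(new Values(input.getString(0), sink));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("payload", "burn"));
        }
    }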

To satisfy the stream balancing assumption, all operators are connected through shuffle grouping, a stream routing mechanism that evenly partitions internal streams across the receiving tasks. Besides, to generate relatively large internal streams for I/O-bound operators and thereby mimic saturated network usage, each operator implements a function to adjust its selectivity (an operator property that denotes the number of stream data items produced per data item consumed) through an external configuration file.
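Wiring operators with shuffle grouping is a one-line concern in Storm's TopologyBuilder. The sketch below reuses the CpuBoundBolt from the previous listing and substitutes Storm's bundled TestWordSpout for the benchmark's Kestrel source; the component names and parallelism values are illustrative.

    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;

    public class MicroBenchmarkWiring {
        public static TopologyBuilder linear() {
            TopologyBuilder builder = new TopologyBuilder();
            // A stand-in source; the benchmark actually reads from Kestrel.
            builder.setSpout("source", new TestWordSpout(), 1);
            // shuffleGrouping evenly partitions each internal stream across
            // the receiving tasks, which the stream balancing assumption relies on.
            builder.setBolt("op1", new CpuBoundBolt(), 2).shuffleGrouping("source");
            builder.setBolt("op2", new CpuBoundBolt(), 2).shuffleGrouping("op1");
            return builder;
        }
    }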

Word Count. Please see Section 3 for its description.

Twitter Sentiment Analysis. This topology, as shown in Fig. 9, is adapted from a mature open-source project hosted on GitHub (https://github.com/kantega/storm-twitter-workshop). It has 11 operators constituting a tree-style topology that is eight stages deep. In terms of processing logic, once a new tweet is pulled into the system (through Op1, a Kestrel spout), it is first preserved by a file writer (Op2) and examined by a language detector (Op3) to identify which language it uses. If it is written in English, a sentiment analysis operator (Op4) splits the sentence and calculates the sentiment score for the whole content using AFINN (http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010), which contains a list of words with pre-computed sentiment valence from minus five (negative) to plus five (positive). There are also several operators to compute the average sentiment result (Op5, Op6) and to rank the most frequent hashtags occurring over a specific time window (Op7 to Op11).

Fig. 8. The synthetic micro-benchmark topologies, the structures of which are referenced from R-Storm [9].

Fig. 9. The Twitter Sentiment Analysis topology.

Note that all the above-mentioned topologies process the same type of workload, as the profiling stream is generated from the same workload file, which contains 159,620 tweets in JSON format collected from 24/03/2014 to 14/04/2014. Topologies are also configured with acknowledgements enabled so as to track the complete latency at runtime.

7.1.2 Evaluation Methodology

For a quantitative comparison of different deployment plans, we introduce the metrics considered as well as the measurement method used in our experiments.

Deployment Metrics. Throughput and complete latency are the two performance metrics that evaluate the quality of a deployment. We deem a deployment plan to be performance-satisfactory as long as it results in a throughput higher than the pre-defined target whilst meeting the latency constraint.

The number of machines used is the only metric that evaluates the cost-efficiency of the proposed deployment plan. On this platform, the user budget for deployment is set to 16 Worker Nodes, which is the maximum capacity of our private cloud if we ensure that each VCPU corresponds to a physical core.

Measurement Method. The application is first deployed on the Storm cluster according to its designated plan. During this process, the number of tasks is set to be the same as the number of executors, and each Worker Node hosts only one Worker Process, which conforms to the recommended settings provided by the Storm community (http://storm.apache.org/releases/current/FAQ.html). The deployed topology executes for 15 minutes to sufficiently stabilize before any measurements are taken. To obtain an average performance result, performance data are collected every 30 seconds for 5 consecutive minutes. Therefore, each iteration of the MAPE loop has a timespan of 20 minutes.
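In Storm, these two conventions translate into straightforward configuration calls; a sketch follows, reusing the component names from the wiring listing above (the worker count of 6 is illustrative and presumes one supervisor slot per node).

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;

    public class MeasurementSettings {
        public static Config apply(TopologyBuilder builder, int parallelism) {
            // One task per executor: set the task count equal to the executor count.
            builder.setBolt("op1", new CpuBoundBolt(), parallelism)
                   .setNumTasks(parallelism)
                   .shuffleGrouping("source");

            // One Worker Process per Worker Node: with single-slot supervisors,
            // the total worker count equals the number of nodes in use.
            Config conf = new Config();
            conf.setNumWorkers(6);
            return conf;
        }
    }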

The performance measurement needs to report the maximum throughput sustained by the deployed application. To this end, the profiling environment feeds the application with a sufficiently large input stream, lifting the performance to its highest stable point before comparing it with the desired target.

For clarity, the settings of the model parameters used in deployment plan generation are summarised in Table 4.

TABLE 4
Parameter Settings Used in the Deployment Planning Model

    Parameter                                              Value
    a  (upper limit on CPU usage for an I/O bound task)    0.1
    b  (upper limit on memory usage for a Mem-bound task)  512 MB
    ar (coefficient in Eq. (6))                            1

7.2 Applicability Evaluation

In this evaluation, we use P-Deployer to deploy all five topologies towards a designated performance target: a throughput of 2,000 tweets/second and a maximum complete latency of 500 ms.

To better examine the applicability of P-Deployer, we have configured the micro-benchmark topologies with different time and resource complexities. In particular, the Linear topology includes only CPU-bound operators, so the whole topology is bound by computation; the Star topology, on the contrary, is bound by communication, being made of I/O-bound operators with comparatively large internal streams; while the Diamond topology incorporates all types of operators in the middle so as to simulate a hybrid case.

Fig. 10 demonstrates that P-Deployer is able to steadily improve the topology throughputs and finally realize the pre-defined performance target. The evaluated topologies, despite their different structures and resource complexities, show a similar scaling pattern during the deployment process: in the first iteration, the micro-benchmark topologies respectively achieve 78, 68, and 75 percent of the performance target, while the Twitter Sentiment Analysis delivers an average throughput of 1,237 tweets/second, which accounts for only 62 percent of the requirement. The following MAPE iterations essentially lead to a process of horizontal scaling. P-Deployer gradually enlarges the unit resource consumption reflected in the task profiles, resulting in higher operator parallelism and more Worker Nodes being added to the Storm cluster. The bin-packing nature of the task mapping algorithm guarantees that only the necessary number of machines is introduced and efficiently utilized in the next MAPE iteration. Taking the Linear topology as an example, it initially has a parallelism degree of (1,1,1,1), where each number, from left to right, corresponds to the number of tasks of the respective operator, running on 3 Worker Nodes; at the end of the scaling process, it spreads over 6 machines with a parallelism degree of (1,2,2,2), and P-Deployer took care not to collocate CPU-bound tasks in order to fulfil its performance requirement. We also observe that platform scaling is the major source of performance improvement. In the case of the Star topology, the performance boost in the third iteration is significantly larger (10.3x) than in the previous one, as it involves a new machine being added rather than merely adjusting the operator parallelism and task allocation.

Fig. 10. The monitored throughputs of topologies in the applicability evaluation. Each MAPE iteration corresponds to a plotted box that contains 10 readings of throughput. P-Deployer finalizes the deployment once the performance target denoted by the horizontal line is reached.

There are two reasons why the task profiles tend to be underestimated in the first place. (1) The overhead of the DSMS and the operating system in running the streaming application is not explicitly considered in our resource consumption model, due to the complexity of formulation and measurement. Therefore, as the throughput grows, the increasing overheads of acknowledgement and thread scheduling need to be amortized onto the unit resource consumption. (2) Some operators, especially those performing batch processing or window sliding at set intervals, demonstrate a periodically spiky pattern of resource usage even when processing a steady-sized stream. Though we should allocate resources to satisfy the peak need of the operator, the spiky period is very short and thus hard to capture accurately. As a compromise, we enlarge the average value in the task profiles so as to ensure the smooth flow of execution.

Another finding from these experiments is that the complete latency consistently grows as the throughput increases. In our measurements, the average complete latency of the Linear topology is 89 ms in the first iteration, but it rises to 159 ms after the deployment process is finalized. We interpret this result in the sense that the complete latency is strongly correlated with the number of unacknowledged tuples allowed in the system. As no latency violation was identified in these experiments, P-Deployer currently does not have the ability to throttle the data source. However, such a function can be readily implemented using the built-in rate-limiting mechanism (in Apache Storm, see the max.spout.pending option for details).
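In Storm, for example, this throttle is a single configuration entry; the value below is illustrative.

    import org.apache.storm.Config;

    public class ThrottleExample {
        public static Config throttled() {
            Config conf = new Config();
            // Cap the number of unacknowledged tuples in flight per spout task;
            // this bounds the complete latency at the cost of peak throughput.
            conf.setMaxSpoutPending(1000);
            return conf;
        }
    }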

7.3 Scalability Evaluation

Two separate experiments have been conducted to evaluate the scalability of P-Deployer. In the first experiment, the application to be deployed is the Linear topology, which consists of various operator types. We further extend the depth of the topology to 5 and 10 so as to generate more complicated topology structures, while the performance target of deployment remains the same as in the applicability evaluation.

In the second experiment, we attempt to deploy the Twitter Sentiment Analysis towards higher throughput targets of 3,000 tweets/second and 6,000 tweets/second, respectively.

Fig. 11 shows that the increasing depths of the Linear topology have little impact on the convergence speed of P-Deployer, i.e., the number of iterations required to achieve the desired performance. P-Deployer consistently improves the throughput by gradually scaling out the task distribution and resolving any bottlenecks caused by resource contention.

Fig. 11. The monitored throughputs of topologies in the scalability evaluation. Each MAPE iteration corresponds to a plotted box that contains 10 readings of throughput. P-Deployer finalizes the deployment once the performance target denoted by the horizontal line is reached.

However, it takes more effort for the Twitter Sentiment Analysis to realize a higher performance target, and the number of machines used does not scale linearly with it (reaching 3,000 tweets/second requires 5 nodes while realising 6,000 tweets/second requires 14 nodes). We also observe a throughput oscillation at iteration 6 when targeting the higher throughput. This result is not beyond our expectation, due to the fact that P-Deployer works on a best-effort basis and thus cannot guarantee that a performance bottleneck will necessarily be solved by an adjustment of the deployment. Some DSMS concurrency settings, such as the size of the thread pool and the number of acker tasks, also influence the topology behaviour [14], but they have not been considered in P-Deployer due to the hardness of generalization. Despite this, P-Deployer is still qualified to be a practical deployment tool in situations where the default settings of the DSMS suffice for the performance goal.

During the deployment process in which the Twitter Sentiment Analysis reaches the 6,000 tweets/second target, we assessed P-Deployer's runtime overhead, including the profiling cost (the percentage decrease in average throughput after enabling the profiling mechanism) and the running time of the task mapping and resource estimation algorithm. The results are shown in Table 5, which demonstrates that our profiling mechanism is low-overhead and that the FFD heuristic is sufficiently fast for solving the overlapping bin-packing problem.

TABLE 5
The Profiling Cost (Pc) and the Running Time of Algorithm 1 (Rt) When Deploying the Twitter Sentiment Analysis Topology

    Iteration    1       2       3       4       5       6       7
    Pc           2.68%   3.14%   3.48%   4.11%   3.72%   2.95%   3.30%
    Rt           0.213   0.245   0.312   0.338   0.331   0.351   0.359

7.4 Cost Efficiency Evaluation

7.4.1 Comparable Method

We choose a state-of-the-art platform-oriented approach, the Metis scheduler, as our comparable method; it outperforms the greedy scheduler [1] in major metrics such as throughput and resource utilization [7]. In general, the Metis scheduler uses the number of machines in the platform as the parallelism degree of each operator; it then builds the task communication graph based on the monitoring of internal streams and translates the task mapping problem into a graph partitioning problem. METIS (http://glaros.dtc.umn.edu/gkhome/views/metis) is used as an external service that solves the partitioning problem with the goal of balancing the CPU usage and bandwidth consumption across the platform.

7.4.2 Result and Analysis

In order to make the results comparable, we control the performance target of deployment and compare P-Deployer and the Metis scheduler against the minimal number of machines required to achieve the same target. The test applications include the Word Count topology and the micro-benchmark topologies with different types of operators.

Fig. 12 demonstrates that, in most cases, P-Deployer occupies fewer resources than the Metis scheduler does. As seen in the deployment of the Word Count topology, P-Deployer is able to use one less Worker Node than the Metis scheduler to reach the 6,000 tweets/second target.

Fig. 12. Comparison of P-Deployer and the Metis scheduler against the resource costs required to reach the same performance target.




When the Metis scheduler is applied, we observed that the capacities of the last three operators (capacity represents the percentage of the time in the observation window that an operator spends executing inputs) reached up to 0.808, 0.494, and 0.505, respectively, yet the CPU utilization of each Worker Node barely exceeded 35 percent. This leads to the conclusion that operator parallelism is underestimated when it is set to the number of available machines, and that insufficient operator parallelization in turn impedes high utilization of the underlying platform.

On the other hand, disregarding the operator type in task mapping also causes significant performance degradation. We compensated for the insufficient parallelism (multiplying it by 5) of sojourn time-bound operators when using the Metis scheduler to deploy the synthetic topologies, yet the cost of processing is still noticeably higher than that of P-Deployer, with 2 to 3 more machines required in each case to reach the highest target. The performance disparity is attributed to load imbalance resulting from inaccurate workload modelling. The Metis scheduler only approximates the Worker Node load by counting the number of tuples being transmitted, whereas our approach first models the resource consumption at the fine-grained task level and then deduces the workload of each machine by additively accruing the resource consumption of collocated tasks.
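The contrast can be made concrete: under the task-level model, a machine's load is simply the sum of the profiled consumptions of the tasks placed on it. A sketch with hypothetical profile fields:

    import java.util.List;

    public class MachineLoadEstimator {
        static class TaskProfile {
            double cpu;    // profiled CPU consumption (processing + transmission)
            double memory; // profiled memory consumption
            TaskProfile(double cpu, double memory) { this.cpu = cpu; this.memory = memory; }
        }

        // Deduce a Worker Node's load by additively accruing the
        // resource consumption of the tasks collocated on it.
        public static double[] estimate(List<TaskProfile> collocatedTasks) {
            double cpu = 0, mem = 0;
            for (TaskProfile t : collocatedTasks) {
                cpu += t.cpu;
                mem += t.memory;
            }
            return new double[] { cpu, mem };
        }
    }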

8 RELATED WORK

There is a rich body of literature on the deployment of streaming applications, each work associated with different optimization targets.

Some papers aim to improve throughput and resource utilization. Fischer et al. employed Bayesian optimization to tune the operator parallelization and other configurations for achieving higher throughput [14]. However, treating the streaming application as a black-box function does not properly take advantage of domain knowledge, and it results in a comparatively long convergence time. The Metis scheduler, proposed by the same group, avoids the convergence problem by building up a task communication graph and solving it with full-fledged partitioning software [7]. Nevertheless, as shown in our experiments, resource utilization and workload balancing can still suffer when the operator type and parallelism are not considered holistically. There is also a resource-aware scheduler that allows users to specify the resource demand of each task in order to optimize workload distribution and inter-node communication [9]. However, our experience suggests that the resource demand of a task strongly correlates with its stream load and communication pattern, which can only be obtained at runtime through application profiling.

Some works, on the other hand, investigated adaptive deployment to build elastic streaming applications. To maintain high resource utilization, van der Veen et al. [15] employed MAPE loops to elastically scale the streaming application along with workload changes. However, the adopted threshold method closely resembles the auto-scaling policy of Amazon Web Services (AWS) and is inadequate to model the particularity of target applications. To honour the latency constraint during scaling, Fu et al. [8] proposed a dynamic resource scheduler that tailors the deployment plan to the fluctuating workload. Their performance model is a rigorous queueing network that ignores network transmission costs, which makes it applicable only to computationally intensive streaming applications.

Workload estimation and latency modelling have also been discussed in the literature to make the adaptation process more proactive, yet the issue of cost-efficient scaling remains unsolved without optimizing the deployment holistically. Zacheilas et al. [16] predicted workload characteristics in order to choose the appropriate transitions in parallelism, but they did not consider the operator communication pattern to reduce network overhead. Li et al. [4] presented a predictive scheduling framework, utilizing Support Vector Regression to predict the complete latency under a given deployment arrangement, but they left out operator parallelization, and the proposed scheduling algorithm is essentially an exhaustive search that traverses all feasible solutions. Nevertheless, these established optimization techniques can be integrated with P-Deployer to choose the right time for scaling and to minimize the negative impacts of dynamic workload adaptation. With the knowledge of predicted workload distribution, Mayer et al. [17] investigated dynamic stream partitioning using queueing theory models and time series analysis; their work can be integrated with our operator parallelization model to dynamically adjust the operator parallelism, providing probabilistic guarantees on the buffering limit. By modelling the latency spike created by operator movements, Heinze et al. [18] proposed an elastic placement algorithm that reduces the number of latency violations; this work can be used in tandem with P-Deployer, replacing the proposed bin-packing model that solely commits to reducing the resource cost. Cardellini et al. [19] formulated optimal operator placement as an Integer Linear Programming problem; the computed result can be used to evaluate the quality of our heuristic solution. Also, to speed up the process of finding optimal solutions to resource allocation/placement problems, evolutionary algorithms [20], [21] and machine learning based frameworks [4], [22] have been investigated in the literature.

There are also several papers on reducing inter-node communication [1], [6], [10], [19] and improving system manageability and energy efficiency [2], [23]. However, all the above-mentioned efforts fall short when the deployment of a streaming application is commissioned with a specific performance target, a case commonly seen in the cloud computing context.

9 CONCLUSIONS AND FUTURE WORK

In this work, we proposed P-Deployer, an automated, cost-efficient and performance-oriented deployment framework for running streaming applications on cloud. Inspired by observations drawn from real experiments, P-Deployer adopts a Monitor-Analyze-Plan-Execute architecture working in a profiling environment that gradually scales the streaming application to approach its performance target. In each iteration, P-Deployer builds the relationship between performance behaviour and resource consumption at the fine-grained task level, decides the operator parallelism based on each operator's profile and performance demand, and then solves the task mapping and resource estimation problem as a variant of bin-packing. The evaluations demonstrated that P-Deployer is able to deploy a variety of streaming applications towards their performance targets, and that it outperforms the state-of-the-art platform-oriented approach in terms of cloud resource usage. Since the cloud environment benefits from the emerging architectures used in data centres, it is of crucial importance to make the deployment process aware of hardware heterogeneity and to allow customized resource provisioning that considers application characteristics and requirements. P-Deployer is our first attempt to advance research on this topic.

In the future, we plan to extend P-Deployer with the ability to dynamically adapt streaming applications to workload fluctuations, which includes autonomously adding/releasing resources according to varying performance requirements and taking advantage of heterogeneous resources to exhibit full auto-scaling characteristics.

REFERENCES

[1] L. Aniello, R. Baldoni, and L. Querzoni, "Adaptive online scheduling in storm," in Proc. 7th ACM Int. Conf. Distrib. Event-Based Syst., 2013, pp. 208–218.
[2] P. Basanta-Val, N. Fernández-García, A. Wellings, and N. Audsley, "Improving the predictability of distributed stream processors," Future Generation Comput. Syst., vol. 52, pp. 22–36, 2015.
[3] L. Eskandari, Z. Huang, and D. Eyers, "P-Scheduler: Adaptive hierarchical scheduling in apache storm," in Proc. Australasian Comput. Sci. Week Multiconference, 2016, pp. 1–10.
[4] T. Li, J. Tang, and J. Xu, "A predictive scheduling framework for fast and distributed stream data processing," in Proc. IEEE Int. Conf. Big Data, 2015, pp. 333–338.
[5] Y. Liu, X. Shi, and H. Jin, "Runtime-aware adaptive scheduling in stream processing," Concurrency Comput.: Practice Experience, vol. 28, pp. 1–14, 2015.
[6] J. Xu, Z. Chen, J. Tang, and S. Su, "T-Storm: Traffic-aware online scheduling in storm," in Proc. IEEE Int. Conf. Distrib. Comput. Syst., 2014, pp. 535–544.
[7] L. Fischer and A. Bernstein, "Workload scheduling in distributed stream processors using graph partitioning," in Proc. IEEE Int. Conf. Big Data, 2015, pp. 124–133.
[8] T. Z. Fu, J. Ding, R. T. Ma, M. Winslett, Y. Yang, and Z. Zhang, "DRS: Dynamic resource scheduling for real-time analytics over fast streams," in Proc. IEEE Int. Conf. Distrib. Comput. Syst., 2015, pp. 411–420.
[9] B. Peng, M. Hosseini, Z. Hong, R. Farivar, and R. Campbell, "R-Storm: Resource-aware scheduling in storm," in Proc. 16th Annu. Middleware Conf., 2015, pp. 149–161.
[10] A. Chatzistergiou and S. D. Viglas, "Fast heuristics for near-optimal task allocation in data stream processing over clusters," in Proc. 23rd ACM Int. Conf. Inf. Knowl. Manage., 2014, pp. 1579–1588.
[11] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA, USA: Freeman, 1979.
[12] A. S. Fukunaga and R. E. Korf, "Bin-completion algorithms for multicontainer packing and covering problems," J. Artif. Intell. Res., vol. 28, pp. 393–429, 2007.
[13] J. Desrosiers and M. E. Lübbecke, "Branch-price-and-cut algorithms," in Wiley Encyclopedia of Operations Research and Management Science. Hoboken, NJ, USA: Wiley, 2011, pp. 1–18.
[14] L. Fischer, S. Gao, and A. Bernstein, "Machines tuning machines: Configuring distributed stream processors with Bayesian optimization," in Proc. IEEE Int. Conf. Cluster Comput., 2015, pp. 22–31.
[15] J. S. van der Veen, B. van der Waaij, E. Lazovik, W. Wijbrandi, and R. J. Meijer, "Dynamically scaling apache storm for the analysis of streaming data," in Proc. IEEE 1st Int. Conf. Big Data Comput. Service Appl., 2015, pp. 154–161.
[16] N. Zacheilas, V. Kalogeraki, N. Zygouras, N. Panagiotou, and D. Gunopulos, "Elastic complex event processing exploiting prediction," in Proc. IEEE Int. Conf. Big Data, 2015, pp. 213–222.
[17] R. Mayer, B. Koldehofe, and K. Rothermel, "Predictable low-latency event detection with parallel complex event processing," IEEE Internet Things J., vol. 2, no. 4, pp. 274–286, Aug. 2015.
[18] T. Heinze, Z. Jerzak, G. Hackenbroich, and C. Fetzer, "Latency-aware elastic scaling for distributed data stream processing systems," in Proc. 8th ACM Int. Conf. Distrib. Event-Based Syst., 2014, pp. 13–22.
[19] V. Cardellini, V. Grassi, F. Lo Presti, and M. Nardelli, "Optimal operator placement for distributed stream processing applications," in Proc. 10th ACM Int. Conf. Distrib. Event-Based Syst., 2016, pp. 69–80.
[20] Y. Gao, H. Guan, Z. Qi, Y. Hou, and L. Liu, "A multi-objective ant colony system algorithm for virtual machine placement in cloud computing," J. Comput. Syst. Sci., vol. 79, no. 8, pp. 1230–1242, 2013.
[21] E. Feller, L. Rilling, and C. Morin, "Energy-aware ant colony based workload placement in clouds," in Proc. IEEE 12th Int. Conf. Grid Comput., 2011, pp. 26–33.
[22] X. Zuo, G. Zhang, and W. Tan, "Self-adaptive learning PSO-based deadline constrained task scheduling for hybrid IaaS cloud," IEEE Trans. Autom. Sci. Eng., vol. 11, no. 2, pp. 564–573, Apr. 2014.
[23] T. Matteis and G. Mencagli, "Keep calm and react with foresight: Strategies for low-latency and energy-efficient elastic data stream processing," in Proc. 21st ACM Symp. Principles Practice Parallel Program., 2016, pp. 1–12.

Xunyun Liu received the BE and ME degrees in computer science and technology from the National University of Defense Technology, in 2011 and 2013, respectively. He is currently working toward the PhD degree in computer science at the University of Melbourne. His research interests include stream processing and distributed systems.

Rajkumar Buyya is a professor of computer science and software engineering. He is also a Future Fellow of the Australian Research Council and the director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, University of Melbourne, Australia. He is a fellow of the IEEE.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

LIU AND BUYYA: PERFORMANCE-ORIENTED DEPLOYMENT OF STREAMING APPLICATIONS ON CLOUD 59