
Toward Reliable and Rapid Elasticity for Streaming Dataflows on Clouds

Anshu Shukla and Yogesh Simmhan
Department of Computational and Data Sciences

Indian Institute of Science, Bangalore 560012, India
Email: [email protected], [email protected]

Abstract—The pervasive availability of streaming data is driving interest in distributed Fast Data platforms for streaming applications. Such latency-sensitive applications need to respond to dynamism in the input rates and task behavior using scale-in and -out on elastic Cloud resources. Platforms like Apache Storm do not provide robust capabilities for responding to such dynamism and for rapid task migration across VMs. We propose several dataflow checkpoint and migration approaches that allow a running streaming dataflow to migrate without any loss of in-flight messages or their internal task states, while reducing the time to recover and stabilize. We implement and evaluate these migration strategies on Apache Storm using micro and application dataflows for scaling in and out on up to 2-21 Azure VMs. Our results show that we can migrate dataflows of large sizes within 50 sec, in comparison to Storm's default approach that takes over 100 sec. We also find that our approaches stabilize the application much earlier and there is no failure and re-processing of messages.

1. Introduction

The rapid growth of observational and streaming data on physical systems and online services is driving the need for applications that operate on these streams in near real-time. Traditional stream sources, like micro-blogs and social networks, financial transactions and web logs, are being complemented by sensor observations from Internet of Things (IoT) domains, such as smart power grids, personal fitness devices, and autonomous vehicles. Such streams are analyzed for live visualization in dashboards or to perform online decision making, such as to trigger load curtailment in power grids or to alert users to suspicious transactions [1].

Stream processing applications need to operate with low latency to respond rapidly to evolving situations, and need to scale with the number and the rate of the streams. Fast data platforms, also called Distributed Stream Processing Systems (DSPS), offer a composition environment to design such applications as a dataflow graph of user-defined tasks, and execute them on distributed resources such as commodity clusters and Clouds. Frameworks such as Apache Storm, Flink and Spark Streaming are popular to support this velocity dimension of Big Data [2].

Streaming applications are sensitive to dynamism – be it changes in the input stream due to sampling, in the tasks' behavior and resource requirements, or in the Virtual Machine's (VM) performance due to multi-tenancy. Such variations can cause the dataflow's performance (e.g., latency, supported throughput) to be affected, and violate the application's Quality of Service (QoS) requirement. While fast data platforms are designed to scale, they are less responsive to such dynamism and have limited ability to change the dataflow's schedule at runtime. But this feature is essential to leverage the elasticity offered by Cloud VMs to respond to changing situations, and to efficiently utilize pay-as-you-go Cloud resources, say, by consolidating tasks from many under-utilized VMs to fewer well-utilized VMs.

E.g., Apache Storm, a popular open-source stream processing platform from Twitter, uses the R-Storm scheduler for resource-aware scheduling when the dataflow is submitted [3]. However, any changes in the stream, dataflow or VMs' performance need the user's intervention to "rebalance" the placement of tasks onto the same or a different set of VMs. A key challenge is to enact this rebalance such that: (1) there is no loss of messages or task states, and (2) it is done rapidly with a minimal turnaround time 1. The former ensures consistency and reliability, while the latter is important for mission-critical applications that cannot suffer prolonged down-time during this rebalance.

Both of these are lacking in contemporary DSPS. They require the dataflow to be halted before rebalancing it, causing message and task state loss in the process. Platforms like Storm and Flink have robust mechanisms for replaying lost messages, and for regularly checkpointing the state of tasks in the dataflow. These can be leveraged to ensure reliability after a rebalance. However, these fault-tolerance methods tend to be disruptive – message or machine losses are less frequent, while a planned rebalance can happen more often. Prior research like ElasticStream [5] plans to minimize Cloud costs by dynamically adjusting the resources required with the input rates, while Stela [6] does on-demand scaling for Storm while optimizing the throughput and limiting application interruption. Others [7] perform incremental migration to maintain states while scaling, minimizing the amount of state transfer between hosts. But none of these address message reliability and state handling during the migration.

1. A related but separate problem is to determine the new resource allocation for the dataflow (number and sizes of VMs) and the new mapping of its tasks onto the VMs. This is outside the scope of this paper, but has been examined elsewhere [4]. Having a new schedule is a precursor to the dynamic enactment of the schedule, which we target in the current paper.



Figure 1: Example of migration of the Star dataflow from 5×2-core VMs to 2×4-core VMs. 'S' indicates a stateful task.


In this paper, we propose mechanisms to dynamically enact the rescheduling and migration of tasks in a streaming dataflow from one set of VMs to another, reliably and rapidly. Specifically, we make the following contributions:

1) We discuss current rebalance capabilities of Storm, as an exemplar DSPS, and motivate the need for better approaches to migration of streaming dataflows (§ 2).

2) We propose two novel migration strategies, Drain-Checkpoint-Restore (DCR) and Capture-Checkpoint-Resume (CCR) (§ 3), besides a baseline approach, that are implemented on Storm.

3) We introduce metrics to evaluate these strategies (§ 4), and evaluate the performance of the approaches for realistic dataflows on Storm within Azure Cloud (§ 5).

Finally, related work is reviewed in § 6 and our conclusions and future work are presented in § 7.

2. Background and Motivation

In this section, we motivate the need for dynamic migration of streaming applications across elastic Cloud resources. We also provide background on Apache Storm's reliability and rebalancing capabilities, as a representative DSPS, highlighting the gaps in the existing capabilities.

Streaming dataflow applications are composed as a directed graph of tasks, with event stream(s) initiated at one or more source tasks from external sources, processed and streamed through additional tasks, and finally terminating in one or more sink tasks that may persist or publish the output stream(s). Fig. 1 shows a Star dataflow with 7 tasks, A-G, with one source (green, A) and one sink task (red, G) [3]. These tasks are active at all times, and the user-logic in a task is executed for each event as it arrives on the input stream. Tasks can be stateful, where their execution depends on and can update a local in-memory state. E.g., B and F are stateful tasks, and may maintain, say, a count of events seen or a window of events for aggregation.

DSPS like Storm coordinate the resource allocation, placement and execution of the dataflow on distributed resources. Typically, VMs available within the shared DSPS cluster are divided into logical resource slots, and tasks of the dataflow are placed in these slots at deployment time. Multiple instances of a task may also be present, based on the degree of data parallelism required, and can share the same slot. Fig. 1 (left) shows the 7 tasks of the Star dataflow with one instance each being placed on 7 slots spread across 5 VMs, each having 2 slots of 1 core each.

Dynamism in the input event rates, the resources consumed by the tasks, the QoS requirements of the application, or the performance of the VMs can cause the initial resource allocation (number and size of VMs) and placement (mapping of tasks to slots) to become sub-optimal. Then, some or all tasks in the dataflow will need to be migrated to other slots in an independent or an intersecting set of VMs. This includes consolidation to fewer VMs to reduce costs and improve locality/latency, scale-out to more VMs to respond to increased resource needs, or load-balancing the tasks on the same set of VMs. E.g., Fig. 1 shows a scaling-in of the 7 tasks from 5×2-core VMs with 70% utilization to 2×4-core VMs with an 87.5% utilization, a lower billing cost, and also a lower latency due to fewer network hops. Such rebalancing needs are frequent for latency-sensitive streaming applications running on elastic Cloud resources that are paid for by the minute. Two key requirements when performing such rebalancing are the reliability of messages and task states, and the rapidity of completing the migration so that the new deployment stabilizes.
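For the example in Fig. 1, these utilization figures follow directly from the ratio of occupied slots (one per task) to available cores:

    7 tasks / (5 VMs × 2 cores) = 7/10 = 70%    before the scale-in
    7 tasks / (2 VMs × 4 cores) = 7/8  = 87.5%   after the scale-in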

Current DSPS expect the users to decide when to perform such a rebalance, and this typically requires the dataflow to be stopped and restarted with the updated schedule. E.g., Storm's rebalance command allows users to scale-out or -in the number of worker slots assigned to a running dataflow. However, tasks that are being migrated will lose their state and any messages in their input queue. Users can specify a timeout duration, during which Storm pauses the source task(s) so that no new messages are emitted, and in-flight messages may flow through the dataflow. Users may under- or over-estimate this timeout, causing messages to be lost or the dataflow to be idle, respectively. After the timeout, tasks being migrated are killed and respawned on the new workers, and the source tasks resume generating the input stream. Meanwhile, tasks not being migrated continue to execute while buffering messages in their queues.

There are two capabilities of Storm that can mitigate the impact of lost messages and task states: message "acking" and checkpointing. These can ensure reliability but not performance. Storm can guarantee at-least-once message processing using an Acknowledgment Service 2. Each event generated at the source task registers its 64-bit unique event ID with this acking service, and this forms the root of a causal tree that is maintained. This event is also temporarily cached at the source. Any downstream events causally generated by tasks processing this root event will add their IDs to the tree and acknowledge the processing of their parent event by the task. The tree itself is compactly maintained by the service using an XOR hash of each event ID with the root ID, once when the event is added and once when it is acknowledged. Hence, when all causal events are acked, the tree's hash will become zero as each ID is XORed twice. Storm checks if the hash for a root event has not become zero within a specified timeout, upon which the event is replayed by the source task. Events whose hashes become zero are periodically discarded from the cache.

2. Guaranteeing Message Processing, Apache Storm Version: 1.0.3


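To make the XOR bookkeeping concrete, the following is a small, self-contained sketch of the idea; the AckTracker class and its methods are illustrative stand-ins, not Storm's actual acker implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative XOR bookkeeping for one root event's causal tree. */
public class AckTracker {
    // root event ID -> running XOR of every event ID added to or acked in its tree
    private final Map<Long, Long> hashes = new HashMap<>();

    /** The source registers a new root event; the hash starts as the root's own ID. */
    public void registerRoot(long rootId) {
        hashes.put(rootId, rootId);
    }

    /** A task emits a child event derived from this tree: XOR its ID in. */
    public void addChild(long rootId, long childId) {
        hashes.computeIfPresent(rootId, (k, h) -> h ^ childId);
    }

    /** A task finishes processing an event of this tree (the root or a child): XOR its ID again. */
    public void ack(long rootId, long eventId) {
        hashes.computeIfPresent(rootId, (k, h) -> h ^ eventId);
    }

    /** True once every ID has been XORed twice (added and acked), i.e., the tree is fully processed. */
    public boolean fullyProcessed(long rootId) {
        Long h = hashes.get(rootId);
        return h != null && h == 0L;
    }
}
```

A root event whose hash does not return to zero within the acking timeout is replayed by the source task, as described above.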

The recent Storm versions, since v1.0.3, support checkpointing and recovery of the state of tasks 3. Users explicitly implement stateful tasks with interfaces to acquire and restore their state. The framework uses these for periodic distributed checkpointing of tasks, similar to a three-phase commit. A special source task sends a wave of checkpoint messages that flow through the dataflow and trigger these methods in each task, causing state transitions. The PREPARE message is sent when a wave is starting, and each task's prepare method assembles its internal state. Once all PREPARE messages are acked, the COMMIT message is sent and causes the task's commit method to persist the prepared state to an external key-value store (Redis). A ROLLBACK message is sent if the prepare message was not acked for any task. Once committed, the checkpointed states can be restored in future by sending an INIT message. On receiving this, a task's init method is passed the last committed state from the external store.
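As a rough sketch of the task-side contract (the interface and class below are hypothetical, not Storm's actual stateful-bolt classes), the platform acquires the state on PREPARE, persists it to the external store on COMMIT, and restores it on INIT:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical task-side hooks: the platform calls snapshotState() on PREPARE,
 *  persists the returned snapshot (e.g., to Redis) on COMMIT, and calls
 *  restoreState() with the last committed snapshot when INIT arrives. */
interface CheckpointableTask {
    Map<String, Long> snapshotState();
    void restoreState(Map<String, Long> snapshot);
}

/** Example: a counting task whose only state is a per-key counter. */
class CountingTask implements CheckpointableTask {
    private Map<String, Long> counts = new HashMap<>();

    void onEvent(String key) {                  // user logic, called once per input event
        counts.merge(key, 1L, Long::sum);
    }

    @Override
    public Map<String, Long> snapshotState() {
        return new HashMap<>(counts);           // copy, so processing can continue safely
    }

    @Override
    public void restoreState(Map<String, Long> snapshot) {
        counts = new HashMap<>(snapshot);       // reload the last committed state
    }
}
```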

Using these two features, one can perform reliable rebalancing out-of-the-box within Storm, which we term as the Default Storm Migration (DSM). Here, we can initiate Storm's rebalance command immediately on a user request for the new schedule. This will kill all migrating tasks and cause in-flight events to be lost, but the acking service will replay the lost events once the rebalance is completed. Similarly, the checkpointing service will restore the tasks' states from the last periodic checkpoint after the tasks are migrated.

While one is assured of reliability, this comes at the cost of performance when using DSM. The number of lost events and the state restoration can be disruptive, and delay the dataflow from resuming its stable execution. The dataflow snapshot effectively rolls back to the older of the last successfully processed message or the last successful checkpoint. The granularity of event recovery is a root event in the causal tree. So, once the migration is complete, the root events for all in-flight events that were lost will be replayed from the source task, and even causal events that were successfully processed earlier will be regenerated and reprocessed. This also means that the old replayed events will be interleaved with the new events being generated by the source tasks immediately after the dataflow is restarted and the source tasks unpaused. While Storm and most DSPS do not guarantee event ordering, this interleaving will cause a significant number of events to be out of order.

Both acking and checkpointing need to be on all the time. Acking is done for all events, and the checkpoint interval is periodic (30 secs, by default) and has to be configured to balance the operational costs and the rollback loss for a dataflow. Hence, they also pose additional overheads if fault-tolerance is a concern only during an active migration and not during regular operations [8], [9]. This can be punitive during normal operations if the input rates are high [10]. One advantage of DSM is that the new schedule is initiated immediately by killing the migrating tasks, with the consequences on recovery pushed to after the rebalance has completed.

3. Storm State Management, Apache Storm, Version: 1.0.3-SNAPSHOT


Figure 2: Sequence diagram for DCR migration operations


3. Dataflow Migration Strategies

In this section, we propose two approaches that address the performance limitations of the baseline DSM strategy. These actively manage the checkpointing, acking and rebalancing to improve the efficiency and timeliness of the dataflow migration, while still guaranteeing reliability. We discuss these conceptually as generic strategies for DSPS, and offer linkages specific to Storm by extending the framework's capabilities.

3.1. Drain, Checkpoint and Restore (DCR)

Conceptually, our DCR strategy performs three operations to address some of the performance limitations of DSM. One key intuition is to pause the source tasks' execution and let the in-flight messages execute to completion across the dataflow. This effectively drains the dataflow without any loss of messages before the tasks are migrated, and there are no failed messages to be replayed later. Further, DCR also performs a just-in-time (JIT) checkpointing wave after the drain and before the task migrations are done, rather than having checkpointing enabled periodically. This ensures that the latest state is checkpointed, and we restore the most recent state after the rebalance. Lastly, message reliability is enabled only for the checkpoint messages rather than for all dataflow events. The last two also avoid the overheads for reliability if the user does not require them for normal operations.

In the context of Storm, the drain and checkpointing phases are slightly involved and are discussed here. Enabling acking only for the reliability of the checkpoint messages is simply done by assigning an event ID to the checkpoint events while emitting them, for tracking by the acker service. Fig. 2 shows the sequence diagram of operations performed as part of our DCR strategy and Fig. 3a shows the architecture interactions.

When a migration enactment request is received from the user, the schedule planning has already taken place and the new mapping of tasks to VMs decided.



Figure 3: Flow of user and checkpoint events, and operations for our two strategies, for a sequential dataflow with 4 tasks.

We first pause the source tasks from emitting new events. We override the logic for the Checkpoint Source Task and initiate a checkpoint wave by sending a PREPARE event. By default, these events flow along the same edges as the original dataflow. Since we have paused the source task, these PREPARE events will be the last event in the input queue for every task in the dataflow. The input queue for each Storm worker is single-threaded. So when the task sees the PREPARE event, it knows that it has processed all in-flight data events and the task holds the latest state. Intuitively, the PREPARE event is the rearguard that sweeps behind the data events and guarantees that the dataflow has been drained.

The user's task logic extends a Storm platform task class, and this platform logic handles the checkpoint events. When the PREPARE event is seen, the default platform logic calls the user logic to retrieve a snapshot of the current task state. It then forwards the event to its downstream children, and then acks the completion of its processing of the PREPARE event. This happens for every task, and once it reaches the last sink task, all in-flight events have been processed, all task states have been snapshotted, and the acking service has been notified of this.

After the prepare is completed, our checkpoint task initiates a COMMIT event that flows through the dataflow in a similar manner. The receipt of this message causes the platform logic in each task to persist the task's state snapshot to a Redis distributed store. This event too will be forwarded, and then acked by each task. When all COMMIT events have been acked, the checkpoint task invokes the native rebalance command of Storm with a zero timeout. Tasks that need to be migrated are killed and restarted on new slots, and rewired together to form the original dataflow. There will be no messages in-flight to be lost.

Once the rebalance completes, the checkpoint task sends an INIT event through the dataflow to initialize the restarted tasks. This event is again forwarded through the dataflow and serves as the vanguard message in the rebalanced dataflow. When received, the platform logic in each task will retrieve the checkpointed task state from Redis and call the user's logic to restore the state. The INIT events are acked after forwarding as well. It may so happen that a task is not ready when an INIT event is forwarded to it within the 30 secs default acking timeout. To avoid rolling back the rebalance when INIT events are not acked, the checkpoint task emits duplicate INIT events every 1 sec, and the platform logic at a task skips processing this event if the task has already restored its state. Once all tasks have acked an INIT event, the source task is unpaused and resumes generating messages through the rescheduled dataflow.
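Putting these steps together, the DCR control flow driven by the checkpoint source task can be summarized by the sketch below; all of the types and method names are hypothetical stand-ins for our Storm extensions, not actual Storm APIs.

```java
/** Simplified DCR coordinator, as driven by the checkpoint source task.
 *  Dataflow, AckService and Schedule are hypothetical stand-ins, not Storm classes. */
public class DcrCoordinator {
    private final Dataflow flow;    // pause sources, emit checkpoint events, trigger rebalance
    private final AckService acker; // tracks acks only for checkpoint events (IDs set on emit)

    public DcrCoordinator(Dataflow flow, AckService acker) {
        this.flow = flow;
        this.acker = acker;
    }

    public void migrate(Schedule newSchedule) throws InterruptedException {
        flow.pauseSources();                                  // 1. stop new input events

        long prepareId = flow.sendAlongDataflow("PREPARE");   // 2. drain: last event in every queue,
        acker.awaitAck(prepareId);                            //    each task snapshots its state

        long commitId = flow.sendAlongDataflow("COMMIT");     // 3. each task persists its snapshot
        acker.awaitAck(commitId);                             //    (e.g., to Redis)

        flow.rebalance(newSchedule, 0);                       // 4. kill, respawn and rewire tasks

        while (!flow.allTasksInitialized()) {                 // 5. resend INIT every second; tasks
            flow.sendAlongDataflow("INIT");                   //    that already restored their state
            Thread.sleep(1000);                               //    skip the duplicates
        }

        flow.unpauseSources();                                // 6. resume the input stream
    }
}

interface Dataflow {
    void pauseSources();
    void unpauseSources();
    long sendAlongDataflow(String checkpointEvent);   // returns the event ID registered for acking
    void rebalance(Schedule s, int timeoutSecs);
    boolean allTasksInitialized();
}

interface AckService {
    void awaitAck(long eventId) throws InterruptedException;
}

class Schedule { /* new task-to-slot and slot-to-VM mapping */ }
```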

DCR has several advantages over DSM. It avoids event loss by draining the dataflow completely before the rebalance. As a result, there is no need to replay lost messages, which avoids their reprocessing costs. There is also a clear boundary between events processed by the dataflow before and after the migration, with no interleaving of old and new messages. We also avoid the costs of periodic checkpointing by doing a JIT wave. One downside of the DCR approach is the time spent in draining the dataflow, during which only some of the tasks are processing events while the rest are idle. This depends on the number of in-flight messages, and the checkpoint can start only after these prior events are processed. Further, input events that queue up within the source tasks have to flow through the rebalanced dataflow to catch up. We address some of these in the next proposed strategy, CCR.

3.2. Capture, Checkpoint and Resume (CCR)

We address the key limitation of the DCR approach, which is the time spent in draining the dataflow of all in-flight messages before the checkpoint starts. There are two aspects to this drain time. (1) The checkpoint messages flow incrementally through the dataflow to guarantee that they are the last event, and hence take additional time to reach the sink tasks from the source. (2) All in-flight messages have to be fully processed by all dataflow tasks before the rebalance can begin. Both of these incur additional latency.

We address the first challenge by directly broadcasting checkpoint events from the source task to each task in the dataflow. This hub-and-spoke model allows the checkpoint events, specifically PREPARE and INIT, to be directly placed at the end of each task's input queue. As a result, we avoid having to pass the event through all the preceding tasks and their processing logic. We retain the sequential wiring for COMMIT events to ensure that all in-flight user events in the input queues have been handled.
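For illustration, such a broadcast channel can be wired in a Storm topology with an all-grouping from the checkpoint spout to every bolt, alongside the regular dataflow edges. The spout and bolt classes and the stream name in this fragment are assumed placeholders, not the classes used in our implementation:

```java
import org.apache.storm.topology.TopologyBuilder;

// Fragment: UserSourceSpout, CcrCheckpointSpout and UserTaskBolt are assumed
// user-defined classes; "checkpoint-stream" is an assumed stream name.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("source", new UserSourceSpout());
builder.setSpout("checkpoint", new CcrCheckpointSpout());

// A regular dataflow edge plus a broadcast subscription to the checkpoint stream,
// so PREPARE and INIT events land directly in each task's input queue.
builder.setBolt("taskA", new UserTaskBolt("A"))
       .shuffleGrouping("source")
       .allGrouping("checkpoint", "checkpoint-stream");
builder.setBolt("taskB", new UserTaskBolt("B"))
       .shuffleGrouping("taskA")
       .allGrouping("checkpoint", "checkpoint-stream");
```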

We address the second problem by processing only the one possible event that a task is currently executing, and capturing all other messages that are in its input queue, for every task in the dataflow.


We also do not emit any events downstream after processing, and instead capture the output events as well. This effectively takes a snapshot of all in-flight messages and pauses further flow through the dataflow. The most time that this will take is the time taken by the slowest task to drain its local input queue until the broadcasted PREPARE message is seen. In contrast, DCR takes time for every in-flight event across all task queues to be processed by every downstream task in the dataflow. This snapshot of the input and output events by CCR for each task is appended to the state of the task, and restored into the input and output queues after the dataflow is rebalanced. This allows the dataflow to resume the execution of the in-flight events.

These have to be carefully coordinated to guarantee consistency and reliability. We discuss these here, along with their implementation within Storm, as illustrated in Fig. 3b. Besides the checkpoint source task, we also extend the base platform logic for each task for the CCR strategy.

The series of steps is similar to DCR. When the dataflow starts, we wire the checkpoint source task to every task in the dataflow as a broadcast channel. When the user's migration request is received, we pause the source task(s) and send a PREPARE event on the broadcast channel to every task. This event is received and placed at the end of the input queue. Each task continues to process user events received ahead of the PREPARE in its input queue, and emits output events as well after processing. Hence, the PREPARE event may be at any position within the input queue.

When the PREPARE event is processed from the top of the queue by our platform logic in the task, we enable a capture flag. This flag ensures that future events seen on the input queue are added to a pending event list without being processed, and this list is appended to the task's state. The PREPARE event is then acked, but not forwarded downstream.

When the checkpoint task receives acks from all tasks for the PREPARE event, it sends a COMMIT event, but as part of the dataflow's wiring. This causes the COMMIT to sweep through the dataflow, and it is guaranteed to be the last event in the input queue for every task. On receipt of this event, our platform logic at a task persists the task's user logic state as well as the pending event list to Redis. The COMMIT is forwarded downstream, and then acked. Once all tasks have acked the COMMIT event, we have successfully captured all in-flight messages and the user's task state.

As for DCR, we then initiate Storm's rebalance command with a zero timeout, and once that is done, broadcast an INIT event from the checkpoint task to all tasks in the rebalanced dataflow. This will be the first event in the input queue for the tasks in the dataflow. Now, the goal is not just to restore the tasks' user logic state but also for the captured events to resume execution. When the INIT event is seen, our platform logic at each task fetches the task's state from Redis. The user state is passed to the initializer of the user task. The pending event list is then replayed locally to the task logic for processing, and the generated output events are sent to the downstream tasks. The INIT event is acked by the task, and then all events in the pending list are processed. When all acks for the INIT event are received, we unpause the source task(s) in the dataflow to resume the generation of new events.
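The per-task capture and resume behavior can be summarized with the following simplified sketch; the event, snapshot and store types are hypothetical stand-ins for the extended Storm platform classes described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Simplified per-task platform logic for CCR (illustrative, not the actual implementation). */
class CcrTaskWrapper {
    enum EventType { DATA, PREPARE, COMMIT, INIT }

    static class Event {
        final EventType type; final Object payload;
        Event(EventType type, Object payload) { this.type = type; this.payload = payload; }
    }

    static class Snapshot {
        final Object userState; final List<Event> pending;
        Snapshot(Object userState, List<Event> pending) { this.userState = userState; this.pending = pending; }
    }

    interface StateStore { void put(String taskId, Snapshot s); Snapshot get(String taskId); }

    interface UserTask {
        Object snapshotState();
        void restoreState(Object state);
        void execute(Event e, Consumer<Event> emit);   // user logic; emits derived events
    }

    private final String taskId;
    private final UserTask user;
    private final StateStore store;                     // e.g., Redis
    private boolean capturing = false;
    private final List<Event> pending = new ArrayList<>();

    CcrTaskWrapper(String taskId, UserTask user, StateStore store) {
        this.taskId = taskId; this.user = user; this.store = store;
    }

    void onEvent(Event e) {
        switch (e.type) {
            case PREPARE: {                 // broadcast: may sit anywhere in the input queue
                capturing = true;           // stop processing, start capturing pending events
                ack(e);
                break;
            }
            case COMMIT: {                  // flows along dataflow edges, so it is the last event
                store.put(taskId, new Snapshot(user.snapshotState(), new ArrayList<>(pending)));
                forward(e);
                ack(e);
                break;
            }
            case INIT: {                    // broadcast after rebalance: first event in the queue
                Snapshot s = store.get(taskId);
                user.restoreState(s.userState);
                capturing = false;
                ack(e);
                for (Event old : s.pending) {
                    user.execute(old, this::emitDownstream);   // resume captured in-flight events
                }
                break;
            }
            default: {
                if (capturing) pending.add(e);                 // capture instead of processing
                else user.execute(e, this::emitDownstream);
            }
        }
    }

    private void emitDownstream(Event e) { /* send to downstream tasks' input queues */ }
    private void forward(Event e)        { /* forward a checkpoint event along dataflow edges */ }
    private void ack(Event e)            { /* notify the acker service */ }
}
```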

As can be seen, this CCR approach addresses the shortcomings of the DCR strategy while retaining several of its advantages. We reduce the drain time of DCR significantly, allowing the rebalance to be enacted more quickly and with fewer messages queuing up at the source tasks. Intuitively, CCR overlaps the dataflow drain time of DCR with the dataflow refill time after the rebalance to offer benefits. But it does not eliminate the drain time completely like DSM. We do pay an additional overhead to send the state of the captured events to Redis and to restore them, but this is still cheaper than replaying/reprocessing them and their ancestors in the causal tree, as DSM does.

4. Performance Metrics

We propose several performance metrics that should be considered when evaluating the effectiveness of these strategies. Since all these approaches guarantee reliability and consistency, that is not listed as a separate metric.

1) Restore Duration: This is the time taken from the start of the user-initiated migration request to the first message being seen in any of the sink tasks. During this period, there will be no output events that come out of the dataflow (i.e., the output throughput is 0). This is applicable to all strategies.

2) Drain/Capture Duration: This is the time duration for the DCR and CCR strategies during which the dataflow is being drained, and the task and message states are being persisted, after the migration request is received from the user. After this duration, Storm's rebalance command is initiated. This time is not applicable (i.e., 0) for DSM.

3) Rebalance Duration: This is the time taken for Storm's rebalance command to complete. This initiates the kill of the tasks being migrated, and their redeployment on new machines. When completed, the tasks of the dataflow are being started on the new VMs and waiting to be initialized with INIT events.

4) Catchup time: This is the time point when all old messages that had entered the dataflow before the migration was initiated have been successfully processed and emitted from the sink of the dataflow after its migration. This is relevant only for DSM and CCR, and not for DCR since it drains all old events before the migration.

5) Recovery time: This is the time point when all new messages that entered the dataflow after the migration and failed due to timeouts have been successfully processed and emitted from the sink of the dataflow. After this point, we do not see any loss/recovery of new messages due to the migration. This is only applicable for DSM as there will not be any failed messages in DCR and CCR.

6) Rate Stabilization time: This is the time point after which the output message rate of the dataflow remains stable after the migration, and consistent with the expected stable input rate. We define stability as being achieved when the observed output rate is sustained within 20% of the expected output rate for 60 secs. The start of this stable time window indicates stabilization (see the sketch after this list).



Figure 4: Micro and Application DAGs used in experiments. Cumulative input to each task is indicated within each.


7) Message Loss/Recovery Count: This is the number of messages that were lost and recovered after the migration due to killing the dataflow or acking timeouts.
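As referenced under the Rate Stabilization metric above, the following is a minimal sketch of how the stabilization instant can be computed from a log of output-rate samples; the one-second sampling interval and the method names are assumptions of this sketch.

```java
/** Computes the Rate Stabilization instant from per-second output-rate samples,
 *  using the 20% band / 60 sec rule defined above. Returns the second (relative
 *  to the start of the samples) at which the stable window begins, or -1 if the
 *  rate never stabilizes. */
public class StabilizationMetric {
    public static int stabilizationTime(double[] outputRatePerSec, double expectedRate) {
        final int windowSecs = 60;
        final double tolerance = 0.20 * expectedRate;
        int stableRun = 0;
        for (int t = 0; t < outputRatePerSec.length; t++) {
            if (Math.abs(outputRatePerSec[t] - expectedRate) <= tolerance) {
                if (++stableRun >= windowSecs) {
                    return t - windowSecs + 1;    // start of the 60 sec stable window
                }
            } else {
                stableRun = 0;                    // a sample outside the band resets the window
            }
        }
        return -1;
    }
}
```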

5. Results

We validate our proposed migration approaches, DCR and CCR, on the Apache Storm DSPS, and compare them with the default Storm migration (DSM).

Implementation. We implement the migration strategies on Storm v1.0.3. For DCR and CCR, our custom checkpoint source task is implemented by overriding the CheckpointSpout class. For CCR, we further modify the TopologyBuilder class to automatically create the broadcast wiring from the checkpoint source to all other tasks. We also modify the StatefulBoltExecutor class that each user task extends, to support the capture and resume capabilities of CCR. For DSM, we enable periodic checkpointing with the default interval of 30 secs, and enable acking for all events. We use Storm's rebalance command with a timeout of 0 in all cases. Redis v3.2.8 is used to persist the checkpoints using native Storm bindings.

System Setup. A Storm cluster is deployed on Microsoft Azure D-series VMs in the Southeast Asia data center. The type and number of VMs vary with the experiment (Table 1) and range from 2-21 VMs with 1-4 cores each. Each resource slot of Storm runs a distinct task instance, and is assigned a 1-core Intel Xeon E5 v3 CPU @ 2.4 GHz with 3.5 GB RAM. A 50 GB SSD and 1 Gbps Ethernet are shared by all slots. Redis runs on a separate D3 VM with 4 cores.

Application Setup. We use two types of streaming dataflows in our experiments, micro-DAGs and application DAGs, as illustrated in Fig 4. The micro-DAGs capture common dataflow patterns seen in streaming applications, and are often used in the literature [3], [6], [11]. Linear, Diamond and Star respectively capture a sequential flow, a fan-out/-in, and a hub-and-spoke pattern, with 5 user tasks each, besides a source and a sink. We use two application DAGs with structures based on real-world streaming applications. Traffic [12] analyzes the traffic patterns from GPS sensor streams, and Grid [1] does predictive analytics over electricity meter and weather event streams from Smart Power Grids. For simplicity and reproducibility, we use a dummy task logic with a sleep time of 100 millisecs for all the tasks since it is orthogonal to the behavior of the strategies.

All tasks have a selectivity of 1:1, i.e., one output event is generated for one input event. In our experiments, the source task generates synthetic events at a fixed 8 events/sec rate, which is 20% less than the 10 events/sec peak supported rate per task instance, given a 100 ms task latency. The input rate at each task goes up to 32 events/sec for our DAGs (Fig 4). We assign one task instance (thread) for each incremental 8 events/sec of input rate to a task, with each task instance allocated one exclusive resource slot for execution.
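As a quick check, the parallelism settings follow directly from these numbers:

    peak rate per task instance = 1 / (100 ms task latency) = 10 events/sec
    source input rate           = 8 events/sec, i.e., 20% below the peak
    instances for a task with a 32 events/sec cumulative input = ⌈32 / 8⌉ = 4, each on its own slot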

Experiment Setup. We evaluate the migration strategies for the two most common elasticity scenarios in Clouds: scale-in and scale-out of the number of VMs. For scale-in, we initially deploy the dataflows on n D2 VMs with 2 slots each, and then migrate the dataflow to ⌈n/2⌉ D3 VMs with 4 slots each. In the scale-out experiments, we go from ⌈n/2⌉ D2 VMs with 2 slots to n D1 VMs with 1 slot each. The total number of slots used does not change, just the VMs they are packed on. The source and sink are both assigned to a single 4-slot VM. They are not migrated, to allow logging of end-to-end statistics without time-skews. Storm's default round-robin scheduler is used to map a task instance to an available VM slot, during the initial deployment and on rebalance.

Each experiment is run for 12 mins and the user migration is initiated 3 mins after the dataflow submission to ensure a stable start. We log the timestamps of checkpoint and user events from Storm and on the VMs to help evaluate the metrics that we have proposed.

Table 1: Tasks, slots and VMs for the Dataflows

DAG     | Tasks∗ | Task Instances (Slots) | Default #VMs w/ 2 slots | Scale-in #VMs w/ 4 slots | Scale-out #VMs w/ 1 slot
Linear  |   5    |   5                    |   3                     |   2                      |   5
Diamond |   5    |   8                    |   4                     |   2                      |   8
Star    |   5    |   8                    |   4                     |   2                      |   8
Grid    |  15    |  21                    |  11                     |   6                      |  21
Traffic |  11    |  13                    |   7                     |   4                      |  13

∗ Excludes the Source and Sink tasks that are on a separate 4-core VM.



Figure 5: Performance time for different strategies: (a) Scale-in, (b) Scale-out. Bars stack the Restore Time, Catchup and Recovery durations.

5.1. Analysis

We evaluate and compare the three migration strategies, DSM, DCR and CCR, based on the proposed metrics, for the 5 dataflows and the two scaling scenarios outlined above.

Fig. 5 shows a stacked bar plot with the time taken (Y axis) for restore, catchup and recovery for the 3 strategies and 5 DAGs. These three are user-facing quality metrics that are common to all approaches, and are visible sequentially from the migration request time. We see that the Restore time is consistently the least for CCR, followed by DCR and DSM. This holds across scale-in or out and for all dataflows. E.g., for the scale-in experiment for Grid, CCR takes 15 sec, DCR 41 sec, and DSM 91 sec.

DSM has no drain time; the cause of its delay is the INIT events that need to be sequentially sent to each task after the rebalance. Further, we observe several cases where the initial INIT events time out without acking due to the tasks not being active yet, and are resent after a 30 sec timeout. This is seen in the restore time growing in ≈ 30 sec jumps, with each new wave of timed-out INITs required. While DCR also sends the INIT events sequentially, we aggressively resend them every 1 sec. While this causes duplicate events (that are ignored if a task is already initialized), these are few enough to justify the benefits of a lower initialization delay. As a result, it is able to restore all tasks more quickly than DSM, despite spending additional time draining the dataflow. CCR broadcasts the INIT events, with the same 1 sec resend logic, and initiates the execution of the tasks even more quickly. E.g., in the scale-out of the Grid dataflow, the first INIT after the user-initiated migration is received by a task at 31 sec using DCR, and at 17 sec for CCR.


Figure 6: Number of failed and replayed messages for DSM: (a) Scale-in, (b) Scale-out.

The difference is due to the drain time. Both DSM and DCR also have overheads for filling the dataflow after the initialization, while CCR resumes from the prior in-flight state.

The total migration time for DSM in Fig 5 is much higher for the application DAGs than for the micro DAGs. E.g., the scale-in of the Linear DAG requires ≈ 120 sec but Grid requires ≈ 220 sec. This is because the restore time is higher due to more tasks being initialized by INIT events, leading to a subsequent impact on the Catchup and Recovery as well. Our scaling experiments show that DSM's performance deteriorates with the size of the DAG. However, our DCR and CCR migration approaches are less sensitive to the size and complexity of the dataflow, and are able to migrate all DAGs within ≈ 50 sec.

One of the factors in the restore time is the Drain Time for DCR and CCR. This value is larger for DCR than CCR, since the former waits for all events to flow through the DAG and execute, while the latter captures events that arrive at a task after the PREPARE event. This difference is proportional to the critical path length or latency of the dataflow and the input event rate. E.g., the scale-in of Grid shows a drain time of 1,875 ms for DCR and 468 ms for CCR, while its scale-out drains in 1,440 ms and 550 ms, respectively. However, this delta is smaller for micro-DAGs, with the scale-in of Linear having drain times of 905 ms for DCR and 256 ms for CCR. To verify this, we have run an experiment for a linear DAG with 50 tasks, and find that the difference in drain times is 4,352 ms, which is much higher. CCR, however, has to checkpoint the in-flight messages besides the task state, but this incremental time is small. E.g., micro-benchmarks show that it takes just 100 ms to checkpoint 2,000 events to Redis from Storm.

Yet another component of the Restore time is the rebalance duration, when the actual Storm command runs. This time remains relatively constant across dataflows, VM counts and strategies, with an average value of 7.26 secs.

The Catchup time in Fig. 5 is the time to receive the last old tuple at the sink after the migration. This is larger for DSM than CCR, and it is absent for DCR since there are no in-flight events. In DSM, lost events are replayed after their acking timeout. The old events that were discarded due to the rebalance will be re-emitted by the source after the 30 sec acking timeout occurs. Hence, there is up to a 30 sec delay for the old events to pass through the dataflow.



Figure 7: Timeline plot showing the Input and Output throughput during the scale-in of the Grid dataflow: (a) DSM, (b) DCR, (c) CCR. The migration request time is shown as "0" on the X axis.

This is clear when we examine the timeline plot for the input and output event rates shown for the scale-in of the Grid dataflow in Fig. 7a. We see spikes in the input rate at 30 sec intervals after the 200 sec point since the migration was requested. The first spike indicates the replay of the old in-flight events. The other spikes are due to the replayed old events or the newly emitted events not being processed quickly by the dataflow due to a high load, and being replayed yet again. The high output rate during this period shows that the dataflow is pushed to its limits. Fig. 6 plots the number of such events that are replayed for DSM. These range from 112 for the scale-out of Diamond to 2,083 for the scale-in of Grid. The values are larger for the application DAGs than the micro DAGs since more events are timed-out in the larger DAGs, and replayed.

In CCR, the catchup time is comparatively smaller as the old events are immediately replayed after being restored from Redis on receiving the INIT event. The catchup time is higher for the application DAGs than the micro DAGs. For larger DAGs, the number of in-flight events is likely to be higher due to the larger number of tasks and input buffers.


Figure 8: Stabilization time for different strategies: (a) Scale-in, (b) Scale-out.

Hence, the replayed event count will be larger, and these events will also require more time to reach the sink from the source task. The catchup time is almost the same for both scale-in and scale-out. One may expect some benefits with fewer VMs in scale-in due to the collocation of tasks that avoids network latency, but the round-robin Storm scheduler may not exploit this.

Fig. 5 shows the Recovery time to receive the last failed and replayed event at the sink, be they old or new replayed events. This is indicative of the dataflow approaching a steady state, and not losing and replaying events due to the migration. The Recovery time is present and high for DSM for all DAGs, for both scale-in and out. There is no recovery time required for CCR and DCR since we see no event losses to be replayed. This DSM behavior is a cascading effect of its restore and recovery times, which cause the source task to buffer more events. These events, when released, overwhelm the dataflow, causing event timeouts and replay. In fact, the recovery time can be directly correlated with the high stabilization time for DSM in Fig 8, relative to DCR and CCR; some experiments show DSM taking 60 secs longer than DCR and CCR to reach a stable output rate.

Lastly, we analyze the input and output throughputs, and the end-to-end latency for the dataflows during migration. Fig 7 is a timeline plot of the input rate for the dataflow at the source task and the output rate seen at the sink task, during the scale-in of the Grid DAG. Time 0 on the X axis indicates the start of the migration request. Fig 9 similarly shows the corresponding average event latency over a window of 10 secs seen for output events from the Grid dataflow.

The throughput plot, Fig 7, shows that the steady input rate is 8 events/sec and the output rate is 32 events/sec, as the selectivity of the Grid DAG is 1:4. We observe that the source task is paused in DCR and CCR during the migration, with a zero input rate, but not in DSM. This reduces the interleaving of old and new events after the migration and avoids event losses due to time-outs. The single input rate peak for DCR and CCR shows the backlogged events emitted when the source is resumed. As mentioned before, multiple such peaks exist for DSM.



Figure 9: Timeline plot showing the Average Latency over a moving window of 10 secs (≈ 80 events) for the scale-in of the Grid dataflow. Labels A..F on the vertical dashed lines denote the Restore Duration (A → B), Catchup Duration (B → C), Recovery Duration (C → D), and Stabilization Time (D → E). Horizontal solid RGB lines indicate the stable latency.

The output throughput has a small increase during ≈ 180-300 secs, showing the INIT events flowing, the captured events in Redis being replayed for CCR, and the dataflow being filled with the buffered events in the source task. We can also clearly see that DSM takes much longer to reach a stable output rate, at ≈ 480 secs, relative to DCR and CCR which flatten out at ≈ 320 secs.

The latency plot, Fig 9, shows three horizontal Red/Blue/Green lines that represent the median latency of the DAG when stable for the respective strategy. The vertical dashed lines indicate the event timestamps corresponding to the various metrics, as labeled in the caption. We can see that the average latency of the DAG is high for DCR and CCR between ≈ 140-300 secs, but returns to the steady latency beyond that. However, with DSM, we reach this much later, at ≈ 390 secs. Similarly, the other metrics reported in Fig 5 and the analysis above can be correlated here.

Summary. Our results show that CCR can be used for reliable DAG migration with a restore time of less than 27 sec for any DAG size, during which time we do not see any output events. Also, it can catch up with the old events within 50 secs, and the output rates stabilize within 140 secs. These indicate a rapid completion of migration for Storm dataflows that allows them to exploit even the per-minute and per-second billing that Cloud providers are offering. This is also important for latency-sensitive streaming applications. DCR can be preferred if we need guarantees that old events from before the migration must be processed separately, and not interleaved with new events. This may happen if the dataflow logic is being changed as part of the migration. However, its drain time is sensitive to the critical path of the DAG and the input event rate. DSM performs uniformly badly across all metrics. We see little difference in the impact of either scaling in or scaling out, so our migration techniques can be easily adapted to enact diverse elastic scheduling scenarios for streaming applications.

6. Related Work

Several papers have examined elasticity for stream processing systems. This is true not just within Clouds, but also for Edge-based systems like [13] that motivate dynamic migration of tasks due to edge resource outages.

Stela [6], built on Apache Storm, does on-demand scaling and is similar to our work. It optimizes the dataflow throughput while limiting the interruption of the running application. The paper assumes that operators are stateless in Storm, which avoids issues of state handling and consistency. It too uses convergence time, similar to our stabilization time, as the comparison metric, but does not take message delivery guarantees or message failures due to migration into account. It uses the effective throughput as a measure to decide if an operator has to be migrated for the given resources. We do not focus on the resource allocation problem here, but only on reliable migration once the allocation is decided.

E-Storm [14] has extended the default Redis-based state preservation in Storm to a replication-based state management that actively maintains multiple state backups on different nodes. The recovery module then restores the lost state by inter-task state transfer. This can only mask the loss of task states in case of JVM and host crashes, but not when the DAG is migrated during scaling in/out. They claim improvements in throughput by avoiding access to a remote data store for the state.

Esc [15] considers events as key-value pairs. Dynamic load balancing is done by mapping keys to different VMs using hash functions. The execution model instantiates multiple tasks at runtime as needed. While it can dynamically adjust the Cloud resources based on the current load, it uses its custom streaming platform. We instead investigate supporting reliable elasticity within the popular Storm DSPS.

Gedik, et al. [7] have proposed elastic auto-parallelization for balancing the dynamic load in the case of streaming applications. They dynamically adjust the data parallelism to handle congestion using a control algorithm. An incremental migration protocol is proposed for maintaining states while scaling, minimizing the amount of state transfer between hosts. But their evaluation does not consider message reliability during the scaling and state migration operations.

ElasticStream [5] uses a hybrid Cloud model for stream processing to address a lack of resources on the local stream computing cluster. They dynamically adjust the resources required to respond to dynamic rates, with the goal of minimizing the cost of using the Cloud while satisfying their QoS. The implementation on IBM's System S is able to assign or remove computational resources dynamically. Though the system transfers data stream processing to the Cloud, they do not address message reliability and state handling during the migration. This may lead to the loss of in-flight messages and the internal state of stateful tasks during migration.

Our prior work [16] has proposed the use of alternate tasks to allow changes to the streaming dataflow structure, and uses these to adapt to the varying performance of Cloud VMs. Dynamic input rates are managed by allocating resources for alternate tasks, thus making a trade-off between cost and QoS on the Cloud.


dataflow structures but can adapt to changes in input ratesand support elastic migration.

A topic related to the elasticity of streaming platforms is VM migration in the Cloud, contrasting PaaS and IaaS approaches. [17] have proposed a VM migration algorithm, Scattered, that minimizes VM migrations in over-committed data centers. VMs are migrated to under-utilized physical hosts to balance the data center's utilization. Voorsluys, et al. [18] evaluate the effects of live VM migration on the performance of running applications. Others [19] have proposed a 3-phase VM migration approach: suspend, copy and resume, where the VM memory is first captured, the VM is suspended at the origin host, and then the VM's configuration and memory state are transferred to the destination host. During resume, the memory state is restored from the snapshot and execution is resumed. This basic idea is quite similar to our CCR approach.
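The same suspend-copy-resume pattern applies at the level of a stateful task. The following sketch illustrates that generic three-phase flow; it is a simplified illustration under our own assumptions (StatefulTask, suspendAndSnapshot, and resumeFrom are hypothetical helpers), not the actual CCR implementation or any Storm API.

    import java.io.Serializable;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative three-phase migration of one stateful task:
    // (1) suspend processing and capture state, (2) copy the snapshot to the
    // target host, (3) restore the snapshot there and resume execution.
    class StatefulTask {
        private final AtomicBoolean running = new AtomicBoolean(true);
        private Serializable state;  // the task's internal state

        Serializable suspendAndSnapshot() {   // phase 1: suspend + capture
            running.set(false);
            return state;
        }

        void resumeFrom(Serializable snapshot) {  // phase 3: restore + resume
            this.state = snapshot;
            running.set(true);
        }
    }

    class MigrationDriver {
        // Assumed transport for the snapshot, e.g. a durable store or a direct copy.
        static void copyToTarget(Serializable snapshot, StatefulTask target) {
            target.resumeFrom(snapshot);          // phases 2 and 3
        }

        public static void main(String[] args) {
            StatefulTask origin = new StatefulTask();
            StatefulTask target = new StatefulTask();
            Serializable snapshot = origin.suspendAndSnapshot();
            copyToTarget(snapshot, target);
        }
    }

Suspending before the snapshot is what prevents in-flight state changes from being lost, which is the reliability property the three-phase pattern is meant to provide.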

PaaS Cloud providers have to manage the resources of customer applications to trade off IaaS costs and QoS. CloudScale [20] adjusts the resources assigned to each VM in a given host using a workload predictor. When the forecasted load for a host exceeds its capacity, the VM is migrated. [21] propose an algorithm that analyzes the negative impact of VM migration on an IaaS provider. An optimization problem is formalized and the best solution is found using a hill-climbing search.

Proactive and preventive checkpointing has been explored for large HPC systems [22] to increase the computational efficiency during failures. Formal models have been proposed for failure prediction based on the checkpoint cost, the failure distribution, and the probability of success of the proactive action. [23] have used an incremental Checkpoint/Restart approach that tries to reduce the large memory use by switching to reliable storage for full checkpoints.

7. Conclusions

In this paper, we have presented dynamic migration techniques for streaming dataflows that help fast data platforms rapidly exploit the elasticity offered by Clouds. We have implemented and validated the DCR and CCR strategies that we have proposed using the existing rebalance and state checkpointing mechanisms available in Apache Storm, and compared them with its native migration feature. Our validations using micro and application DAGs, for scaling out and in on Cloud VMs, show significant benefits of both these approaches over DSM. CCR, in particular, is able to restore the dataflow state and behavior after migration within 50 sec, while the default approach takes over 100 secs and grows with the DAG size. This makes CCR beneficial for fine-grained elasticity and cost reduction on pay-as-you-go Cloud IaaS, while not compromising the performance of such fast data applications. The uses of this capability are many. We can further extend DAG migration to interesting problems like updating the task logic by re-wiring the DAG on the fly, migrating tasks due to insufficient storage availability, and dynamically updating DAG resources to meet latency requirements.

References

[1] Y. Simmhan, S. Aman, A. Kumbhare, R. Liu, S. Stevens, Q. Zhou, and V. Prasanna, “Cloud-based software platform for big data analytics in smart grids,” Computing in Science Engineering (CiSE), vol. 15, no. 4, pp. 38–47, 2013.

[2] D. Laney, “3d data management: Controlling data volume, velocity and variety,” META Group Research Note, vol. 6, p. 70, 2001.

[3] B. Peng, M. Hosseini, Z. Hong, R. Farivar, and R. Campbell, “R-storm: Resource-aware scheduling in storm,” in Middleware Conference. ACM, 2015, pp. 149–161.

[4] A. Shukla and Y. Simmhan, “Model-driven scheduling for distributed stream processing systems,” arXiv preprint arXiv:1702.01785, p. 54, 2017.

[5] A. Ishii and T. Suzumura, “Elastic stream computing with clouds,” in International Conference on Cloud Computing, ser. CLOUD. IEEE Computer Society, 2011, pp. 195–202.

[6] L. Xu, B. Peng, and I. Gupta, “Stela: Enabling stream processing systems to scale-in and scale-out on-demand,” in IEEE International Conference on Cloud Engineering (IC2E). IEEE, 2016, pp. 22–31.

[7] B. Gedik, S. Schneider, M. Hirzel, and K.-L. Wu, “Elastic scaling for data stream processing,” TPDS, vol. 25, no. 6, pp. 1447–1463, 2014.

[8] L. Fischer and A. Bernstein, “Workload scheduling in distributed stream processors using graph partitioning,” in Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015, pp. 124–133.

[9] “Guaranteeing message processing,” http://storm.apache.org/releases/1.1.0/Guaranteeing-message-processing.html, April 2017.

[10] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holderbaugh, Z. Liu, K. Nusbaum, K. Patil, B. J. Peng et al., “Benchmarking streaming computation engines: Storm, flink and spark streaming,” in Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International. IEEE, 2016, pp. 1789–1792.

[11] T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle, “Millwheel: fault-tolerant stream processing at internet scale,” Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 1033–1044, 2013.

[12] A. Biem, E. Bouillet, H. Feng, A. Ranganathan, A. Riabov, O. Verscheure, H. Koutsopoulos, and C. Moran, “Ibm infosphere streams for scalable, real-time, intelligent transportation services,” in ACM SIGMOD, 2010.

[13] I. Lujic, V. De Maio, and I. Brandic, “Efficient edge storage management based on near real-time forecasts,” in ICFEC. IEEE, 2017, pp. 21–30.

[14] X. Liu, A. Harwood, S. Karunasekera, B. Rubinstein, and R. Buyya, “E-storm: Replication-based state management in distributed stream processing systems,” in Parallel Processing (ICPP). IEEE, 2017, pp. 571–580.

[15] B. Satzger, W. Hummer, P. Leitner, and S. Dustdar, “Esc: Towards an elastic stream computing platform for the cloud,” in IEEE CLOUD, 2011.

[16] A. Kumbhare, Y. Simmhan, and V. K. Prasanna, “Exploiting application dynamism and cloud elasticity for continuous dataflows,” in IEEE/ACM Supercomputing, 2013.

[17] X. Zhang, Z.-Y. Shae, S. Zheng, and H. Jamjoom, “Virtual machine migration in an over-committed cloud,” in Network Operations and Management Symposium (NOMS). IEEE, 2012, pp. 196–203.

[18] W. Voorsluys, J. Broberg, S. Venugopal, and R. Buyya, “Cost of virtual machine live migration in clouds: A performance evaluation,” in Proceedings of the 1st International Conference on Cloud Computing, ser. CloudCom, 2009, pp. 254–265.

[19] M. Zhao and R. J. Figueiredo, “Experimental study of virtual machine migration in support of reservation of cluster resources,” in Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, ser. VTDC, 2007, pp. 5:1–5:8.

[20] Z. Shen, S. Subbiah, X. Gu, and J. Wilkes, “Cloudscale: elastic resource scaling for multi-tenant cloud systems,” in Cloud Computing. ACM, 2011, p. 5.

[21] E. Casalicchio, D. A. Menasce, and A. Aldhalaan, “Autonomic resource provisioning in cloud systems with availability goals,” in Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference, ser. CAC ’13. ACM, 2013, pp. 1:1–1:10.


[22] M. S. Bouguerra, A. Gainaru, L. B. Gomez, F. Cappello, S. Matsuoka, and N. Maruyama, “Improving the computing efficiency of hpc systems using a combination of proactive and preventive checkpointing,” in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE, 2013, pp. 501–512.

[23] N. Naksinehaboon, Y. Liu, C. Leangsuksun, R. Nassar, M. Paun, and S. L. Scott, “Reliability-aware approach: An incremental checkpoint/restart model in hpc environments,” in Cluster Computing and the Grid, 2008. CCGRID’08. 8th IEEE International Symposium on. IEEE, 2008, pp. 783–788.