
Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics

Benjamin Heintz, Department of Computer Science, University of Minnesota, [email protected]

Abhishek Chandra, Department of Computer Science, University of Minnesota, [email protected]

Ramesh K. Sitaraman, Department of Computer Science, UMass Amherst & Akamai Tech., [email protected]

Abstract

Many applications must ingest rapid data streams and produce analytics results in near-real-time. It is increasingly common for inputs to such applications to originate from geographically distributed sources. The typical infrastructure for processing such geo-distributed streams follows a hub-and-spoke model, where several edge servers perform partial computation before forwarding results over a wide-area network (WAN) to a central location for final processing. Due to limited WAN bandwidth, it is not always possible to produce exact results. In such cases, applications must either sacrifice timeliness by allowing delayed (i.e., stale) results, or sacrifice accuracy by allowing some error in final results.

In this paper, we focus on windowed grouped aggregation, an important and widely used primitive in streaming analytics, and we study the tradeoff between staleness and error. We present optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we present practical online algorithms for effectively trading off timeliness and accuracy under bandwidth limitations. Using a workload derived from an analytics service offered by a large commercial CDN, we demonstrate the effectiveness of our techniques through both trace-driven simulation as well as experiments on an Apache Storm-based implementation deployed on PlanetLab. Our experiments show that our proposed algorithms reduce staleness by 81.8% to 96.6%, and error by 83.4% to 99.1%, compared to a practical random sampling/batching-based aggregation algorithm across a diverse set of aggregation functions.

SoCC '16, October 05-07, 2016, Santa Clara, CA, USA. ACM ISBN 978-1-4503-4525-5/16/10. DOI: http://dx.doi.org/10.1145/2987550.2987580

Categories and Subject Descriptors C.2.4 [Computer-Communication Networks]: Distributed Systems (Distributed Applications)

Keywords Geo-distributed systems, stream processing, aggregation, approximation

1. Introduction

Stream computing has emerged as a critically important topic in recent years. Whether comprising sensor data from smart homes, user interaction logs from streaming video clients, or server logs from a content delivery network, rapid streams of data represent rich sources of meaningful and actionable information. The key challenge lies in extracting this information quickly, while it is still relevant. Modern streaming analytics systems therefore face the daunting task of ingesting massive amounts of data and performing computation in near-real-time. Several scalable systems have been proposed to address this challenge [2, 3, 6, 10, 12, 26, 31, 36].

Adding to the challenge is the fact that many interesting data streams are geographically distributed. For example, smart home sensor data and CDN log data originate from sources that are physically fixed at diverse locations all around the globe; such data are truly "born distributed" [33]. The distributed infrastructure of a typical geo-distributed analytics service such as Google Analytics or Akamai Media Analytics follows a hub-and-spoke model. In this architecture, numerous distributed data sources send their streams of data to nearby "edge" servers, which perform partial computation before forwarding results to a central location responsible for performing any remaining computation, storing results, and serving responses to analytics users' queries. While this central location is often a well-provisioned data center, resources are typically more limited at the edge locations. Perhaps most critically, the wide-area network connection between edges and the center may have highly constrained bandwidth. Geo-distributed analytics systems have received growing research attention in recent years [24, 30, 32-34].


A crucial question for a geo-distributed analytics system is how best to utilize resources at both the edges and the center in order to deliver timely results. In particular, a geo-distributed analytics system must determine how much computation to perform at the edges, and how much to leave for the center (i.e., where to compute), as well as when to send partial results from edges to the center.

In this paper, we examine these questions in the context of windowed grouped aggregation, a key primitive for streaming analytics, where data streams are logically subdivided into non-overlapping time windows within which records are grouped by common attributes, and summaries are computed for each group. This pattern is ubiquitous in both streaming and batch settings. For example, the SQL Group By construct allows concise expression of grouped aggregation, while grouped aggregation represents the fundamental abstraction underlying MapReduce. Although it may seem restrictive, a surprisingly broad range of applications fit into the grouped aggregation pattern [10].

Our prior work [22] examined questions of where to compute and when to communicate in the context of exact computation. It presented the design of algorithms for performing windowed grouped aggregation in order to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result for a time window), which are respectively measures of cost [4, 21] and performance. There, we assumed that (a) applications require exact results, and (b) resources, WAN bandwidth in particular, were sufficient to deliver exact results. In general, however, these assumptions do not always hold. For one, it is not always feasible to compute exact results with bounded staleness [32]. Further, many real-world applications can tolerate some staleness or inaccuracy in their final results, albeit with diverse preferences. For instance, a network administrator may need to be alerted to potential network overloads quickly (within a few seconds or minutes), even if there is some degree of error in the results describing network load. On the other hand, a Web analyst might have only a small tolerance for error (say, <1%) in the application statistics (e.g., number of page hits), but be willing to wait for some time to obtain these results with the desired accuracy.

In this paper, we study the staleness-error tradeoff, recognizing that applications have diverse requirements: some may tolerate higher staleness in order to achieve lower error, and vice versa. We devise both theoretically optimal as well as practical algorithms to solve two complementary problems: minimize staleness under an error constraint, and minimize error under a staleness constraint. Our algorithms enable geo-distributed streaming analytics systems to support a diverse range of application requirements, whether WAN capacity is plentiful or highly constrained.

Research Contributions

• We study the tradeoff between staleness and error (measures of timeliness and accuracy, respectively) in a geo-distributed stream analytics setting.

• We present optimal offline algorithms that allow us to optimize staleness (resp., error) under an error (resp., staleness) constraint.

• Using these offline algorithms as references, we present practical online algorithms to efficiently trade off staleness and error. These practical algorithms are based on the key insight of representing grouped aggregation at the edge as a two-level cache. This formulation generalizes our caching-based framework for exact windowed grouped aggregation [22] by introducing cache partitioning policies to identify which partial results must be sent and which ones can be discarded.

• We demonstrate the practicality and efficacy of these algorithms through both trace-driven simulations and implementation in Apache Storm [3], deployed on a PlanetLab [1] testbed. Using workloads derived from traces of a popular web analytics service offered by Akamai [28], a large commercial content delivery network, our experiments show that our algorithms reduce staleness by 81.8% to 96.6%, and error by 83.4% to 99.1%, compared to a practical random sampling/batching-based aggregation algorithm.

• We demonstrate that our techniques apply across a diverse set of aggregates, from distributive and algebraic aggregates [20] such as Sum and Max to holistic aggregates such as unique count (via sketch data structures such as HyperLogLog [18]).

2. Problem Formulation

2.1 Problem Statement

System Model

We consider the typical hub-and-spoke architecture of an analytics system with a center and multiple edges. Data streams are first sent from each source to a nearby edge. The edges collect and (optionally, partially) aggregate the data. This aggregated data can then be sent from the edges to the center, where any remaining aggregation takes place. The final aggregated results are available at the center. Users of the analytics service query the center to visualize their analytics results. To perform grouped aggregation, each edge runs a local aggregation algorithm: it acts independently to decide when and how much to aggregate the incoming data. (Coordination between edges is an interesting topic [13] for future work, but is outside the scope of the present paper.)

Windowed Grouped Aggregation over Data Streams

A data stream comprises records of the form (k, v), where k is the key and v is the value. Data records of a stream arrive at the edge over time. Each key k can be multi-dimensional, with each dimension corresponding to a data attribute. A group is a set of records sharing the same key.

Customarily, time is divided into non-overlapping intervals, or windows, of user-specified length W.1 Windowed grouped aggregation over a time window [T, T+W) is then defined as follows from an input/output perspective. The input is the set of data records that arrive within the time window. The output is determined by first placing the data records into groups, where each group is a set of records with the same key. For each group {(k, vi) : 1 ≤ i ≤ n} corresponding to the n records in the time window that have key k, an aggregate value Vk = v1 ⊕ v2 ⊕ ··· ⊕ vn is computed, where ⊕ is an application-defined associative binary operator; e.g., Sum, Max, or HyperLogLog merge.2
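To make this primitive concrete, the following minimal Python sketch (hypothetical helper names, not the paper's implementation) groups the records of one window by key and folds their values with an application-supplied associative operator ⊕:

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")

def windowed_grouped_aggregation(
    records: Iterable[Tuple[K, V]],   # records (k, v) arriving in [T, T+W)
    op: Callable[[V, V], V],          # associative binary operator ⊕ (e.g., Sum, Max)
) -> Dict[K, V]:
    """Return the exact aggregate Vk = v1 ⊕ v2 ⊕ ... ⊕ vn for each key k."""
    aggregates: Dict[K, V] = {}
    for k, v in records:
        aggregates[k] = op(aggregates[k], v) if k in aggregates else v
    return aggregates

# Example: Sum and Max over the same window of records.
window = [("us/page1", 3), ("us/page1", 5), ("de/page2", 7)]
print(windowed_grouped_aggregation(window, lambda a, b: a + b))  # {'us/page1': 8, 'de/page2': 7}
print(windowed_grouped_aggregation(window, max))                 # {'us/page1': 5, 'de/page2': 7}
```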

To compute windowed grouped aggregation, the distributed infrastructure can perform aggregation at both the edge and the center. The records that arrive at the edge can be partially aggregated there, and the edge can maintain a set of partial aggregates, one for each distinct key k. The edge may transmit, or flush, these aggregates to the center; we refer to these flushed records as updates. The center can further apply the aggregation operator ⊕ on incoming updates as needed in order to generate the final aggregate result. We assume that the computational overhead of the aggregation operator is a small constant compared to the network overhead of transmitting an update.

Approximate Windowed Grouped Aggregation

An aggregation algorithm runs on the edge and takes as input the sequence of arrivals of data records in a given time window [T, T+W). The algorithm produces as output a sequence of updates that are sent to the center.

For each distinct key k with nk > 0 arrivals in the time window, suppose that the ith data record (k, vi,k) arrives at time ai,k, where T ≤ ai,k < T+W and 1 ≤ i ≤ nk. For each such key k, the output of the aggregation algorithm is a sequence of mk updates, where 0 ≤ mk ≤ nk. The jth update (k, v̂j,k) departs3 for the center at time dj,k, where 1 ≤ j ≤ mk. This update aggregates all values for key k that have arrived but have not yet been included in an update.

The final, possibly approximate, aggregated value for key k is given by V̂k = v̂1,k ⊕ v̂2,k ⊕ ··· ⊕ v̂mk,k. Approximation arises in two cases. The first is when, for a key k with nk > 0 arrivals, no update is flushed; i.e., mk = 0. This leads to an omission error for key k. The second is when at least one record arrives for key k after the final update is flushed; i.e., when dmk,k < ank,k. This leads to a residual error for key k.

1 These are often called tumbling windows in analytics terminology.
2 More formally, ⊕ is any associative binary operator such that there exists a semigroup (S, ⊕).
3 Upon departure from the edge, an update is handed off to the network for transmission.

Optimization Metrics

Staleness is defined as the smallest time interval s such that the results of grouped aggregation for the time window [T, T+W) are available at the center at time T+W+s. In other words, staleness quantifies the time elapsed from when the window closes to when the last update for that window reaches the center and is included in the final aggregate.

Over a window, we define the per-key error ek for a key k to be the difference between the final aggregated value V̂k and its true aggregated value Vk: ek = error(V̂k, Vk), where error is application-defined.4 Error is then defined as the maximum error over all keys: Error = maxk ek. Such error arises when an algorithm flushes updates for some keys prior to receiving all arrivals for those keys, or when it omits flushes for some keys altogether.5
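As a minimal illustration (hypothetical helper names, assuming an absolute error function), the per-key and overall error metrics can be computed as follows; keys with no flushed update contribute their full omission error via the logical zero value:

```python
def absolute_error(approx: float, true: float) -> float:
    # One possible application-defined error function.
    return abs(approx - true)

def window_error(flushed: dict, true: dict, zero: float = 0.0) -> float:
    """Error = max over all keys of error(V̂k, Vk); omitted keys use a logical zero."""
    if not true:
        return 0.0
    return max(absolute_error(flushed.get(k, zero), true[k]) for k in true)

true = {"a": 10.0, "b": 4.0, "c": 1.0}
flushed = {"a": 9.5, "c": 1.0}      # "b" was omitted entirely
print(window_error(flushed, true))   # 4.0 (the omission error on "b" dominates)
```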

As we show in the next section, staleness and error are fundamentally opposing requirements. The goal of an aggregation algorithm is therefore either to minimize staleness given an error constraint, or to minimize error given a staleness constraint.

2.2 The Staleness-Error Tradeoff

Mechanics of the Tradeoff

To understand the mechanics behind the staleness-error tradeoff, it is useful to consider what causes staleness and error in the first place. Recall from Section 2.1 that staleness for a given time window is defined as the delay between the end of that window and the time at which the last update for that window reaches the center.6 Error is caused by any updates that are not delivered to the center, either because the edge sent only partial aggregates, or because the edge omitted some aggregates altogether.

Intuitively, the two main causes of high staleness are either using too much network bandwidth (causing delays due to network congestion), or delaying transmissions until the end of the window. Thus, we can reduce staleness by avoiding some updates altogether (to avoid network congestion), and by sending updates earlier during the window. Unfortunately, both of these options for reducing staleness lead directly to sources of error. First, if no update is ever sent for a given key, then the center never gets any aggregate value for this key, leading to an omission error for the key. Second, if the last update for a key is scheduled prior to the final arrival for that key, then the center will see only a partial aggregate, leading to a residual error for that key. Thus, we see that there is a fundamental tradeoff between achieving low staleness and low error.

4 Our work applies to both absolute and relative error definitions.
5 Our work also applies to approximate value types, as long as the error function appropriately incorporates this value-level approximation.
6 Wide-area latency contributes to this delay, but this component of delay is a function of the underlying network infrastructure and is not something we can directly control through our algorithms. We therefore focus our attention on network delays due to wide-area bandwidth constraints.


Challenges in Optimizing Along the Tradeoff Curve

To understand the challenges of optimizing for either of these metrics while bounding the other, consider two alternate approaches to grouped aggregation: streaming, which immediately sends updates to the center without any aggregation at the edge; and batching, which aggregates all data during a time window at the edge and only flushes updates to the center at the end of the window. When bandwidth is sufficient, streaming can deliver extremely low staleness and high accuracy, as arrivals are flushed to the center without additional delay, and updates reach the center quickly. When bandwidth is constrained, however, it can lead to high staleness and error. This is because it fails to take advantage of edge resources to reduce the volume of traffic flowing across the wide-area network, leading to high congestion, unbounded network delays, and in turn unbounded staleness.

Batching, on the other hand, leverages computation at edge resources to reduce network bandwidth consumption. When the end of the window arrives, a batching algorithm has the final values for every key on hand, and it can prioritize important updates in order to reach an error constraint quickly, or to reduce error rapidly until reaching a staleness constraint. However, batching introduces delays by deferring all flushes until the end of the window, leading to high staleness for any given error.

One alternate approach is a random sampling algorithm that prioritizes important transmissions after the end of the window, but sends a subset of aggregate values selected randomly during the window. This approach improves over batching by sending updates earlier during the window, thus reducing staleness. It also improves over streaming under bandwidth constraints by reducing the network traffic. As we show in Sections 3 and 4, however, this approach can still yield high error and staleness due to the lack of prioritization among different updates during the window.

To satisfy our optimization goals, we need more principled approaches.

3. Offline Algorithms

We now consider the two complementary optimization problems of minimizing staleness (resp., error) under an error (resp., staleness) constraint. Before we present practical online algorithms to solve these problems, we consider optimal offline algorithms. Although these offline algorithms cannot be applied directly in practice, they serve both as baselines for evaluating the effectiveness of our online algorithms, and as design inspiration, helping us to identify heuristics that practical online algorithms might employ in order to emulate optimal algorithms.

In this section, we provide proofs only for the main theorems and omit proofs for other lemmas due to space constraints. For details, see our technical report [23].

3.1 Minimizing Staleness (Error-Bound)

The first optimization problem we consider is minimizing staleness under an error constraint (error ≤ E), where E is an application-specified error tolerance value.

In this case, the goal is to flush only as many updates as are strictly required, and to flush each of these updates as early as possible such that the error constraint is satisfied. In short, an offline optimal algorithm achieves this by flushing an update for each key as soon as the aggregate value for that key falls within the error constraint.

Throughout this section, consider the time window of length W beginning at time T, let ⊕ denote our binary aggregation function, and let n denote the number of unique keys arriving during the current window. Define the prefix aggregate Vk(t) for key k at time t to be the aggregate value of all arrivals for key k during the current window prior to time t. We define the prefix aggregate Vk(t) to have a logical zero value prior to the first arrival for key k. Let error(x̂, x) denote the error of the aggregate value x̂ with respect to the true value x. Further, define the prefix error ek(t) for key k at time t to be the error of the prefix aggregate for key k with respect to the true final aggregate: ek(t) = error(Vk(t), Vk(T+W)), for T ≤ t < T+W. We refer to the prefix error of key k at the beginning of the window, ek(T), as the initial prefix error of key k.

Definition 1 (Eager Prefix Error). Given an error constraint E, the Eager Prefix Error (EPE) algorithm flushes each key k at the first time t such that ek(t) ≤ E. If ek(T) ≤ E, then EPE avoids flushing key k altogether.

Theorem 1. The EPE algorithm is optimal.

Proof. Such an algorithm satisfies the error constraint by construction, and because flushes are issued as early as possible for each key, and only if strictly necessary, there cannot exist another schedule that achieves lower staleness.
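The following sketch (hypothetical names; an offline reference only, since computing prefix errors requires the true final aggregates) illustrates an EPE schedule, assuming a Sum aggregation:

```python
def epe_schedule(arrivals, true_final, error_fn, E, zero=0.0):
    """Offline Eager Prefix Error: flush key k at the first time its prefix error <= E.

    arrivals: list of (time, key, value) within [T, T+W), sorted by time
    true_final: dict key -> true final aggregate Vk(T+W)
    Returns dict key -> flush time; keys whose initial prefix error is already <= E
    are never flushed.
    """
    prefix = {}        # running prefix aggregate Vk(t)
    flush_time = {}
    # Keys whose logical-zero prefix already satisfies the constraint are omitted.
    pending = {k for k in true_final if error_fn(zero, true_final[k]) > E}
    for t, k, v in arrivals:
        prefix[k] = prefix[k] + v if k in prefix else v   # Sum as the operator ⊕
        if k in pending and error_fn(prefix[k], true_final[k]) <= E:
            flush_time[k] = t        # flush as early as possible
            pending.discard(k)
    return flush_time
```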

3.2 Minimizing Error (Staleness-Bound)

Next, we consider the optimization problem of minimizing error under a staleness constraint (staleness ≤ S), where S is an application-specified staleness tolerance (or deadline).

To minimize error under a staleness constraint, we abstract the wide-area network as a sequence of contiguous slots, each representing the ability to transmit a single update. The duration of a single slot in seconds is then 1/b, where b represents the available network bandwidth in updates per second.7 If S denotes the staleness constraint in seconds, then the final slot ends S seconds after the end of the window. Figure 1 illustrates this model for the time window [T, T+W).

7 In general, bandwidth need not be constant.

[Figure 1: a timeline from T (window begins) through T+W (window ends) to the deadline T+W+S, divided into n network slots; slots before T+s carry the previous window's results.] Figure 1. We view the network as a sequence of contiguous slots. Shaded slots are unavailable to the current window due to the previous window's staleness.

Note that there is no reason to flush a key more than once during any given window, as any updates prior to the last could simply be aggregated into the final update before it is flushed. Given this fact, we can focus on scheduling flushes for the n unique keys that arrive during the window by assigning each into one of n slots.

Note that flushes from the previous window occupy the network for the first s seconds of the current window, where s ≤ S is the staleness of the previous window. These slots are therefore unavailable to the current window, and assigning a key to such a slot has the effect of sending no value for that key. In general, the first slot of the current window may begin prior to this time, or in fact prior to the beginning of the current window.

To understand how we assign keys to slots, we must first introduce the notion of potential error. The potential error8 Ek(t) for key k at time t is defined to be the error that the center would see for key k if key k were assigned to a slot beginning at time t. Recall that, for t < T+s, slots are unavailable, as the network is still in use transmitting updates from the previous window. Assigning a key to such a slot therefore has the effect of sending no value for that key. Similarly, if a key is assigned to a slot prior to that key's first arrival, there is no value to send, so making such an assignment is equivalent to sending no value for that key. We refer to either of these cases as omitting a key. Overall then, potential error is given by

Ek(t) = ek(T) if t < T+s, and Ek(t) = ek(t) otherwise.

We assume a monotonicity property: the potential error Ek(t) of a key k at any time t is no larger than its potential error Ek(t′) at a prior time t′ < t within the window; i.e., Ek(t) ≤ Ek(t′).9

Definition 2 (Smallest Potential Error First). The Smallest Potential Error First (SPEF) algorithm iterates over the n slots beginning with the earliest, assigning to each slot a key with the smallest potential error that has not yet been assigned to a slot.

8 Throughout this section, we discuss error with respect to the window [T, T+W).
9 Note that we can easily extend our algorithms to relax this monotonicity property. See our technical report [23] for details.

Lemma 1. A schedule D which flushes keys i and j at times ti and tj such that ti ≤ tj and Ei(ti) ≤ Ej(ti) (i.e., in SPEF order) cannot have higher error than a schedule D′ that swaps these keys.

Lemma 2. A schedule D that flushes only m < n keys can be transformed into another schedule D′ that flushes n keys without increasing staleness or error.

Theorem 2. The SPEF algorithm is optimal.

Proof. The SPEF algorithm satisfies the staleness constraint due to our definition of slots. To see why it minimizes error, let DSPEF denote a sequence of flushes in smallest-potential-error-first order. Let Dopt denote an optimal sequence of flushes sharing the longest common prefix with DSPEF. If DSPEF and Dopt are not already identical, then there exists an index m at which Dopt and DSPEF differ for the first time, and m < n, where n is the number of unique keys during the window and hence the length of DSPEF.

Assume that Dopt also has length n. (If not, then transform it according to Lemma 2.) There must then exist an index m < p ≤ n such that Em(tm) ≥ Ep(tm). In other words, in slot m, Dopt emits a key that does not have the smallest potential error. By Lemma 1, we can transform Dopt into D′opt by swapping keys m and p without increasing error or staleness. In particular, if p is chosen to minimize Ei(tm) over m < i ≤ n, then this swap yields a D′opt that is optimal and identical to DSPEF up to index m+1. This exposes a contradiction: it could not have been true that Dopt had the longest prefix in common with DSPEF. Therefore m cannot be less than n: there must exist some optimal departure sequence identical to DSPEF.
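To make the slot model and the SPEF order concrete, the sketch below (hypothetical names; offline, since potential errors depend on the true final aggregates) builds the n contiguous slots ending at the deadline T+W+S and assigns keys greedily:

```python
def spef_schedule(keys, potential_error, T, W, S, b):
    """Offline SPEF: greedily give each slot the key with the smallest potential error.

    potential_error(k, t): Ek(t), i.e., ek(T) for unavailable slots (t < T+s)
        and ek(t) otherwise
    b: bandwidth in updates per second, so each slot lasts 1/b seconds
    """
    n = len(keys)
    # n contiguous slots, the last of which ends at the deadline T + W + S.
    slot_starts = [T + W + S - (n - i) / b for i in range(n)]
    unassigned = set(keys)
    schedule = {}
    for t in slot_starts:
        k = min(unassigned, key=lambda key: potential_error(key, t))
        schedule[k] = t
        unassigned.discard(k)
    return schedule
```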

An Alternate Optimal Algorithm

Recall that assigning a key to a slot prior to the first arrival for that key, or before the network is available, is equivalent to sending no value for that key; i.e., omitting that key. By studying how the SPEF algorithm schedules such omissions, we can derive an alternative optimal algorithm that is more amenable to emulation in an online setting.

Definition 3 (SPEF with Early Omissions). Let m be the number of keys that the SPEF algorithm omits. Then the SPEF with Early Omissions (SPEF-EO) algorithm first omits the m keys with the smallest initial prefix errors, then assigns the remaining n−m keys in SPEF order.

Lemma 3. A key omitted by the SPEF algorithm has, at the time it is omitted, the smallest initial prefix error among all not-yet-assigned keys.

Theorem 3. The SPEF-EO algorithm is optimal.

Proof. By Lemma 3, the final key omitted by the SPEF algorithm contributes the greatest error of all prior omissions, and this error is no less than the mth-smallest initial prefix error. Error therefore cannot be reduced by flushing (i.e., not omitting) any of the m keys with the smallest initial prefix errors.

The remaining n−m slots occur no earlier than the n−m slots carrying non-zero updates in the original algorithm. By monotonicity, assigning the remaining n−m keys to these slots therefore cannot increase error.
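A minimal sketch of SPEF-EO on top of the spef_schedule sketch above (hypothetical names; initial_prefix_error(k) stands for ek(T), and m is the number of keys that plain SPEF would omit):

```python
def spef_eo_schedule(keys, potential_error, initial_prefix_error, T, W, S, b, m):
    """SPEF-EO: omit the m keys with the smallest initial prefix errors ek(T),
    then schedule the remaining n - m keys in SPEF order over the last n - m slots."""
    by_initial_error = sorted(keys, key=initial_prefix_error)
    omitted, kept = by_initial_error[:m], by_initial_error[m:]
    return omitted, spef_schedule(kept, potential_error, T, W, S, b)
```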

3.3 High-Level Lessons

These offline optimal algorithms, although not applicable in practical online settings, leave us with several high-level lessons. First, they flush each key at most once, thereby avoiding wasting scarce network resources. Second, they make the best possible use of network resources. In the error-bound case, EPE achieves this by sending only keys that have already satisfied the error constraint. For the staleness-bound case, with SPEF and SPEF-EO, this means using each unit of network capacity (i.e., each slot) to send the most up-to-date value for the key with the minimum potential error, and for SPEF-EO, sending only those keys with the largest initial prefix errors.

4. Online Algorithms

We are now prepared to consider practical online algorithms for achieving near-optimal staleness-error tradeoffs. The offline optimal algorithms from Section 3 serve as useful baselines for comparison, and they also provide models to emulate. We first present our practical online algorithms and then present a prototype implementation on Apache Storm.

4.1 The Two-Level Cache Abstraction

Exact Computation

Our online algorithms for approximate windowed grouped aggregation generalize those for exact computation [22], where we view the edge as a cache of aggregates and emulate offline optimal algorithms through cache sizing and eviction policies. When a record arrives at the edge, it is inserted into the cache, and its value is merged via aggregation with any existing aggregate sharing the same key. The sizing policy, by dictating how many aggregates may reside in the cache, determines when aggregates are evicted. For exact computation, an ideal sizing policy allows aggregates to remain in the cache until they have reached their final value, while avoiding holding them so long as to lead to high staleness. The eviction policy determines which key to evict, and an ideal policy for exact computation selects keys with no future arrivals. Upon eviction, keys are enqueued to be transmitted over the WAN, which we assume services this queue in FIFO order.

Approximate Computation

For approximate windowed grouped aggregation, we continue to view the edge as a cache, but we now partition it into primary and secondary caches. The reason for this distinction is that, when approximation is allowed, it is no longer necessary to flush updates for all keys. An online algorithm must determine not just when to flush each key, but also which keys to flush. The distinction between the two caches serves to answer the latter question: updates are flushed only from the primary cache. It is the role of the cache partitioning policy to define the boundary between the primary and secondary cache, and the main difference between our error-bound and staleness-bound online algorithms lies in the partitioning policy. As we discuss throughout this section, our error-bound algorithm defines this boundary based on the values of items in the cache, while the staleness-bound algorithm uses a dynamic sizing policy to determine the size of the primary cache.

In addition, the primary cache logically serves as the outgoing network queue: updates are flushed from this cache when network capacity is available. Unlike FIFO queuing, this ensures that our online algorithms make the most effective possible use of network resources. In particular, it ensures that flushed updates always reflect the most up-to-date aggregate value for each key. Additionally, it allows us to use our eviction policies to choose the most valuable key to occupy the network at any time.
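The following minimal sketch (hypothetical class and method names, not the paper's Storm implementation) shows the shape of the two-level cache: arrivals are merged into per-key partial aggregates, a partitioning policy decides which keys currently belong to the primary cache, and the network pulls flushes only from the primary cache via an eviction policy:

```python
class TwoLevelCache:
    """Edge-side cache of partial aggregates, split into primary and secondary parts."""

    def __init__(self, op, in_primary, pick_victim):
        self.op = op                      # associative operator ⊕
        self.aggregates = {}              # key -> partial aggregate
        self.in_primary = in_primary      # partitioning policy: (key, value) -> bool
        self.pick_victim = pick_victim    # eviction policy over primary keys (e.g., LRU)

    def on_arrival(self, key, value):
        # Merge the new value into the cached partial aggregate for this key.
        self.aggregates[key] = (
            self.op(self.aggregates[key], value) if key in self.aggregates else value
        )

    def primary_keys(self):
        return [k for k in self.aggregates if self.in_primary(k, self.aggregates[k])]

    def on_network_ready(self):
        # The primary cache doubles as the outgoing queue: flush one update when the
        # WAN has capacity, always sending the current (most up-to-date) value.
        candidates = self.primary_keys()
        if not candidates:
            return None
        victim = self.pick_victim(candidates)
        return victim, self.aggregates.pop(victim)
```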

4.2 Error-Bound Algorithms

Cache Partitioning Policy

Our online error-bound algorithm uses a value-based cache partitioning policy, which defines the boundary between primary and secondary caches in terms of aggregate values. It emulates the offline optimal EPE algorithm (Section 3.1), which only flushes keys whose prefix error is within the error bound. Specifically, new arrivals are first added to the secondary cache, and they are promoted into the primary cache only when their aggregate grows to exceed the error constraint. More rigorously, let Fk denote the total aggregate value flushed for key k so far during the current window (logically zero if no update has been flushed), and let Vk denote the aggregate value currently maintained in the cache for key k. Then the accumulated error for key k is defined as error(Fk, Fk ⊕ Vk); i.e., the error between the value that the center currently knows and the value it would see if key k were flushed.10 Key k is moved from the secondary cache to the primary cache when its accumulated error exceeds E. Given this policy, and the fact that updates are only flushed from the primary cache, we are guaranteed to flush only keys that strictly must be flushed in order to satisfy the error constraint; i.e., this approach avoids false positives.

After the end of the window, our online algorithm flushes the entire contents of the primary cache, as all of its constituent keys exhibit sufficient accumulated error to violate the error constraint if not flushed.11

10 Note that we can compute this value online.
11 The order of these flushes is unimportant, as all of them are required to bring error below E.
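A sketch of this value-based partitioning policy (hypothetical names, assuming the accumulated error error(Fk, Fk ⊕ Vk) defined above), which could be plugged into the TwoLevelCache sketch as its in_primary policy:

```python
def make_error_bound_partition(op, error_fn, flushed_so_far, E, zero=0.0):
    """Partitioning policy for the error-bound algorithm: a key belongs to the
    primary cache once its accumulated error error(Fk, Fk ⊕ Vk) exceeds E."""
    def in_primary(key, cached_value):
        # Fk: total value already flushed for this key (logical zero if none).
        Fk = flushed_so_far.get(key, zero)
        # Compare what the center currently knows (Fk) with what it would see
        # if this key were flushed now (Fk ⊕ Vk).
        return error_fn(Fk, op(Fk, cached_value)) > E
    return in_primary
```

In this sketch, keys not yet promoted remain implicitly in the secondary cache, and flushed_so_far would be updated to Fk ⊕ Vk whenever a key is flushed.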


Cache Eviction Policy

Prior to the end of the window, some prediction is involved: when the network pulls an update from the primary cache, the eviction policy must identify a key that has reached an aggregate value within E of its final value. We explore several alternatives [23] and find that well known cache replacement policies, in particular least-recently-used (LRU), work well. Intuitively, LRU assumes that the key with the least recent arrival is also the least likely to receive future arrivals, and hence closest to its final value. Therefore LRU evicts the key with the least recent arrival.

4.3 Staleness-Bound Algorithms

An online algorithm that aims to minimize error under a staleness constraint faces slightly different challenges. In particular, we can no longer define the boundary between primary and secondary caches by value, as we do not know a priori what this value should be. Instead, our staleness-bound online algorithm uses a dynamic sizing-based cache partitioning policy to emulate the offline optimal SPEF-EO algorithm (Section 3.2). To understand this approach, recall from Definition 3 that the SPEF-EO algorithm flushes only the keys with the largest initial prefix errors. Given that the role of the primary cache is to contain the keys that must be flushed, we emulate this behavior by 1) dynamically ranking cached keys by their (estimated) initial prefix error, and then 2) defining the primary cache at time t to comprise the top σ(t) keys in this order. In practice, this is challenging, as we must predict both initial prefix errors as well as an appropriate sizing function σ(t).

Initial Prefix Error Prediction

During the window, there are many ways to predict initial prefix errors. One straightforward approach is to use accumulated error (Section 4.2) as a proxy for initial prefix error. We call this the Acc policy for short. Intuitively, this policy assumes that the keys that have accumulated the most error so far also have the largest initial prefix errors, and therefore ranking keys by accumulated error is a good approximation for ranking them by initial prefix error. We find this policy performs better than more sophisticated policies, such as those that try to predict initial prefix error explicitly [23].

Cache Size Prediction

Predicting the appropriate primary cache size function σ(t) raises several additional challenges. To begin, consider how to define this function in an ideal world where we have a perfect eviction policy and a perfect prediction of initial prefix error. At time t, the primary cache size σ(t) represents the number of keys that are present in the primary cache. These σ(t) keys will need to be flushed prior to the staleness bound, as will any not-yet-arrived keys with sufficiently high initial prefix error. Let f(t) denote the number of future arrivals of these large-prefix-error keys. Then, if the total number of network slots remaining until the staleness constraint is given by B(t), an ideal primary cache size function σ(t) satisfies B(t) = σ(t) + f(t).

In practice, we have neither a perfect eviction policy nor a perfect prediction of initial prefix error, and determining f(t) is a nontrivial prediction on its own. We have explored a host of alternative approaches [23], and among these, we find that a policy based on the previous window's arrival sequence works well. This policy, referred to as PrevWindow, assumes that arrivals in the current window will resemble those in the previous window, and it therefore uses the previous window's arrival sequence to compute f(t). Using an exponentially weighted moving average of the available WAN bandwidth to predict B(t), this f(t) is then used to compute the appropriate primary cache size function σ(t).
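A sketch of this sizing computation under the PrevWindow policy (hypothetical names and parameters, assuming an EWMA bandwidth estimate and the relation σ(t) = B(t) − f(t) from above):

```python
def primary_cache_size(t, T, W, S, bw_ewma, prev_window_arrivals, error_threshold):
    """Estimate sigma(t) = B(t) - f(t).

    bw_ewma: EWMA estimate of available WAN bandwidth, in updates per second
    prev_window_arrivals: list of (offset_in_window, key, initial_prefix_error)
        from the previous window, used as a proxy for the current one
    error_threshold: only keys with initial prefix error above this are assumed
        to need a flush (the "large-prefix-error" keys)
    """
    # B(t): network slots remaining until the staleness deadline T + W + S.
    remaining_slots = max(0.0, T + W + S - t) * bw_ewma
    # f(t): predicted future arrivals of large-prefix-error keys, taken from the
    # previous window's arrivals at offsets later than the current offset.
    offset = t - T
    future_large_keys = sum(
        1 for (dt, _key, e0) in prev_window_arrivals
        if dt > offset and e0 > error_threshold
    )
    return max(0, int(remaining_slots) - future_large_keys)
```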

Cache Eviction Policy

During the window, we use the sizing-based cache partitioning policy to delineate the boundary between primary and secondary caches. When network capacity is available, it remains the role of the cache eviction policy to determine which key should be evicted from the primary cache. As in Section 4.2, we again find that simple, well known policies such as LRU perform well.

Upon reaching the end of the window, there is no need for prediction, since all final aggregate values are known. At this point, keys are evicted from the union of the primary and secondary caches in descending order by their accumulated error until reaching the staleness bound.

4.4 Implementation

We implement our algorithms in Apache Storm [3], extending our prototype for exact computation [22], which uses a distinct Storm cluster at each edge, as well as one at the center, in order to distribute the aggregation workload. Figure 2 shows the overall architecture. Here we briefly discuss each component in the order of data flow from edge to center.

[Figure 2: edge dataflow Logs -> Replayer -> Aggregators -> Reorderer -> SocketSender; center dataflow SocketReceiver -> Reorderer -> Aggregators -> StatsCollector.] Figure 2. Aggregation is distributed over Apache Storm clusters at each edge as well as at the center.

Edge. Data enters our prototype at the edge through the Replayer spout, which replays timestamped logs from a file. Each line is parsed using a query-specific parsing function to produce a (timestamp, key, value) triple. Our implementation leverages Twitter's Algebird12 library to support a broad set of value types and associated aggregations. The Replayer emits records according to their timestamp; i.e., event time and processing time [7] are equivalent at the Replayer. The Replayer emits records downstream, and also periodically emits punctuation messages to indicate that no messages with earlier timestamps will be sent in the future.

12 https://github.com/twitter/algebird

The next step in the dataflow is the Aggregator, for which one or more tasks run at each cluster. The Aggregator defines window boundaries in terms of record timestamps, and maintains the two-level cache from Section 4, with each task aggregating a hash-partitioned subset of the key space. We implement the two-part cache using efficient priority queues, allowing us to generalize over a range of cache partitioning and eviction policies in order to implement our online algorithms.

The Aggregator tasks send their output to a single instance of the Reorderer bolt, which is responsible for delaying records as necessary in order to maintain punctuation semantics. Data then flows into the SocketSender bolt, which transmits (partial) aggregates to the center using TCP sockets.

Center. At the center, data flows in the reverse order. First, the SocketReceiver spout receives partial aggregates and punctuations and emits them downstream into a Reorderer, where the streams from multiple edges are synchronized. From there, records flow into the central Aggregator, each task of which is responsible for performing the final aggregation over a hash-partitioned subset of the key space. Upon completing aggregation for a window, these central Aggregator tasks emit summary metrics including error and staleness, and these metrics are summarized by the final StatsCollector bolt. Staleness is computed relative to the wall-clock (i.e., processing) time at which the window closes. Clocks are synchronized using NTP.13

13 Clock synchronization errors are small relative to the staleness ranges we explore in our experiments.

Our online algorithms treat the two-part cache as the outgoing network queue, but Storm does not implement flow control at the bolt level, and the Storm programming model only allows bolts to push tuples downstream. To resolve this apparent mismatch, we track queuing delays within both the SocketSender and the SocketReceiver. These delays are communicated upstream, and the edge Aggregators use them as congestion signals, applying an additive-increase/multiplicative-decrease (AIMD) approach to dynamically adjust the rate at which they push records downstream.
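A minimal sketch of this AIMD-style rate adaptation (hypothetical constants and names; the actual implementation derives its congestion signal from queuing delays measured in the SocketSender and SocketReceiver):

```python
class AimdRateController:
    """Additive-increase / multiplicative-decrease pacing of downstream emissions."""

    def __init__(self, initial_rate=100.0, increase=10.0, decrease=0.5,
                 min_rate=1.0, max_rate=100_000.0, delay_threshold_s=0.5):
        self.rate = initial_rate            # records per second pushed downstream
        self.increase = increase            # additive step when queues are healthy
        self.decrease = decrease            # multiplicative factor on congestion
        self.min_rate, self.max_rate = min_rate, max_rate
        self.delay_threshold_s = delay_threshold_s

    def on_delay_report(self, queuing_delay_s: float) -> float:
        # Treat a large reported queuing delay as a congestion signal.
        if queuing_delay_s > self.delay_threshold_s:
            self.rate = max(self.min_rate, self.rate * self.decrease)
        else:
            self.rate = min(self.max_rate, self.rate + self.increase)
        return self.rate
```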

5. Evaluation

In this section, we evaluate our online algorithms using both trace-driven simulation as well as experiments using our Storm-based implementation on PlanetLab [1] and local testbeds.

5.1 Dataset and Queries

Our simulations and experiments are driven by an anonymized workload trace obtained from a real-world analytics service14 by Akamai, a large commercial content delivery network. This analytics service is used by content providers to track important metrics about who has downloaded content, where these clients were located, what performance they experienced, how many downloads completed successfully, etc. The data source is a software called Download Manager, which is installed on mobile devices, laptops, and desktops of millions of users around the world, and used to download software updates, security patches, music, games, and other content. The Download Managers installed on users' devices around the world send information about the downloads to the widely-deployed Akamai edge servers using "beacons".15 Each download results in one or more beacons being sent to an edge server, and these beacons contain information about the time the download was initiated, url, content size, number of bytes downloaded, user's ip, user's network, user's geography, server's network, and server's geography. We use the anonymized beacon logs from Akamai's download analytics service for the month of December 2010. Note that, for confidentiality reasons, we normalize derived values from the data set, such as time durations and error values.

In our evaluation, we focus on windowed grouped aggregation for a query that groups its input records by content provider id, client country code, and url. Because this last dimension (url) can take on hundreds of thousands of distinct values, we call this a "large" query. Our previous work on exact windowed grouped aggregation [22] also considers "smaller" queries; we focus here on the largest queries as they are especially challenging under bandwidth constraints.

We consider three diverse aggregations: Sum, Max, and HyperLogLog [18]. Each of these aggregations uses the large query key. The Sum aggregation computes the total number of bytes successfully downloaded for each key, while the Max aggregation computes the largest successful download size for each key. The HyperLogLog aggregation is used to approximate the number of unique client IP addresses for each key. This demonstrates an important point: our techniques apply to a broad range of aggregations, from distributive and algebraic [20] aggregates such as Sum, Max, or Average to holistic aggregates such as unique count via HyperLogLog.16

14 http://www.akamai.com/dl/feature_sheets/Akamai_Download_Analytics.pdf
15 A beacon is simply an http GET issued by the Download Manager for a small GIF containing the reported values in its url query string.
16 We adopt the common approach [15] of approximating holistic aggregates (e.g., unique count, Top-K, heavy hitters) using sketch data structures (e.g., HyperLogLog, CountMinSketch [14]).


5.2 Baseline Algorithms

We compare the following aggregation algorithms.

Streaming: A streaming algorithm performs no aggregation at the edge, and instead simply flushes each key immediately upon arrival; that is, it streams arrivals directly on to the center. An error-bound variant continues streaming arrivals until the error constraint E is satisfied, while a staleness-bound variant continues streaming until reaching the staleness limit S.

Batching: A batching algorithm aggregates arrivals until the end of the window without sending any updates. At the end of the window at time T+W, an error-bound variant flushes all keys for which omitting an update would exceed the error bound; i.e., all keys with initial prefix error greater than E. A staleness-bound variant begins flushing keys at the end of the window, and does so in descending order of initial prefix error, stopping upon reaching the staleness limit S.

Batching with Random Early Updates (Random): The Random algorithm effectively combines batching with streaming using random sampling. Concretely, the Random algorithm aggregates arrivals at the edge as batching does, but it sends updates for random keys during the window whenever network capacity is available. When the end of the window arrives, it flushes keys as batching does, in decreasing order by their impact on error. This algorithm is a useful baseline to show that the choice of which key to flush during the window is critical.

Optimal: To provide an optimal baseline for evaluating our online algorithms, we implement the EPE and SPEF-EO algorithms from Section 3 for the error-bound and staleness-bound optimizations, respectively.

Caching: These refer to our caching-based online algorithms presented in Section 4. For the error-bound case, we use the emulated EPE algorithm, consisting of the two-part caching approach with an LRU primary cache eviction policy, as discussed in Section 4.2. For the staleness-bound case, we employ the emulated SPEF-EO algorithm discussed in Section 4.3, consisting of LRU eviction, Accumulated Error-based estimation, and PrevWindow cache sizing policies.

5.3 Simulation Results

Before presenting experimental results on our PlanetLab deployment, we use simulations to compare our online algorithms to a number of baseline algorithms. Simulation allows us to more rapidly explore a large design space, to compare against an offline optimal algorithm, and to select a practical baseline approach to use in our experiments later.

We implement a simple discrete event simulator in Python. Our simulation is deliberately simplified, as the focus here is not on understanding performance in absolute terms (we rely on the experiments in Section 5.4 for that) but rather to compare the tradeoffs between different algorithms. Note that error for a schedule of flushes can be computed directly from its definition; it is only the staleness that we simulate. To compute staleness, we model the wide-area network as a queue with deterministic service times based on a constant network bandwidth. Arrival times to this queue correspond to the times at which updates are flushed. An aggregation algorithm is then simulated by scheduling the updates based on the algorithm's schedule. All simulation results present median values (staleness or error) over 25 consecutive time windows from the Akamai trace, spanning multiple days' worth of workload.
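As a minimal sketch of this staleness computation (hypothetical names; a FIFO queue with deterministic service time 1/b, not the paper's actual simulator):

```python
def simulate_staleness(flush_times, window_end, bandwidth, network_free_at=None):
    """Staleness = time the last update finishes transmission minus the window end.

    flush_times: times at which the algorithm hands updates to the network
    bandwidth: constant WAN bandwidth in updates per second (service time 1/b)
    network_free_at: time the network finishes the previous window's updates
    """
    service_time = 1.0 / bandwidth
    free_at = network_free_at if network_free_at is not None else min(flush_times, default=window_end)
    for t in sorted(flush_times):
        start = max(t, free_at)          # wait for the FIFO queue to drain
        free_at = start + service_time   # deterministic transmission time
    return max(0.0, free_at - window_end)

# Example: 3 updates flushed near the window end over a slow link (2 s per update).
print(simulate_staleness([9.0, 9.5, 10.0], window_end=10.0, bandwidth=0.5))  # 5.0
```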

5.3.1 Minimizing Staleness (Error-Bound)

We first compare our caching-based online algorithm against the baseline algorithms for the error-bound optimization, whose goal is to minimize staleness for a given error constraint. Figure 3 shows staleness (normalized relative to the window length) for a range of error bounds E (normalized relative to the largest possible error for our trace) for our online algorithm, as well as for our three baseline algorithms and an optimal offline algorithm. It is clear that streaming is infeasible, as staleness far exceeds the window length (shown as the dotted horizontal line). By performing aggregation at the edge, batching significantly improves upon streaming, but it still yields high staleness because it delays all flushes until the end of the window. Random improves only slightly over batching, showing the limitation of our random sampling approach. Our caching-based online algorithm, on the other hand, chooses which keys to flush during the window in a principled manner, and yields staleness much closer to an offline optimal algorithm as a result.

[Figure 3 (plot): normalized staleness (log scale) vs. normalized error limit for Optimal, Batching, Random, Streaming, and Caching.] Figure 3. Normalized staleness for Sum aggregation with various error bounds. Note the logarithmic y-scale.

5.3.2 Minimizing Error (Staleness-Bound)

Figure 4 demonstrates the effectiveness of our approach compared to the other baselines for the case where we are minimizing the error given a staleness limit. The figure shows the normalized error values obtained for different staleness limits. We again see that streaming is impractical, as it fails to effectively use the limited WAN capacity to transmit the most important updates. Batching yields a significant improvement over streaming, as long as the tolerance for staleness is sufficiently high. Batching with random early updates represents a further improvement, though again, the benefit of these early updates is not dramatic. Our caching-based online algorithm, on the other hand, by making principled decisions about which keys to flush during the window, yields near-optimal error across a broad range of staleness limits.

[Figure 4 (plot): normalized error vs. normalized staleness limit for Optimal, Batching, Random, Streaming, and Caching.] Figure 4. Normalized error for Sum aggregation with various staleness bounds. Note the logarithmic axis scales.

Note that all algorithms except streaming converge to the same error as the staleness bound reaches one window length. The reason is that, as the staleness bound approaches the window length, there are fewer and fewer opportunities to flush updates during the window, and these algorithms converge toward pure batching.

5.4 Experimental Results

To demonstrate how our techniques perform in a real-world setting, we now present results from an experimental evaluation conducted on PlanetLab [1] using our Storm-based implementation. Throughout our experiments, we use two combinations of PlanetLab nodes at the University of Oregon as well as local nodes at the University of Minnesota. The Oregon site includes three nodes, each with 4GB of physical memory and four CPU cores. The Minnesota site includes up to eight nodes, each with 12GB of physical memory and eight CPU cores. WAN bandwidth from Minnesota to Oregon averages 12.5Mbps based on iperf3 measurements. For all of our experiments, we use the anonymized Akamai trace as input.

Staleness and error are measured at the center, and we treat the staleness constraint as a hard deadline: when the deadline is reached, we compute the error based on records that have arrived at the center up to that point. In general, there may be arrivals after the deadline due to varying WAN delays, but we disregard these late arrivals as they violate the staleness constraint.
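The following sketch illustrates this measurement for the Sum aggregation; the record layout and the use of a summed absolute per-key difference as the error metric are illustrative assumptions rather than the exact definitions used in our evaluation.

    from collections import defaultdict

    def error_at_deadline(records, deadline):
        # records: iterable of (key, value, arrival_time_at_center) tuples.
        # The exact aggregate uses every record in the window; the observed
        # aggregate uses only records that reached the center by the deadline.
        exact, observed = defaultdict(float), defaultdict(float)
        for key, value, arrival in records:
            exact[key] += value
            if arrival <= deadline:          # late arrivals are disregarded
                observed[key] += value
        return sum(abs(exact[k] - observed[k]) for k in exact)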

Based on our results from Section 5.3, we use Random as a practical baseline algorithm for comparison, since it outperforms both batching and streaming.

5.4.1 Geo-Distributed Aggregation

To demonstrate our techniques in a real geo-distributed setting, we begin with experiments using Minnesota nodes as the edge and Oregon nodes as the center. Note that we use local Minnesota nodes rather than PlanetLab nodes as the edge because PlanetLab imposes restrictive daily limits on the amount of data sent from each node, rendering repeated experiments with PlanetLab edge nodes impractical. (For example, each of the five repetitions of the HyperLogLog experiments was enough to exhaust the daily bandwidth limit.)

Figure 5(a) and Figure 5(b) show staleness and error for error-constrained and staleness-constrained algorithms, respectively. Both staleness and error are normalized relative to the Random baseline for ease of interpretation. We consider the median staleness or error during a relatively stable portion of the workload, and plot the mean over five runs. Error bars show 95% confidence intervals [17].
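Such intervals can be computed with a bootstrap over the per-run measurements. The sketch below uses a plain percentile bootstrap for brevity, whereas [17] describes the bias-corrected intervals we cite; the sample values are purely illustrative.

    import numpy as np

    def bootstrap_ci(per_run_values, n_resamples=10_000, alpha=0.05, seed=0):
        # Percentile bootstrap: resample the per-run values with replacement,
        # recompute the mean, and take the alpha/2 and 1 - alpha/2 percentiles.
        rng = np.random.default_rng(seed)
        values = np.asarray(per_run_values, dtype=float)
        means = rng.choice(values, size=(n_resamples, values.size)).mean(axis=1)
        return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    # Example with five illustrative per-run medians of normalized staleness.
    print(bootstrap_ci([0.21, 0.18, 0.24, 0.20, 0.22]))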

Figure 5 demonstrates that our Caching algorithms significantly outperform a practical random sampling/batching-based baseline in a real-world geo-distributed deployment. Staleness is reduced by 81.8% to 96.6% under an error constraint, while error is reduced by 83.4% to 99.1% under a staleness constraint. Note that the reduction is largest for Max aggregation, as many keys attain their final value well in advance of the end of the window, allowing greater opportunity for early evictions.

These results demonstrate that our techniques perform well in a real geo-distributed deployment over real data, and they do so for a diverse set of aggregations.

5.4.2 Dynamic Workload and WAN Bandwidth

To dig deeper, we assess how our algorithms perform under dynamic workload and WAN bandwidth using an alternative deployment with eight total Minnesota nodes: four as the edge and four as the center. We use tc to emulate constrained WAN bandwidth between the edge and center clusters. By changing the emulated WAN bandwidth at runtime, we can study how our algorithms adapt to dynamic WAN bandwidth. Further, our anonymized Akamai trace exhibits a significant shift in workload early in the trace, providing a natural opportunity to study performance under workload variations.
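A bandwidth emulation of this kind can be driven with standard tc token-bucket shaping. The sketch below is illustrative only: the interface name, rates, burst, and latency values are placeholders, not the settings used in our deployment.

    import subprocess

    def set_wan_rate(dev, rate_mbit, action="change"):
        # Shape egress traffic on `dev` with a token-bucket filter (tbf).
        # Use action="add" the first time; "change" adjusts the rate at runtime.
        subprocess.run(["tc", "qdisc", action, "dev", dev, "root", "tbf",
                        "rate", f"{rate_mbit}mbit",
                        "burst", "32kbit", "latency", "400ms"], check=True)

    set_wan_rate("eth0", 16.0, action="add")   # initial emulated WAN rate
    # ... later in the experiment, emulate a 62.5% bandwidth reduction ...
    set_wan_rate("eth0", 16.0 * 0.375)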

Figure 6 shows how our staleness-bound and error-bound algorithms behave under these workload and bandwidth changes for the Sum aggregation. Figure 6(a) plots staleness normalized relative to the window length, while Figure 6(b) plots error normalized relative to the largest possible error.



Figure 5. Comparison of Random and Caching algorithms for various aggregation functions (Sum, Max, HyperLogLog) on a PlanetLab testbed: (a) normalized staleness under an error constraint; (b) normalized error under a staleness constraint.

Figure 6. Comparison of Random and Caching algorithms for Sum aggregation under dynamic workload and WAN bandwidth on a local testbed: (a) normalized staleness under an error constraint; (b) normalized error under a staleness constraint, both plotted per window. The arrival rate increases during window 6 (dashed vertical line), and available bandwidth decreases during window 20 (dotted vertical line).

For both figures, we run the experiment five times, and plot the median staleness or error for each window.

We see that both the Random and Caching algorithms require some time to adapt to the workload change that occurs during window 6. Once stabilized, Caching yields significantly lower staleness and error than Random. During window 20, we reduce WAN bandwidth by 62.5%. While staleness and error increase for both algorithms, our Caching algorithms continue to significantly outperform the Random baseline.

5.4.3 Various Error and Staleness Constraints

Using the eight-node Minnesota deployment, we further study the performance of our algorithms under various error and staleness constraints. Figure 7 shows staleness (normalized relative to the window length) and error (normalized relative to the largest possible error) for the Sum aggregation. As in Figure 5, we plot 95% confidence intervals over five runs.

Consistent with our simulation results in Section 5.3, higher error (resp., staleness) constraints lead to lower staleness (resp., error), demonstrating the tradeoff between staleness and error in a real deployment. Our Caching algorithms significantly outperform Random across a range of error and staleness constraints.

Overall, our experiments demonstrate that our Caching algorithms apply across a diverse set of aggregations, and that they outperform the Random baseline in real geo-distributed deployments, under dynamic workloads and bandwidth conditions, and across a range of error and staleness constraints.

6. Related Work

Numerous streaming systems [2, 6, 11, 12, 26, 31, 36] have been proposed in recent years. These systems provide many useful ideas for new analytics systems to build upon, but they do not fully explore the challenges described in this paper, in particular how to strike a near-optimal balance between timeliness and accuracy.

Google recently proposed the Dataflow model [7] as a unified abstraction for computing over both bounded and unbounded datasets. Our work fits within this model, but we focus in particular on aggregation over tumbling windows on unbounded datasets.

Figure 7. Comparison of Random and Caching algorithms for Sum aggregation with various error and staleness constraints on a local testbed: (a) normalized staleness under various error constraints; (b) normalized error under various staleness constraints.

Wide-area computing has received increased research attention in recent years, due in part to the widening gap between data processing and communication costs. Much of this attention has been paid to batch computing [30, 33, 34]. Relatively little work on streaming computation has focused on wide-area deployments, or associated questions such as where to place computation. Pietzuch et al. [29] optimize operator placement in geo-distributed settings to balance between system-level bandwidth usage and latency. Hwang et al. [25] rely on replication across the wide area in order to achieve fault tolerance and reduce straggler effects.

JetStream [32] considers wide-area streaming computation, and like our work, addresses the tension between timeliness and accuracy. Unlike our work, however, it focuses at a higher level on the appropriate abstractions for navigating this tradeoff. Meanwhile, BlinkDB [5] provides mechanisms to trade accuracy and response time, though it does not focus on processing streaming data. Our focus, on the other hand, is on concrete algorithms to approach an optimal tradeoff between timeliness and accuracy in streaming settings.

Das et al. [16] consider tradeoffs between throughput and latency in Spark Streaming, but they focus on exact computation and consider only a uniform batching interval for the entire stream, while we consider approximate computation and address scheduling on a per-key basis.

Aggregation is a key operator in analytics, and grouped aggregation is supported by many data-parallel programming models [10, 20, 35]. Larson et al. [27] explore the benefits of performing partial aggregation prior to a join operation, much as we do prior to network transmission. While they also recognize similarities to caching, they consider only a simple fixed-size cache, whereas our approach uses a novel two-part cache to determine both whether and when to transfer partial aggregates. Amur et al. [8] study grouped aggregation, focusing on the design and implementation of efficient data structures for batch and streaming computation, though they do not consider staleness, a key performance metric in our work.

Our work bears some resemblance to anytime algorithms (e.g., [9, 37]), which improve their solutions over time and can be interrupted at any point to deliver the current best solution. While our staleness-bound algorithm is essentially an anytime algorithm, terminating when the staleness limit is reached, our error-bound algorithm terminates only when the error constraint is satisfied.

Our techniques complement other approaches for approximation in stream computing. For example, our work applies to approximate value types such as the Count-min Sketch [14] or HyperLogLog [18], while recent work in distributed function monitoring [13, 19] could serve as a starting point for coordination between multiple edges.
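As a concrete illustration of the former point, edge-side partial aggregates can themselves be mergeable sketches. The snippet below uses the third-party datasketch library (our choice for illustration, not the library used in our prototype) to merge HyperLogLog partials from two edges at the center.

    from datasketch import HyperLogLog  # third-party library, used for illustration

    # Each edge keeps a HyperLogLog per group as its partial aggregate...
    edge_a, edge_b = HyperLogLog(), HyperLogLog()
    for item in (b"user-1", b"user-2", b"user-3"):
        edge_a.update(item)
    for item in (b"user-3", b"user-4"):
        edge_b.update(item)

    # ...and the center merges whichever partials arrive before the deadline.
    center = HyperLogLog()
    center.merge(edge_a)
    center.merge(edge_b)
    print(f"approximate distinct items: {center.count():.0f}")  # roughly 4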

7. Conclusion

In this paper, we considered the problem of streaming analytics in a geo-distributed environment. Due to WAN bandwidth constraints, applications must often sacrifice either timeliness by allowing stale results, or accuracy by allowing some error in final results. We focused on windowed grouped aggregation, an important and widely used primitive in streaming analytics, and studied the tradeoff between the key metrics of staleness and error. We presented optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we presented practical online algorithms for effectively trading off timeliness and accuracy in the face of bandwidth limitations. Using a workload derived from a web analytics service offered by a large commercial CDN, we demonstrated the effectiveness of our techniques through both trace-driven simulation as well as experiments using an Apache Storm-based prototype implementation deployed on PlanetLab. Our results showed that our proposed algorithms outperform practical baseline algorithms for a range of error and staleness bounds, for a variety of aggregation functions, and under varied network bandwidth and workload conditions.

Acknowledgments

The authors would like to acknowledge NSF Grant CNS-1413998 and an IBM Faculty Award, which supported this research.


References

[1] PlanetLab. http://planet-lab.org/, 2015.

[2] Apache Flink: Scalable Batch and Stream Data Processing. http://flink.apache.org/, 2016.

[3] Apache Storm. http://storm.apache.org/, 2016.

[4] M. Adler, R. K. Sitaraman, and H. Venkataramani. Algorithms for optimizing the bandwidth cost of content delivery. Computer Networks, 55(18):4007–4020, Dec. 2011.

[5] S. Agarwal et al. BlinkDB: Queries with bounded errors and bounded response times on very large data. In Proc. of EuroSys, pages 29–42, 2013.

[6] T. Akidau et al. MillWheel: Fault-tolerant stream processing at internet scale. Proc. of the VLDB Endowment, 6(11):1033–1044, Aug. 2013.

[7] T. Akidau et al. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. of the VLDB Endowment, 8:1792–1803, 2015.

[8] H. Amur, W. Richter, D. G. Andersen, M. Kaminsky, K. Schwan, A. Balachandran, and E. Zawadzki. Memory-efficient groupby-aggregate using compressed buffer trees. In Proc. of SoCC, pages 18:1–18:16, 2013.

[9] M. Boddy. Anytime problem solving using dynamic programming. In Proc. of AAAI, pages 738–743, 1991.

[10] O. Boykin, S. Ritchie, I. O'Connel, and J. Lin. Summingbird: A framework for integrating batch and online MapReduce computations. In Proc. of VLDB, volume 7, pages 1441–1451, 2014.

[11] S. Chandrasekaran et al. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proc. of CIDR, 2003.

[12] G. J. Chen, J. L. Wiener, S. Iyer, A. Jaiswal, R. Lei, N. Simha, W. Wang, K. Wilfong, T. Williamson, and S. Yilmaz. Realtime data processing at Facebook. In Proc. of SIGMOD, pages 1087–1098, 2016.

[13] G. Cormode. Continuous distributed monitoring: A short survey. In Proc. of AlMoDEP, pages 1–10, 2011.

[14] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 55(1):58–75, Apr. 2005.

[15] G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava. Holistic UDAFs at streaming speeds. In Proc. of SIGMOD, pages 35–46, 2004.

[16] T. Das, Y. Zhong, I. Stoica, and S. Shenker. Adaptive stream processing using dynamic batch sizing. In Proc. of SoCC, pages 16:1–16:13, 2014.

[17] B. Efron. Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397):171–185, 1987.

[18] P. Flajolet, E. Fusy, O. Gandouet, et al. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Proc. of AOFA, 2007.

[19] N. Giatrakos, A. Deligiannakis, and M. Garofalakis. Scalable approximate query tracking over highly distributed data streams. In Proc. of SIGMOD, pages 1497–1512, 2016.

[20] J. Gray et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov., 1(1):29–53, Jan. 1997.

[21] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: Research problems in data center networks. SIGCOMM Comput. Commun. Rev., 39(1):68–73, Dec. 2008.

[22] B. Heintz, A. Chandra, and R. K. Sitaraman. Optimizing grouped aggregation in geo-distributed streaming analytics. In Proc. of HPDC, pages 133–144, 2015.

[23] B. Heintz, A. Chandra, and R. K. Sitaraman. Trading timeliness and accuracy in geo-distributed streaming analytics. Technical Report 16-003, Department of Computer Science, University of Minnesota, 2016.

[24] C.-C. Hung, L. Golubchik, and M. Yu. Scheduling jobs across geo-distributed datacenters. In Proc. of SoCC, pages 111–124, 2015.

[25] J.-H. Hwang, U. Cetintemel, and S. Zdonik. Fast and highly-available stream processing over wide area networks. In Proc. of ICDE, pages 804–813, 2008.

[26] S. Kulkarni et al. Twitter Heron: Stream processing at scale. In Proc. of SIGMOD, pages 239–250, 2015.

[27] P.-A. Larson. Data reduction by partial preaggregation. In Proc. of ICDE, pages 706–715, 2002.

[28] E. Nygren, R. K. Sitaraman, and J. Sun. The Akamai network: A platform for high-performance internet applications. SIGOPS Oper. Syst. Rev., 44(3):2–19, Aug. 2010.

[29] P. Pietzuch, J. Ledlie, J. Shneidman, M. Roussopoulos, M. Welsh, and M. Seltzer. Network-aware operator placement for stream-processing systems. In Proc. of ICDE, 2006.

[30] Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica. Low latency geo-distributed data analytics. In Proc. of SIGCOMM, pages 421–434, 2015.

[31] Z. Qian et al. TimeStream: Reliable stream computation in the cloud. In Proc. of EuroSys, pages 1–14, 2013.

[32] A. Rabkin, M. Arye, S. Sen, V. S. Pai, and M. J. Freedman. Aggregation and degradation in JetStream: Streaming analytics in the wide area. In Proc. of NSDI, pages 275–288, 2014.

[33] A. Vulimiri, C. Curino, B. Godfrey, K. Karanasos, and G. Varghese. WANalytics: Analytics for a geo-distributed data-intensive world. In Proc. of CIDR, Jan. 2015.

[34] A. Vulimiri, C. Curino, P. B. Godfrey, T. Jungblut, J. Padhye, and G. Varghese. Global analytics in the face of bandwidth and regulatory constraints. In Proc. of NSDI, pages 323–336, May 2015.

[35] Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: Interfaces and implementations. In Proc. of SOSP, pages 247–260, 2009.

[36] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proc. of SOSP, pages 423–438, 2013.

[37] S. Zilberstein. Operational rationality through compilation of anytime algorithms. Technical report, 1993.
