1
Making Every Bit Count in Wide Area Analytics
Ariel Rabkin
Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai
3
The Rise of Big Distributed Data
• CDNs:
– Akamai has ~20 million requests per second
– CloudFlare has about 300 MB/s of logs, volume doubles every 4 months
• Sensor data (e.g., power grid, highways)
• Smart camera networks
6
High-rate Events can be Costly
Every minute, compute request counts by URL
[Diagram: two streams of requests]
7
Backhaul has Bad Dynamics
Example: backhaul count of events every 5 minutes
Choice of summaries is made upfront, statically
• Buyer’s remorse: Chose to collect unnecessary and expensive data
• Analyst’s remorse: Summaries insufficient for analysis. No way to retroactively get more data
8
Local Storage!
Every minute, compute request counts by URL
[Diagram: two streams of requests, each feeding local aggregation and storage]
9
Challenge: Bandwidth Scarcity
User: I want the request count for every URL every second.
System: I can't do that, Ari. That costs 100 MB/sec. You only have 12 MB/sec. Want to impose a rank cutoff, value cutoff, or change frequency?
User: Can I get the top 1000 URLs every second?
System: I can do that for 900 KB/sec.
User: Great, do it!
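The cost figures in this exchange can be reproduced with simple arithmetic. The sketch below is an illustration only, not JetStream's cost model: the 900 bytes per record is back-derived from "top 1000 URLs for 900 KB/sec", the total of about 111,111 URLs is back-derived from the 100 MB/sec figure, and all function names are hypothetical.

```python
# Assumption: 900 KB/sec for 1000 URLs per second implies ~900 bytes per record.
BYTES_PER_RECORD = 900
# Assumption: 100 MB/sec at 900 bytes per record implies ~111,111 distinct URLs.
ALL_URLS = 111_111
BUDGET = 12_000_000  # 12 MB/sec, from the slide


def cost_bytes_per_sec(num_keys, bytes_per_record, period_sec):
    """Bandwidth needed to ship one record per key every period_sec seconds."""
    return num_keys * bytes_per_record / period_sec


def max_rank_cutoff(budget, bytes_per_record, period_sec):
    """Largest top-k that fits in the budget at the given reporting period."""
    return int(budget * period_sec // bytes_per_record)


def min_period_sec(num_keys, bytes_per_record, budget):
    """Shortest reporting period at which all keys fit in the budget."""
    return num_keys * bytes_per_record / budget


print(cost_bytes_per_sec(1000, BYTES_PER_RECORD, 1.0))   # 900000.0
print(max_rank_cutoff(BUDGET, BYTES_PER_RECORD, 1.0))    # 13333
print(min_period_sec(ALL_URLS, BYTES_PER_RECORD, BUDGET))
```

Under these assumptions, either a rank cutoff (fewer URLs) or a frequency change (a longer reporting period) brings the query under the 12 MB/sec budget.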
10
Challenge: Varying Scarcity
[Plot: bandwidth over time, with curves labeled "Needed" and "Available"; "? ? ? ? ? ? ?"]
User: First aggregate over longer time periods, up to 30 seconds. Then only keep the top URLs.
System: Can do.
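A two-stage policy like the one stated here can be sketched as a small function. This is a hypothetical illustration, not the system's actual policy language: it assumes cost scales inversely with the aggregation window (which ignores the extra distinct URLs a longer window may accumulate) and linearly with the number of URLs kept. The name `choose_degradation` and its parameters are invented for the example.

```python
import math


def choose_degradation(full_rate, available, num_urls, max_window_sec=30):
    """Pick (window_sec, top_k) for the available bandwidth.

    full_rate: bytes/sec needed to send every URL every second.
    available: bytes/sec currently available.
    Returns top_k=None when no rank cutoff is needed.
    """
    if available <= 0:
        raise ValueError("available bandwidth must be positive")
    # Stage 1: lengthen the aggregation window, up to max_window_sec.
    window = min(max_window_sec, max(1, math.ceil(full_rate / available)))
    if full_rate <= available * window:
        return window, None
    # Stage 2: the window alone is not enough, so keep only the top URLs.
    top_k = (num_urls * available * window) // full_rate
    return window, top_k


# Assumed figures: 100 MB/sec full rate, ~111,111 URLs.
print(choose_degradation(100_000_000, 200_000_000, 111_111))  # (1, None)
print(choose_degradation(100_000_000, 12_000_000, 111_111))   # (9, None)
print(choose_degradation(100_000_000, 1_000_000, 111_111))    # (30, 33333)
```

The ordering of the two stages is the policy: longer windows lose time resolution but keep every URL, so they are tried first, and the rank cutoff is applied only once the 30-second limit is reached.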
12
Data Processing Requirements
• Aggregatable: Stored Data += Update
• Merge-able: Data + Data = Merged Representation
• Reducible: Data to (smaller) Data
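For request counts, the three requirements can be illustrated with Python's standard `collections.Counter`. This is only a sketch using counts as the aggregate; the helper names `update`, `merge`, and `reduce_top_k` are hypothetical, and top-k is just one possible way to reduce data.

```python
from collections import Counter


def update(stored, url, n=1):
    """Aggregatable: fold a new observation into the stored data in place."""
    stored[url] += n


def merge(a, b):
    """Merge-able: combine two data sets into one merged representation."""
    return a + b


def reduce_top_k(data, k):
    """Reducible: shrink the data by keeping only the k largest counts."""
    return Counter(dict(data.most_common(k)))


site_a = Counter()
update(site_a, "x.com")
update(site_a, "x.com")
update(site_a, "y.com")
site_b = Counter({"y.com": 4, "z.com": 1})

merged = merge(site_a, site_b)
print(merged["y.com"])                # 5
print(dict(reduce_top_k(merged, 2)))  # {'y.com': 5, 'x.com': 2}
```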
13
                                     High-level API   Merge + Aggregate   Predictable performance   Arbitrary joins
Raw byte strings (e.g. MapReduce)    X                X                   √                         X
Database tables                      √                X                   X                         √
14
The Data Cube Model
Counts by URL       12:00   12:01   12:02
www.mysite.com      3       5       …
www.yoursite.com    5       4       …
www.hersite.com     8       12      …
Roll-up of mysite.com by time from 12:00 to 12:01: 8
Roll-up of sites at time 12:00: 16
Cube: A multidimensional array, with one or more aggregates, indexed by a set of dimensions
Aggregation function used for:
• Updates
• Roll-ups
• Merging cubes
• Degrading cubes
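The roll-ups on this slide can be reproduced with a minimal two-dimensional count cube. The sketch below is an assumption-laden illustration, not JetStream's implementation: the class `CountCube` and its methods are hypothetical, the aggregate is fixed to a sum, and cube degradation is omitted.

```python
from collections import defaultdict


class CountCube:
    """Counts indexed by two dimensions: (url, minute)."""

    def __init__(self):
        self.cells = defaultdict(int)

    def update(self, url, minute, n=1):
        # The aggregation function (sum) folds updates into a cell.
        self.cells[(url, minute)] += n

    def rollup_time(self, url, start, end):
        # Aggregate one URL over a time range; relies on zero-padded HH:MM strings.
        return sum(v for (u, t), v in self.cells.items()
                   if u == url and start <= t <= end)

    def rollup_urls(self, minute):
        # Aggregate all URLs at one time.
        return sum(v for (u, t), v in self.cells.items() if t == minute)

    def merge(self, other):
        # The same aggregation function merges two cubes cell by cell.
        out = CountCube()
        for cube in (self, other):
            for key, value in cube.cells.items():
                out.cells[key] += value
        return out


cube = CountCube()
for url, minute, n in [("www.mysite.com", "12:00", 3), ("www.mysite.com", "12:01", 5),
                       ("www.yoursite.com", "12:00", 5), ("www.yoursite.com", "12:01", 4),
                       ("www.hersite.com", "12:00", 8), ("www.hersite.com", "12:01", 12)]:
    cube.update(url, minute, n)

print(cube.rollup_time("www.mysite.com", "12:00", "12:01"))  # 8
print(cube.rollup_urls("12:00"))                             # 16
```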
15
                                     High-level API   Merge + Aggregate   Predictable performance   Arbitrary joins
Raw byte strings (e.g. MapReduce)    X                X                   √                         X
Database tables                      √                X                   X                         √
Data Cube                            √                √                   √                         X
16
A Vision for Wide-Area Analytics
[Diagram: dataflow operators and local cubes at each site, a network bottleneck, and a merged cube with its own dataflow operators]
Dataflow adapted to bandwidth
18
Adaptivity
[Diagram: dataflow operators, a local cube, and a summarized cube on either side of a network bottleneck, with feedback control]
• Key ingredients:
– Cube summarization as mechanism
– User-defined policies
– Feedback control
20
Conclusions
• The hard problems in wide-area analysis:
– Reasoning about bandwidth/data quality tradeoffs
– Optimizing data quality under changing conditions
– Jointly optimizing bandwidth and other resources
• We are building a system.
– We call it JetStream. Stay tuned….
24
2012 Bandwidth Price Shifts
[Chart labels: 20%, 20%; Frankfurt-London. Source: TeleGeography's Global Bandwidth Research Service]
25
Diurnal Load Makes Overprovisioning Expensive
• Leased lines waste capacity during off-peak
• Public internet gets congested during peak
29
Benefit: Iteration
Can iteratively pose different queries
[Diagram: two streams of requests, each feeding local aggregation and storage; labeled "A revised query"]
30
Benefit: Adaptation
Can adapt data volume collected to available bandwidth
[Diagram: two streams of requests, each feeding local aggregation and storage; labeled "Limited Bandwidth"]
31
Benefit: Adaptation
Can adapt data volume collected to available bandwidth
[Diagram: two streams of requests, each feeding local aggregation and storage; labeled "Ample Bandwidth"]
32
A dataflow model for wide-area analytics
Operator: Defines data transformation on tuples. Can do input or output.
Cube: Structured storage of data
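As a rough illustration of the operator and cube roles, the sketch below chains two operators (one doing input, one filtering) into a cube that stores counts. It is not the system's API: the log line format, the generator-based operators, and the names `parse_operator`, `filter_operator`, and `Cube` are all assumptions made for the example.

```python
from collections import defaultdict


def parse_operator(lines):
    """Input operator: turn assumed 'HH:MM url' log lines into (url, minute, count) tuples."""
    for line in lines:
        minute, url = line.split()
        yield (url, minute, 1)


def filter_operator(tuples, predicate):
    """Transformation operator: pass through only tuples matching the predicate."""
    return (t for t in tuples if predicate(t))


class Cube:
    """Structured storage: counts indexed by (url, minute)."""

    def __init__(self):
        self.cells = defaultdict(int)

    def insert(self, tup):
        url, minute, count = tup
        self.cells[(url, minute)] += count


log = ["12:00 a.com", "12:00 a.com", "12:01 b.com", "12:01 a.com"]
cube = Cube()
for tup in filter_operator(parse_operator(log), lambda t: t[0] == "a.com"):
    cube.insert(tup)

print(dict(cube.cells))  # {('a.com', '12:00'): 2, ('a.com', '12:01'): 1}
```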
33
[Diagram: at each of two sites, boxes labeled Source, Processing, and Cube; processed data crosses the network bottleneck]
Generated data ingested into local cubes