
Composite Subset Measures

Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran

University of Wisconsin - Madison

Raghu Ramakrishnan

Yahoo! Research and University of Wisconsin – Madison

09.12.2006

Motivation

Consider this query: “For each year and each country, compute the ratio of the average personal incomes of the richest city and the poorest city. Then find the number of countries where this ratio decreased continuously between 1990 and 2000.”

Such a query is:
  Hard to write in SQL
  Hard to optimize and understand as an SQL query

Queries of this kind are increasingly common:
  They involve multi-step aggregation
  They must scale to very large datasets, which are often distributed

Contributions

A new framework for expressing such compositional aggregate queries
  The key contribution is how we look at the computation: as aggregation over related regions in “cube space”

An efficient evaluation framework based on sorted scans that takes multiple aggregation steps into account

Experimental results

Background

Computing “measures”
  Measures summarize some characteristic of data subsets (e.g., SUM, standard deviation, beta value of a portfolio)
  Approaches: GROUP BY, data cubes, Hancock, Sawzall

Cube space
  Partition the feature space using attribute values; domain hierarchies organize this space into nested collections of regions
  Regions: (2006, Korea), (2006/09, Seoul)
  Region sets: (Year, Country), (Month, City)

Composite Subset Measures

The measure of a cube region is computed by:
  Aggregating the data in the region directly (e.g., sales volume for each day), or
  Summarizing the measures of related regions, e.g.:
    The maximum of the daily volumes within a year
    The ratio of average personal incomes between the richest and poorest cities in a country
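As a rough illustration of the second case, the income-ratio measure can be sketched in Python as two steps: a direct aggregation per city, followed by a summary over the child regions of each (year, country). The record layout, the sample values, and the function names are assumptions made for this sketch and are not part of the original framework.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fact records: (year, country, city, personal_income)
records = [
    (1990, "A", "a1", 10.0), (1990, "A", "a1", 30.0),
    (1990, "A", "a2", 5.0),
    (1991, "A", "a1", 20.0),
    (1991, "A", "a2", 10.0),
]

def city_averages(rows):
    """Step 1: aggregate the raw data directly per (year, country, city) region."""
    groups = defaultdict(list)
    for year, country, city, income in rows:
        groups[(year, country, city)].append(income)
    return {region: mean(values) for region, values in groups.items()}

def income_ratio(city_avg):
    """Step 2: summarize the child regions (cities) of each (year, country)."""
    per_country = defaultdict(list)
    for (year, country, _city), avg in city_avg.items():
        per_country[(year, country)].append(avg)
    return {region: max(v) / min(v) for region, v in per_country.items()}

ratios = income_ratio(city_averages(records))
```

The second step never touches the raw records; it only reads the measure produced by the first step, which is the sense in which the measure is composite.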

What is “Related” in Cube Space

Focus on relationships that are commonly used and can be efficiently evaluated:
  Self
  Parent/Child, e.g., Year/Day
  Child/Parent, e.g., Day/Year
  Sibling, e.g., Today/Tomorrow

Examples (Network Analysis)

Data involved:
  A stream of records with IP packet information: Time (t), Source (U), Destination (D), Size (s)

Queries:
  For every minute, the number of outgoing packets from each given source IP
  For every hour, the maximum per-minute number of outgoing packets from a given source IP
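A minimal Python sketch of these two queries, assuming timestamps are integer seconds and using made-up sample packets. The hourly maximum is computed from the per-minute measure rather than from the raw records.

```python
from collections import Counter

# Hypothetical packets: (t in seconds, source U, destination D, size s)
packets = [
    (0, "10.0.0.1", "10.0.0.9", 100),
    (30, "10.0.0.1", "10.0.0.9", 200),
    (70, "10.0.0.1", "10.0.0.8", 50),
    (3600, "10.0.0.1", "10.0.0.9", 75),
    (10, "10.0.0.2", "10.0.0.9", 60),
]

# Query 1: packets per (minute, source IP)
per_minute = Counter((t // 60, src) for t, src, _dst, _size in packets)

# Query 2: per (hour, source IP), the maximum of the per-minute counts
hourly_max = {}
for (minute, src), n in per_minute.items():
    key = (minute // 60, src)
    hourly_max[key] = max(hourly_max.get(key, 0), n)
```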

Expression Algebra

Each measure entity is defined as a collection of region/value pairs
  All regions of a measure must belong to the same region set

Operators:
  Fact table: $D$
  Aggregation: $g_{G,\,agg}(T)$
  Selection: $\sigma_{cond}(T)$
  Match join: $S \bowtie_{cond,\,agg} T$
  Combine join: $\overline{\bowtie}_{f_c}(S, T_1, T_2, \ldots, T_n)$

Example: Aggregation

For every hour and every unique IP, compute the number of outgoing packets

$S_C = g_{(t:hour,\ U:IP),\ count(*)}(D)$
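The aggregation above can be sketched in Python as a count over regions. The fact-table layout (integer seconds and a source label) and the helper name `aggregate_count` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical fact table D: (t in seconds, source IP U)
D = [(10, "a"), (20, "a"), (4000, "a"), (30, "b")]

def aggregate_count(fact_table, region_of):
    """count(*) of the records that fall into each region."""
    return dict(Counter(region_of(row) for row in fact_table))

# Region set (t:hour, U:IP): map each record to its (hour, source) region
S_C = aggregate_count(D, lambda row: (row[0] // 3600, row[1]))
```

The result is a collection of region/value pairs, which is the form every later operator consumes.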

Example: Selection

For every hour, compute the sum of outgoing packets from those source IPs with at least five packets in that hour (high traffic count)

$S_S = g_{(t:hour),\ count(*)}(\sigma_{M \ge 5}(S_C))$
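A small Python sketch of selection followed by a roll-up to the hour level. The input values and helper names are hypothetical. Because the slide's formula shows count(*) while the prose speaks of a sum, the sketch computes both variants rather than picking one.

```python
from collections import defaultdict

# Hypothetical input measure S_C: (hour, source IP) -> packet count M
S_C = {(0, "a"): 7, (0, "b"): 5, (0, "c"): 2, (1, "a"): 6, (1, "b"): 1}

def select(measure, cond):
    """Keep the region/value pairs whose value satisfies cond."""
    return {region: m for region, m in measure.items() if cond(m)}

def aggregate_by_hour(measure, agg):
    """Roll (hour, ip) regions up to (hour) regions with the given aggregate."""
    groups = defaultdict(list)
    for (hour, _ip), m in measure.items():
        groups[hour].append(m)
    return {hour: agg(values) for hour, values in groups.items()}

high = select(S_C, lambda m: m >= 5)               # sources with at least five packets
high_source_count = aggregate_by_hour(high, len)   # count(*) of qualifying sources
high_packet_sum = aggregate_by_hour(high, sum)     # sum of their packet counts
```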


Example: Match

For each six-hour time window, compute the average of the high traffic count

$S_{avg} = S_S \bowtie_{(S_1.t \in [S_2.t,\ S_2.t + 6]),\ avg(S_2.M)} S_S$

($S_1$ and $S_2$ denote the two copies of $S_S$ being joined)
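A match join pairs each region of one measure with the regions of another that satisfy a condition, then aggregates the matched values. The Python sketch below shows the idea under stated assumptions: the window condition is one possible reading of the slide (the output hour lies within six hours after the matched hour), and the function names and sample values are invented for illustration.

```python
# Hypothetical input measure S_S: hour -> high traffic count M
S_S = {0: 2, 1: 4, 6: 6, 7: 8}

def match_join(S, T, cond, agg):
    """For each region of S, aggregate the values of the T regions matching it."""
    out = {}
    for s_region in S:
        matched = [m for t_region, m in T.items() if cond(s_region, t_region)]
        if matched:
            out[s_region] = agg(matched)
    return out

def in_window(s_t, t_t):
    # Assumed reading of the match condition: s_t lies in [t_t, t_t + 6]
    return t_t <= s_t <= t_t + 6

def average(values):
    return sum(values) / len(values)

S_avg = match_join(S_S, S_S, in_window, average)
```

This naive version compares every pair of regions; it is meant to show the semantics only, not an efficient evaluation strategy.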

Example: Combine

For each hour, compute the ratio between the six-hour average and the high traffic count

$S_{ratio} = \overline{\bowtie}_{S_{avg}.M \,/\, S_S.M}(S_S, S_{avg})$
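A combine join can be sketched as applying a function to the values that several measures hold for the same region. In this Python illustration the helper `combine_join`, its argument order, and the sample values are assumptions; regions missing from any input are simply skipped.

```python
# Hypothetical input measures over the same region set (t:hour)
S_S = {0: 2, 1: 4, 2: 8}
S_avg = {0: 2.0, 1: 3.0, 2: 6.0}

def combine_join(fc, *measures):
    """Apply fc to the values that all input measures hold for a region."""
    shared = set(measures[0]).intersection(*measures[1:])
    return {region: fc(*(m[region] for m in measures)) for region in sorted(shared)}

# Ratio between the six-hour average and the high traffic count
S_ratio = combine_join(lambda avg, count: avg / count, S_avg, S_S)
```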

Aggregation Workflows

A diagrammatic way to express multiple composite subset measure expressions
  Semantically equivalent to the algebra

Rectangles: region sets
Ellipses: measures associated with the region sets
Arcs: computational dependencies among measures

Example

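[Figure: aggregation workflow over the region sets (t:hour, U:IP) and (t:hour), with the measure Count = count(*), the aggregation formula Avg(Count), and the match condition Count.t = Sbase.t. Callouts identify the region set, measure name, aggregation formula, selection condition, and match condition.]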

Example (cont.)

[Figure: aggregation workflow over the region set (t:hour, U:IP) with the measures MAXS = max(s), MINS = min(s), and Ratio = MAXS/MINS.]

Multi-step Execution Plan

Evaluation follows the topological order of the aggregation workflow:
  Materialize the measures that do not depend on other measures
  Then evaluate the dependent measures, following the arcs of the aggregation workflow
  A join may be needed

Problem
  Intermediate measures incur extra I/O
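The multi-step plan can be sketched with Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later). The two-measure workflow, the sample facts, and the function names below are hypothetical; each measure is fully materialized before any measure that depends on it is evaluated.

```python
from graphlib import TopologicalSorter

# Hypothetical facts: (hour, source IP)
facts = [(0, "a"), (0, "a"), (0, "b"), (1, "a")]

def count(_materialized):
    """Non-dependent measure: packets per (hour, source IP)."""
    out = {}
    for hour, src in facts:
        out[(hour, src)] = out.get((hour, src), 0) + 1
    return out

def max_count(materialized):
    """Dependent measure: per hour, the maximum of the Count measure."""
    out = {}
    for (hour, _src), n in materialized["Count"].items():
        out[hour] = max(out.get(hour, 0), n)
    return out

# measure name -> (measures it depends on, evaluation function)
workflow = {"Count": (set(), count), "MaxCount": ({"Count"}, max_count)}

graph = {name: deps for name, (deps, _fn) in workflow.items()}
materialized = {}
for name in TopologicalSorter(graph).static_order():
    materialized[name] = workflow[name][1](materialized)
```

Keeping every intermediate result in `materialized` mirrors the problem noted above: each intermediate measure is written out and read back by its dependents.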

Simple Scan Execution [*]

Build one hash table for each measure
“Insert” data into the hash tables of the low-level measures
Propagate the measures upwards after the scan is over
  (distributive or algebraic aggregation functions)

Problem
  Each hash table keeps all of its entries
  Bottleneck: memory capacity

[*] T. Johnson and D. Chatziantoniou, Extending complex ad-hoc OLAP, in CIKM, 1999, 170-179.
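A minimal Python sketch of the simple scan, assuming a count measure at the day level that is propagated to the month level by summation (count is distributive). The record layout and variable names are invented for this example.

```python
from collections import defaultdict

# Hypothetical records: (month, day, source IP)
records = [(1, 1, "a"), (1, 1, "a"), (1, 2, "a"), (2, 1, "b")]

# One hash table per measure
day_counts = defaultdict(int)     # low-level measure on (month, day, ip)
month_counts = defaultdict(int)   # higher-level measure on (month, ip)

# Single scan: insert raw data into the low-level hash table only
for month, day, ip in records:
    day_counts[(month, day, ip)] += 1

# After the scan is over, propagate the low-level measure upwards
for (month, _day, ip), n in day_counts.items():
    month_counts[(month, ip)] += n
```

Both tables hold all of their entries until the scan ends, which is why memory capacity becomes the bottleneck on large inputs.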

Sort/Scan Execution

Simple scan requires a large amount of memory
  Each hash table must keep all of its entries during the scan

When the data is ordered
  Some hash entries can be flushed before the scan is finished
  The memory footprint can be reduced
  A one-pass scan becomes feasible
  CPU cost is reduced
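The flushing idea can be sketched in Python as follows, assuming the input is already sorted by month and the measure is a count per (month, source IP). The function name and sample data are hypothetical; the sketch also tracks the peak hash-table size to make the memory effect visible.

```python
def sort_scan_count(sorted_records):
    """Count per (month, ip) over input sorted by month, flushing finished months."""
    results, table, peak = [], {}, 0
    current_month = None
    for month, ip in sorted_records:
        if month != current_month:
            # No later record can touch the previous month: flush its entries
            results.extend(table.items())
            table.clear()
            current_month = month
        table[(month, ip)] = table.get((month, ip), 0) + 1
        peak = max(peak, len(table))
    results.extend(table.items())   # flush the last month
    return results, peak

records = [(1, "a"), (1, "b"), (1, "a"), (2, "a"), (2, "c")]
results, peak = sort_scan_count(records)
```

Here the table never holds more than the entries of one month, whereas a simple scan over the same data would keep all four entries until the end.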

Evaluation

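[Figure: raw data sorted by day and scanned across month 1 and month 2, feeding hash tables for the region sets t:Day, (t:Day, U:IP), and (t:Month, U:IP), with count(*) measures COUNT0, COUNT2, and COUNT3.]

The output stream of each hash table is still ordered!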

Evaluation

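[Figure: the same hash tables for the region sets t:Day, (t:Day, U:IP), and (t:Month, U:IP), with count(*) measures COUNT0, COUNT2, and COUNT3, now fed by raw data sorted by month and scanned across month 1 and month 2.]

All the output streams are ordered by month!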

Evaluation

[Figure: raw data sorted by (t:month, U:IP) and scanned across month 1 and month 2, feeding the hash table for the measure COUNT3 = count(*) on the region set (t:Month, U:IP).]

By carefully choosing the sort order of the raw data, we can greatly reduce the memory footprint

Order and Slack

Order
  How the records are ordered in the stream
  E.g., <t:day, U:IP>

Slack
  The gap between the output stream of the measure and the scan progress of the raw data
  E.g., <t:day:[-3,+3]>

We have developed a mechanism to
  Calculate the order/slack
  Take advantage of the order/slack information during evaluation

Evaluation Network

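[Figure: evaluation network. Sorted data is scanned into a network of hash tables, each annotated with an order key and a slack: order key <t:Day, U:IP> with slack t:[-1,+1]; order key <t:Hour, U:IP> with slack t:[-1,+1]; order key <t:Hour, U:IP> with slack <>.]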

Optimization

How to find a good sort order?
  Enumerate all possible orders
  For each order, estimate the memory usage
  Use the sort order with minimal usage

Evaluation with multiple passes
  Which measures to compute during each pass?
  Which order to use in each pass?
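The sort-order search can be sketched in Python by enumerating permutations and scoring each one. This is a simplified stand-in for the estimation step: instead of an analytical estimate, it simulates the scan on a small hypothetical sample and assumes a hash table can be flushed only when the leading sort attribute changes and that attribute is part of the measure's region key. All names and data are invented for the sketch.

```python
from itertools import permutations

# Hypothetical raw data: 4 months x 2 source IPs
records = [{"month": m, "ip": ip} for m in (1, 2, 3, 4) for ip in ("a", "b")]

# Region sets of the measures to evaluate in one scan
measure_keys = [("month", "ip"), ("month",)]

def peak_entries(rows, order, key):
    """Peak number of hash entries for one measure under a given sort order."""
    rows = sorted(rows, key=lambda r: tuple(r[a] for a in order))
    lead = order[0]
    table, peak, last = set(), 0, object()
    for r in rows:
        # Flush when the leading sort attribute moves on, but only if that
        # attribute is part of the measure's region key.
        if lead in key and r[lead] != last:
            table.clear()
        last = r[lead]
        table.add(tuple(r[a] for a in key))
        peak = max(peak, len(table))
    return peak

def estimated_memory(order):
    """Total peak entries over all measures for one candidate sort order."""
    return sum(peak_entries(records, order, key) for key in measure_keys)

candidates = list(permutations(("month", "ip")))
best_order = min(candidates, key=estimated_memory)
```

Exhaustive enumeration grows factorially with the number of sort attributes, so a sketch like this is only practical for a handful of attributes.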

Experiments

Synthetic data set with 64 million records

Scenario 1
  The measures of a region are computed by combining the aggregated measures of different kinds of child region sets

Scenario 2
  The measures of a region are computed by aggregating the measures of multiple chained siblings

Experimental Results

[Figure: plot with x-axis “#dependent child measures” (2 to 6); series: SortScan.]

Experimental Results (cont.)

[Figure: plot with x-axis “Size of the Sibling Chain” (2 to 7); series: SortScan.]

Conclusions

Composite measures serve as building blocks for complicated analysis processes
The algebra provides the semantic foundation
The aggregation workflow offers an intuitive interface
The sort/scan execution plan evaluates multiple dependent measures in the same run and hence improves evaluation performance
