Aggregate Sharing for User-Defined Data Stream Windows

@CIKM16

Cutty: Aggregate Sharing for User-Defined Windows

Paris Carbone, Jonas Traub, Asterios Katsifodimos, Seif Haridi, Volker Markl

Presentation: Paris Carbone, PhD Candidate @ KTH Sweden, Committer @ Apache Flink

Upload: paris-carbone

Post on 16-Apr-2017



Page 2: 4 Reasons Not to Check Your Email During This Talk

1. Windows are the backbone of data stream analysis.

2. We generalise the concept of data stream windows.

3. Our technique makes aggregations on general stream windows more efficient than ever.

4. We can multiplex and share aggregations of diverse types of sliding windows that run simultaneously.

Page 3: Outline

• Partial Sliding Window Aggregation

• Fundamental Limitations of Existing Approaches

• Introducing User-Defined Windows (UDWs)

• Multi-Query Aggregation of UDWs with Cutty

• Performance Comparison

Page 4: Window Aggregation

[Figure: stream discretization. The function fd splits the record stream 1 2 3 ... 14 into the overlapping windows 1 2 3 4 5, 3 4 5 6 7 and 5 6 7 8 9.]

Page 5: Window Aggregation

[Figure: the function fa is applied to each window of the stream 1 2 3 ... 14 and yields one aggregate per window: A1 for 1 2 3 4 5, A2 for 3 4 5 6 7, A3 for 5 6 7 8 9.]

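Page 6: Partial Aggregation

Partial aggregation works in three steps, shown here for AVG:

1. lift: record -> (val, count), from the record type to the partial type

2. combine: (val1, count1), (val2, count2) -> (val1+val2, count1+count2), from partial type to partial type

3. lower: (val, count) -> val/count, from the partial type to the aggr type

Example - AVG over the window 1 2 3 4 5: the lifted partials (1,1), (2,1), (3,1), (4,1), (5,1) are combined one by one into (3,2), (6,3), (10,4) and (15,5); lowering (15,5) yields A1 = 3.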
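
The three steps can be sketched in a few lines of Python. This is only an illustration of the AVG example on this slide; the function names mirror the slide's terms, and the use of `functools.reduce` to drive the combine step is an assumption of the sketch, not part of the talk.

```python
from functools import reduce

def lift(record):
    """Turn a raw record into a partial aggregate (val, count)."""
    return (record, 1)

def combine(a, b):
    """Merge two partial aggregates into one."""
    return (a[0] + b[0], a[1] + b[1])

def lower(partial):
    """Turn the final partial aggregate into the AVG result."""
    val, count = partial
    return val / count

window = [1, 2, 3, 4, 5]
partial = reduce(combine, map(lift, window))  # combine is called 4 times
average = lower(partial)
print(partial, average)
```

For a window of n records, combine is invoked n - 1 times, which is the cost that the later slides try to reduce.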

Page 7: Partial Aggregation

• The number of combine invocations determines the computational complexity.

• Commutativity and associativity of combine are typically assumed.

Page 8: Redundancy in Sliding Window Aggregation

[Figure: the stream 1 2 3 ... 14 with the overlapping windows 1 2 3 4 5, 3 4 5 6 7, 5 6 7 8 9, …]

Overlapping windows mean redundant combine calls.

We need to optimise…

Page 9: Optimise…which windows?

[Figure: overview of window types: tumbling, single-type periodic, multi-type, Punctuation, Snapshot, FCF/CF, Lower-Bound, Session, ADWIN, Delta-based, FCA. Periodic windows are marked as covered by slicing (pre-compute non-overlapping partials).]

Page 10: Slicing

If a sliding window can be defined in terms of a fixed range and slide (periodic windows), the system can pre-aggregate consecutive, non-overlapping slices.

Example - Count Window range: 10, slide: 3

[Figure: the records 1 2 3 ... 19 cut into slices by three techniques.]

• Panes [1]: slices of length gcd(range, slide)

• Pairs [2]: two alternating slices, p2 = range % slide and p1 = slide - p2

• Cutty (preview)

1. No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams - SIGMOD 2005
2. On-the-Fly Sharing for Streamed Aggregation - SIGMOD 2006
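
As a small worked example, the slice lengths for the count window above (range 10, slide 3) can be computed directly from the two definitions. The helper names are made up for this sketch.

```python
from math import gcd

def panes_slice_length(window_range, slide):
    """Panes: every slice has length gcd(range, slide)."""
    return gcd(window_range, slide)

def pairs_slice_lengths(window_range, slide):
    """Pairs: two alternating slices, p2 = range % slide and p1 = slide - p2."""
    p2 = window_range % slide
    p1 = slide - p2
    return p1, p2

panes = panes_slice_length(10, 3)
pairs = pairs_slice_lengths(10, 3)
print(panes, pairs)
```

With range 10 and slide 3, Panes falls back to slices of a single record, while Pairs alternates between slices of two records and one record.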

Page 11: Slicing - Observations

• Computational Complexity -> upd: O(1), merge: O(#partials)

• Space Complexity (#stored sliced partials), with the values for the previous example (range: 10, slide: 3):

Panes: $\lceil \mathrm{range} / \gcd(\mathrm{range}, \mathrm{slide}) \rceil$ -> 10 partials

Pairs: $\lceil 2 \times \mathrm{range} / \mathrm{slide} \rceil$ -> 7 partials

• Similar space requirements when range is a multiple of slide

• Pairs has been extended for multi-query aggregation sharing
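
The two memory figures can be checked with a few lines of Python. The function names are hypothetical; the formulas are the ones stated on this slide.

```python
from math import gcd

def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def panes_partials(window_range, slide):
    """Stored partials for Panes: ceil(range / gcd(range, slide))."""
    return ceil_div(window_range, gcd(window_range, slide))

def pairs_partials(window_range, slide):
    """Stored partials for Pairs: ceil(2 * range / slide)."""
    return ceil_div(2 * window_range, slide)

panes_count = panes_partials(10, 3)
pairs_count = pairs_partials(10, 3)
print(panes_count, pairs_count)
```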

Page 12: Optimise…which windows?

[Figure: overview of window types: tumbling, single-type periodic, multi-type, Punctuation, Snapshot, FCF/CF, Lower-Bound, Session, ADWIN, Delta-based, FCA. Slicing pre-computes non-overlapping partials; for Non-Periodic windows, eager pre-aggregation pre-computes overlapping partials for arbitrary aggregation lookups.]

Page 13: Eager Pre-Aggregation

When windowing cannot be expressed simply by a range and slide: eagerly pre-compute partial aggregates and update a binary tree, bottom-up.

[Figure: binary aggregation tree over the records 1 2 3 4 5 6 7 8, using sums. Leaves: 1 2 3 4 5 6 7 8; next level: 3 7 11 15; next level: 10 26; root: 36. The n leaves correspond to the records and the inner nodes hold n pre-computed partials, 2n nodes in total. Arbitrary window lookups: log n.]
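
A minimal sketch of such a tree, assuming a fixed capacity and an array layout in which node `i` has children `2i` and `2i + 1`. It illustrates the idea on this slide only; it is not the FlatFAT or B-Int implementation, and the class and method names are invented for the example.

```python
class AggregateTree:
    """Array-backed binary tree of partial aggregates (illustrative only)."""

    def __init__(self, capacity, combine, identity):
        self.n = capacity
        self.combine = combine
        self.identity = identity
        # nodes[n .. 2n-1] are the leaves, nodes[1 .. n-1] the inner partials
        self.nodes = [identity] * (2 * capacity)

    def update(self, index, partial):
        """Set leaf `index` and refresh its ancestors bottom-up."""
        pos = index + self.n
        self.nodes[pos] = partial
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.combine(self.nodes[2 * pos], self.nodes[2 * pos + 1])
            pos //= 2

    def query(self, lo, hi):
        """Aggregate the leaves in the half-open index range [lo, hi)."""
        left, right = self.identity, self.identity
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                left = self.combine(left, self.nodes[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                right = self.combine(self.nodes[hi], right)
            lo //= 2
            hi //= 2
        return self.combine(left, right)


tree = AggregateTree(8, lambda a, b: a + b, 0)
for i, record in enumerate([1, 2, 3, 4, 5, 6, 7, 8]):
    tree.update(i, record)

print(tree.nodes[1])     # root of the tree
print(tree.query(2, 8))  # records 3..8
```

Both `update` and `query` touch a logarithmic number of nodes, at the price of storing roughly twice as many values as there are records.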

Page 14: Eager Pre-Aggregation - Observations

• Implementations: FlatFAT [1], B-Int [2]

• High Space Complexity (#raw records…twice)

• Most pre-aggregates are never used

• Update + Aggregation complexity: log(leaves)

• Generic and Suitable for Ad-Hoc Queries

• Potential for Multi-Query Window Pre-Aggregation

1. General incremental sliding-window aggregation - VLDB 15
2. Resource sharing in continuous sliding-window aggregates - VLDB 04

Page 15: Periodic and Non-Periodic Windows

[Figure: overview of window types: tumbling, single-type periodic, multi-type, Punctuation, Snapshot, FCF/CF, Lower-Bound, Session, ADWIN, Delta-based, FCA. Periodic windows: efficient slicing. Non-Periodic windows: generic, high-cost pre-aggregation.]

Page 16: Deterministic and Non-Deterministic Windows

[Figure: overview of window types: tumbling, single-type periodic, multi-type, Punctuation, Snapshot, FCF/CF, Lower-Bound, Session, ADWIN, Delta-based, FCA. The split is now between Deterministic and Non-Deterministic windows, with efficient slicing on one side and generic, high-cost pre-aggregation on the other.]

Page 17: Deterministic Windows: Intuition

[Figure: price [in USD] plotted over time [in min.], 0 to 35. Legend: Window, Window Begin, Threshold, Pre-Aggregate. Below the plot, the stream is cut into slices, which are combined into higher-order partials.]

Page 18: Deterministic Windows: Intuition

[Figure: price [in USD] plotted over time [in min.], 0 to 35. Legend: Window, Window Begin, Threshold, Pre-Aggregate. Below the plot, the stream is cut into slices, which are combined into higher-order partials.]

We only need to determine when new windows start.


Page 20: User-Defined Windows

Deterministic: Expressed as a UDF that assigns each record to a number of new or complete windows.

Trivial templating of existing window types

Non-Deterministic: Expressed as a UDF that assigns a record to complete windows and a reference to their beginning.
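
To make the deterministic case concrete, here is a sketch of a UDF for a periodic count window. The shape of the UDF (a callback that returns how many windows begin and how many complete at each record) and the rule that the oldest open window completes first are assumptions made for this example; they are not the UDW API that was contributed to Apache Flink.

```python
from collections import deque

def periodic_count_window(window_range, slide):
    """Hypothetical deterministic window UDF for a count window."""
    seen = 0

    def on_record(_record):
        nonlocal seen
        seen += 1
        # A new window begins every `slide` records, starting with the first.
        begins = 1 if (seen - 1) % slide == 0 else 0
        # A window completes once `window_range` records have passed since its begin.
        ends = 1 if seen >= window_range and (seen - window_range) % slide == 0 else 0
        return begins, ends

    return on_record

def materialise(records, on_record):
    """Turn the (begins, ends) answers into (first, last) record positions."""
    open_starts = deque()  # begin positions of open windows, oldest first
    windows = []
    for position, record in enumerate(records, start=1):
        begins, ends = on_record(record)
        open_starts.extend([position] * begins)
        for _ in range(ends):
            windows.append((open_starts.popleft(), position))
    return windows

windows = materialise(range(1, 10), periodic_count_window(5, 2))
print(windows)
```

With range 5 and slide 2, this reproduces the windows used in the earlier window aggregation slides: records 1 to 5, 3 to 7 and 5 to 9.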

Page 21: Cutty Concept

[Figure: price [in USD] plotted over time [in min.], 0 to 35. Legend: Window, Window Begin, Threshold, Pre-Aggregate. The slices below the plot are labelled Slicing and the higher-order partials built on top of them are labelled Eager Pre-Aggregation.]

Page 22: Cutty Overview

Exploits Deterministic Windows for the most efficient aggregation slicing yet.

Utilises eager pre-aggregation at a low memory cost over optimally sliced partials.

Supports both single and multi-query multiplexed execution out-of-the-box for efficient operator sharing.

Non-Deterministic Windows can still utilise eager pre-aggregation.

Page 23: Cutty Architecture

[Figure: Cutty architecture diagram.]

Pages 24-32: Cutty - Demo

[Animation over the records 1 2 3 ... 10, showing the Active Partial, the Stored Partials and the Windows at each step. The stored partials grow from 3 to 3, 12 and then to 3, 12, 13, with the values 15 and 28 shown above them; the values shown under Windows are 15, 21 and 33.]
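
The demo can be approximated with a short sketch that cuts a new slice only when a window begins and answers each completed window from the stored slice partials plus the active one. Everything here is an assumption for illustration: the window boundaries (begins at records 1, 3, 6, 8 and ends at records 5 and 8) are read off the partials 3, 12, 13 and the results 15 and 33 in the demo, stored partials are combined with a plain scan (the lazy variant) rather than through an eager aggregation tree, and the function names are hypothetical.

```python
from collections import deque

def cutty_sum(records, window_fn):
    """Slice at window begins and emit (first, last, sum) per completed window."""
    stored = []             # (slice start position, partial sum) of closed slices
    active = None           # partial sum of the slice currently being filled
    active_start = None
    open_windows = deque()  # begin positions of open windows, oldest first
    results = []
    for position, value in enumerate(records, start=1):
        begins, ends = window_fn(position)
        if begins > 0 and active is not None:
            # A window begins here: close the active slice and store its partial.
            stored.append((active_start, active))
            active = None
        if active is None:
            active_start, active = position, value
        else:
            active += value
        open_windows.extend([position] * begins)
        for _ in range(ends):
            start = open_windows.popleft()
            total = active + sum(p for s, p in stored if s >= start)
            results.append((start, position, total))
        # Drop slices that no open window can still need.
        oldest = open_windows[0] if open_windows else position + 1
        stored = [(s, p) for s, p in stored if s >= oldest]
    return results

BEGINS_AT = {1, 3, 6, 8}  # assumed window begins
ENDS_AT = {5, 8}          # assumed window ends

def demo_windows(position):
    return (1 if position in BEGINS_AT else 0, 1 if position in ENDS_AT else 0)

results = cutty_sum([1, 2, 3, 4, 5, 6, 7, 8], demo_windows)
print(results)
```

Each record is added to exactly one partial, and the number of stored partials depends on how many windows begin, not on how many records arrive.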

Page 33: Implementation

• Apache Flink
  • UDW API (Contributed to Apache Flink - 0.9)
  • Shared Aggregation Operator (experimental)
  • Optimiser collocates parallel windows in operators

• Aggregate Store
  • Adaptation of FlatFAT [1]
  • Circular Resizable Buffer Strategies
  • Non-Eager Strategy Supported for Experiments

1. General incremental sliding-window aggregation - VLDB 15

Page 34: Performance Analysis - Periodic Window Aggregation (DEBS12 dataset)

[Figure: four panels. (1) Number of Partials (×10^5) vs. Number of Queries (20 to 100); series: Cutty, Pairs/Pairs+. (2) Number of Records for COUNT-RANGES and COUNT-SLIDES. (3) Throughput (records/sec) vs. Number of Queries (20 to 100); series: Cutty, Pairs+, RA. (4) Total Reduce Calls (log scale, 10^4 to 10^11) vs. Number of Queries (1 to 100); series: Cutty (eager), Cutty (lazy), Pairs+, Pairs, RA, Naive.]

Page 35: Performance Analysis - Session Window Aggregation (DEBS12 dataset)

[Figure: three panels. (1) Number of Records for SESSION LENGTHS. (2) Total Reduce Calls (log scale, 10^3 to 10^9) vs. Number of Queries (1 to 100). (3) Max Allocation (#partials, log scale, 10^0 to 10^6) vs. Number of Queries (1 to 100). Series in the legend: Cutty (UPD), Cutty (MERGE), RA (UPD), RA (MERGE).]

Page 36: No limits in multiplexing

[Figure: distance [in km] and time [in min.] axes. Legend: Slice, Window, Window Begin, Record.]

Page 37: Summary

• UDWs extend the potential of pre-aggregation in window classes beyond fixed periodic windows.

• Cutty takes slicing a step further in terms of computational efficiency and combines seamlessly with eager aggregation.

• First work that addresses multi-query aggregation across diverse window types.

Page 38: Thank you!

@SenorCarbone

https://flink.apache.org/
https://github.com/apache/flink