Aggregate Sharing for User-Defined Data Stream Windows
TRANSCRIPT
@CIKM16
Cutty: Aggregate Sharing for User-Defined Windows
Paris Carbone, Jonas Traub, Asterios Katsifodimos, Seif Haridi, Volker Markl
Presentation: Paris Carbone, PhD Candidate @ KTH Sweden, Committer @ Apache Flink
4 Reasons
Not to check your email during this talk
1. Windows are the backbone of data stream analysis.
2. We generalise the concept of data stream windows.
3. Our technique makes aggregations on general stream windows more efficient than ever.
4. We can multiplex and share aggregations of diverse types of sliding windows that run simultaneously.
Outline
• Partial Sliding Window Aggregation
• Fundamental Limitations of Existing Approaches
• Introducing User-Defined Windows (UDWs)
• Multi-Query Aggregation of UDWs with Cutty
• Performance Comparison
Window Aggregation
Stream: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 …
Window 1: 1 2 3 4 5
Window 2: 3 4 5 6 7
Window 3: 5 6 7 8 9
Stream discretization function: f_d
Window Aggregation
Stream: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 …
Window 1: 1 2 3 4 5 -> A1
Window 2: 3 4 5 6 7 -> A2
Window 3: 5 6 7 8 9 -> A3
Aggregation function: f_a
Partial Aggregation
1. lift: record -> (val, count)
2. combine: (val1, count1), (val2, count2) -> (val1 + val2, count1 + count2)
3. lower: (val, count) -> val / count
Types: record type -> partial type -> aggr type
Example - AVG over the window 1 2 3 4 5:
lift(1) = (1,1), lift(2) = (2,1), combine -> (3,2)
combine with (3,1) -> (6,3)
combine with (4,1) -> (10,4)
combine with (5,1) -> (15,5)
lower -> A1 = 3
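The lift / combine / lower steps for AVG can be sketched in a few lines of Python. This is only an illustration of the three functions named on the slide; the function names and the tuple layout `(val, count)` follow the slide, while `average` is a helper name assumed here.

```python
from functools import reduce

def lift(record):
    # record -> partial: wrap a raw value as (val, count)
    return (record, 1)

def combine(a, b):
    # partial x partial -> partial: add sums and counts component-wise
    return (a[0] + b[0], a[1] + b[1])

def lower(partial):
    # partial -> final aggregate: the average
    val, count = partial
    return val / count

def average(records):
    # Hypothetical helper: lift every record, fold with combine, then lower.
    return lower(reduce(combine, map(lift, records)))

print(average([1, 2, 3, 4, 5]))
```

Folding the window 1 2 3 4 5 passes through the same partials as the slide: (3,2), (6,3), (10,4), (15,5).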
Partial Aggregation
• The number of combine invocations determines the computational complexity.
• Commutativity & Associativity are typically assumed.
Redundancy in Sliding Window Aggregation
Stream: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 …
Window 1: 1 2 3 4 5
Window 2: 3 4 5 6 7
Window 3: 5 6 7 8 9
…
Overlapping means redundant combine calls
We need to optimise…
Optimise…which windows?
[Figure: window types: tumbling, periodic (single-type, multi-type), punctuation, snapshot, FCF/CF, lower-bound, session, ADWIN, delta-based, FCA. Periodic windows are covered by slicing: pre-compute non-overlapping partials.]
Slicing
Example - Count Window, range: 10, slide: 3
If a sliding window can be defined in terms of a fixed range and slide, the system can pre-aggregate consecutive, non-overlapping slices (periodic windows).
• Panes [1]: slice length = gcd(range, slide)
• Pairs [2]: two slices per slide, p2 = range % slide, p1 = slide - p2
• Cutty (preview)
[Figure: records 1 to 19 sliced by Panes, Pairs and Cutty.]
1. No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams - SIGMOD 2005
2. On-the-Fly Sharing for Streamed Aggregation - SIGMOD 2006
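A minimal sketch of the two slice-length rules, applied to the count window from the example (range 10, slide 3). The formulas are the ones on the slide; the function names are assumptions made for illustration.

```python
from math import gcd

def panes_slice_length(window_range, slide):
    # Panes: every slice has length gcd(range, slide)
    return gcd(window_range, slide)

def pairs_slice_lengths(window_range, slide):
    # Pairs: each slide is split into two slices, p1 followed by p2
    p2 = window_range % slide
    p1 = slide - p2
    return p1, p2

print(panes_slice_length(10, 3))
print(pairs_slice_lengths(10, 3))
```

For range 10 and slide 3, Panes cuts a slice after every record (length 1), while Pairs alternates slices of length 2 and 1.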
Slicing - Observations
• Computational Complexity -> upd: O(1), merge: O(#partials)
• Space Complexity (#stored sliced partials):
  Panes: $\lceil \mathrm{range} / \gcd(\mathrm{range}, \mathrm{slide}) \rceil$ -> 10 partials
  Pairs: $\lceil 2 \times \mathrm{range} / \mathrm{slide} \rceil$ -> 7 partials
  (…from previous example: range 10, slide 3)
• Similar space requirements when range is a multiple of slide
• Pairs has been extended for multi-query aggregation sharing
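The two memory bounds can be checked against the example numbers with a short computation. This is a sketch of the formulas as stated on the slide; the function names are assumed.

```python
from math import ceil, gcd

def panes_partials(window_range, slide):
    # ceil(range / gcd(range, slide)) stored partials
    return ceil(window_range / gcd(window_range, slide))

def pairs_partials(window_range, slide):
    # ceil(2 * range / slide) stored partials
    return ceil(2 * window_range / slide)

print(panes_partials(10, 3), pairs_partials(10, 3))
```

With range 10 and slide 3 this gives 10 partials for Panes and 7 for Pairs, matching the slide.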
Optimise…which windows?
[Figure: window types: tumbling, periodic (single-type, multi-type), punctuation, snapshot, FCF/CF, lower-bound, session, ADWIN, delta-based, FCA. Periodic: slicing (pre-compute non-overlapping partials). Non-Periodic: eager pre-aggregation (pre-compute overlapping partials for arbitrary aggregation lookups).]
Eager Pre-Aggregation
When windowing cannot be expressed simply by a range and slide: eagerly pre-compute partial aggregates and update a binary tree, bottom-up.
Records (leaves): 1 2 3 4 5 6 7 8
Level 1: 3 7 11 15
Level 2: 10 26
Root: 36
[Figure: records 9 and 10 arrive and the tree is updated bottom-up; the updated partials shown are 19, 30 and 21.]
• n leaves ~ records, plus n pre-computed partials: 2n in total
• arbitrary window lookups: log n
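The bottom-up tree can be sketched as an array-backed binary tree of partial sums. This is a simplified illustration under stated assumptions (fixed capacity that is a power of two, SUM as the aggregate, no eviction), not the FlatFAT or B-Int implementation; the class and method names are hypothetical.

```python
class AggregateTree:
    """Complete binary tree of partial sums stored in a flat list."""

    def __init__(self, capacity):
        # Assumption: capacity is a power of two.
        self.n = capacity
        # Index 1 is the root; leaves live at indices n .. 2n-1.
        self.tree = [0] * (2 * capacity)

    def update(self, i, value):
        # Write leaf i, then recompute its ancestors bottom-up: O(log n).
        pos = self.n + i
        self.tree[pos] = value
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def query(self, lo, hi):
        # Sum of leaves in the half-open range [lo, hi): O(log n).
        result = 0
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                result += self.tree[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                result += self.tree[hi]
            lo //= 2
            hi //= 2
        return result

tree = AggregateTree(8)
for i, record in enumerate([1, 2, 3, 4, 5, 6, 7, 8]):
    tree.update(i, record)
print(tree.tree[4:8], tree.tree[2:4], tree.tree[1])
print(tree.query(2, 7))
```

After inserting records 1 to 8, the inner levels hold 3 7 11 15, then 10 26, then 36, as on the slide. The lookup `query(2, 7)` covers records 3 to 7 by combining three stored partials instead of five records.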
Eager Pre-Aggregation - Observations
• Implementations: FlatFAT [1], B-Int [2]
• High Space Complexity (#raw records…twice)
• Most pre-aggregates are never used
• Update + Aggregation complexity: log(#leaves)
• Generic and Suitable for Ad-Hoc Queries
• Potential for Multi-Query Window Pre-Aggregation
1. General incremental sliding-window aggregation - VLDB 15
2. Resource sharing in continuous sliding-window aggregates - VLDB 04
[Figure: window types: tumbling, periodic (single-type, multi-type), punctuation, snapshot, FCF/CF, lower-bound, session, ADWIN, delta-based, FCA. Periodic: efficient slicing. Non-Periodic: generic, high-cost pre-aggregation.]
[Figure: window types: tumbling, periodic (single-type, multi-type), punctuation, snapshot, FCF/CF, lower-bound, session, ADWIN, delta-based, FCA. Deterministic: efficient slicing. Non-Deterministic: generic, high-cost pre-aggregation.]
Deterministic Windows: Intuition
[Figure: price [in USD] over time [in min.] (0 to 35) with a threshold at 10, showing slices and higher-order partials. Legend: Window, Window Begin, Threshold, Pre-Aggregate.]
We only need to determine when new windows start.
User-Defined Windows
Deterministic: Expressed as a UDF that assigns each record to a number of new or complete windows.
Non-Deterministic: Expressed as a UDF that assigns a record to complete windows and a reference to their beginning.
Trivial templating of existing window types
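To make the "templating" point concrete, here is a rough sketch of a periodic count window written as a deterministic UDF. The interface is hypothetical and is not the Flink UDW API: it assumes the UDF receives the 1-based position of a record and returns how many windows begin at it and how many are completed by it.

```python
def periodic_count_window(window_range, slide):
    # Template for a count window with a fixed range and slide.
    def on_record(position):
        # Windows begin at positions 1, 1 + slide, 1 + 2 * slide, ...
        begins = 1 if (position - 1) % slide == 0 else 0
        # A window that began at b is completed by record b + range - 1.
        ends = 1 if position >= window_range and (position - window_range) % slide == 0 else 0
        return begins, ends
    return on_record

udw = periodic_count_window(5, 2)
print([udw(p) for p in range(1, 8)])
```

With range 5 and slide 2, windows begin at records 1, 3, 5, 7 and complete at records 5 and 7, which matches the windows 1 to 5 and 3 to 7 from the Window Aggregation slides.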
Cutty Concept
[Figure: price [in USD] over time [in min.] with slices and higher-order partials (legend: Window, Window Begin, Threshold, Pre-Aggregate). Cutty combines Slicing with Eager Pre-Aggregation.]
Cutty Overview
• Exploits Deterministic Windows for the most efficient aggregation slicing to date.
• Utilises eager pre-aggregation at a low memory cost over optimally sliced partials.
• Supports both single and multi-query multiplexed execution out-of-the-box for efficient operator sharing.
• Non-Deterministic Windows can still utilise eager pre-aggregation.
Cutty Architecture
[Figure: Cutty architecture diagram.]
Cutty - Demo
[Animation over records 1 to 10, showing the Active Partial, the Stored Partials (slice partials and higher-order partials) and the emitted Windows after each step.]
Last frame shown: Active Partial 8; Stored Partials 3, 12, 13 (slices) with higher-order partials 15 and 28; Windows 15, 21, 33.
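The slicing idea behind the demo can be approximated as follows: cut a new slice only when a window begins, keep one partial per closed slice, and answer a completed window by combining the slices from its beginning plus the active partial. This is a simplified sketch, not the Cutty implementation: it assumes SUM as the aggregate, merges stored slices lazily with a linear scan instead of a tree, never evicts old slices, and takes windows as explicit (begin, end) record positions. The function name is hypothetical, and the windows used are the ones from the Window Aggregation slides rather than those of the demo.

```python
def slice_and_aggregate(records, windows):
    # windows: list of (begin, end) record positions, 1-based and inclusive.
    begins = {b for b, _ in windows}
    ends = {}
    for b, e in windows:
        ends.setdefault(e, []).append(b)

    slice_starts = []    # start position of every closed slice
    slice_partials = []  # partial sum of every closed slice
    active_start, active = None, 0
    results = []

    for pos, value in enumerate(records, start=1):
        if pos in begins and active_start is not None:
            # A new window begins: close the active slice and store its partial.
            slice_starts.append(active_start)
            slice_partials.append(active)
            active_start, active = pos, 0
        elif active_start is None:
            active_start = pos
        active += value  # O(1) update of the active partial

        for b in ends.get(pos, []):
            # Window (b, pos) is complete: combine the slices that start at or
            # after b with the active partial.
            total = active
            for start, partial in zip(slice_starts, slice_partials):
                if start >= b:
                    total += partial
            results.append(((b, pos), total))
    return results

print(slice_and_aggregate(list(range(1, 11)), [(1, 5), (3, 7), (5, 9)]))
```

For records 1 to 10 only two closed slices (records 1 to 2 and 3 to 4) plus the active partial are needed to answer all three windows, because slices are cut only at window beginnings.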
Implementation
• Apache Flink
  • UDW API (Contributed to Apache Flink - 0.9)
  • Shared Aggregation Operator (experimental)
  • Optimiser collocates parallel windows in operators
• Aggregate Store
  • Adaptation of FlatFAT [1]
  • Circular Resizable Buffer Strategies
  • Non-Eager Strategy Supported for Experiments
1. General incremental sliding-window aggregation - VLDB 15
Performance Analysis: Periodic Window Aggregation (DEBS12 dataset)
[Figure panels:
 - Number of Partials (×10^5) vs. Number of Queries (20 to 100): Cutty, Pairs/Pairs+
 - Number of Records for COUNT-RANGES and COUNT-SLIDES
 - Throughput (records/sec) vs. Number of Queries (20 to 100): Cutty, Pairs+, RA
 - Total Reduce Calls (log scale) vs. Number of Queries (1 to 100): Cutty (eager), Pairs+, Cutty (lazy), Pairs, RA, Naive]
Performance Analysis: Session Window Aggregation (DEBS12 dataset)
[Figure panels:
 - Number of Records for SESSION LENGTHS
 - Total Reduce Calls (log scale) vs. Number of Queries (1 to 100): Cutty (UPD), Cutty (MERGE), RA (UPD), RA (MERGE)
 - Max Allocation (#partials, log scale) vs. Number of Queries (1 to 100)]
No limits in multiplexing
[Figure: plot with distance [in km] and time [in min.] axes. Legend: Slice, Window, Window Begin, Record.]
Summary
• UDWs extend the potential of pre-aggregation to window classes beyond fixed periodic windows.
• Cutty takes slicing a step further in computational efficiency and combines seamlessly with eager aggregation.
• First work that addresses multi-query aggregation across diverse window types.
Thank you!
@SenorCarbone
https://flink.apache.org/
https://github.com/apache/flink