scalable approximate query processing
Post on 23-Feb-2016
Scalable Approximate Query Processing
Florin Rusu
2
Data Explosion
• Data storage advancements
  – Price / capacity ($70 / 1 TB)
• Human generated
  – Web 2.0 & social networking
    • User data
  – Communication
    • Network & web logs (eBay – 50 TB / day)
    • Call Detail Records (CDRs)
• Scientific experiments
  – LHC (Large Hadron Collider)
  – SKA (Square Kilometer Array) – 1 EB (10^18) / day
  – Sensor networks
04/19/2010
3
Large-Scale Data Analytics
• Traditional DB (OLTP)
  – Multi-user transaction processing
  – Optimized for specific workloads (views, indexes, …)
• Analytic processing (OLAP)
  – Data cubes
    • Aggregate at different hierarchical levels
    • Pre-defined aggregates, not flexible
  – Shared-nothing architectures (MPP)
    • Startups: Netezza, Greenplum, AsterData, Vertica, …
    • Parallel databases on clusters of computers
    • Storage layer (row store, column store, hybrid)
    • Compression
4
Interactive Data Analysis & Exploration
• Ad-hoc queries
• Compute statistical aggregates over all data
• Example: web log analysis
  – Documents (URL, Content)
  – UserVisits (IP, URL, Date, Duration)
  – “How much time did users spend searching for cars during the period May – July 2009?”

SELECT SUM(UV.Duration)
FROM Documents D, UserVisits UV
WHERE D.URL = UV.DocURL AND
      D.Content contains ‘car’ AND
      UV.Date between [05-01-09, 07-31-09]
5
Roadmap
• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples
6
Query Execution
Documents
URL  Content
A    car
B    car
C    car
D    phone
E    car
F    car
G    car
H    PC
I    car
J    car

UserVisits
IP  URL  Date      Duration
1   A    05-30-09  45
1   B    06-01-09  60
1   J    06-01-09  30
1   D    05-15-09  90
1   I    04-28-09  35
2   A    04-30-09  60
2   F    06-15-09  15
2   G    06-13-09  10
2   E    06-01-09  20
2   E    07-10-09  35
3   C    04-28-09  25
3   B    05-23-09  25
3   J    05-29-09  35
3   I    06-13-09  25
3   D    06-09-09  40
4   C    07-30-09  50
4   H    05-14-09  75
4   H    08-02-09  65
4   G    07-23-09  90
4   F    06-16-09  5
Query plan: Σ( σ(UV) ⋈ σ(D) )
• Selection push-down
• Sort-Merge Join
• Aggregate
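The three steps of this plan (selections pushed below the join, join on URL, aggregate) can be sketched on the toy tables in a few lines of Python. This is only an illustration under stated assumptions: it uses a set lookup where the slides use a sort-merge join, and the function name `exact_sum` and the (month, day) date encoding are hypothetical choices made for the example.

```python
# Toy tables copied from the slide.
documents = [
    ("A", "car"), ("B", "car"), ("C", "car"), ("D", "phone"), ("E", "car"),
    ("F", "car"), ("G", "car"), ("H", "PC"), ("I", "car"), ("J", "car"),
]
user_visits = [  # (IP, URL, (month, day), Duration); all dates are in 2009
    (1, "A", (5, 30), 45), (1, "B", (6, 1), 60), (1, "J", (6, 1), 30),
    (1, "D", (5, 15), 90), (1, "I", (4, 28), 35),
    (2, "A", (4, 30), 60), (2, "F", (6, 15), 15), (2, "G", (6, 13), 10),
    (2, "E", (6, 1), 20), (2, "E", (7, 10), 35),
    (3, "C", (4, 28), 25), (3, "B", (5, 23), 25), (3, "J", (5, 29), 35),
    (3, "I", (6, 13), 25), (3, "D", (6, 9), 40),
    (4, "C", (7, 30), 50), (4, "H", (5, 14), 75), (4, "H", (8, 2), 65),
    (4, "G", (7, 23), 90), (4, "F", (6, 16), 5),
]


def exact_sum(documents, user_visits, lo=(5, 1), hi=(7, 31)):
    # Selection on Documents, projected to the join attribute.
    car_urls = {url for url, content in documents if "car" in content}
    # Selection on UserVisits, projected to (URL, Duration).
    in_range = [(url, dur) for _, url, day, dur in user_visits if lo <= day <= hi]
    # Join on URL (set lookup) and aggregate.
    return sum(dur for url, dur in in_range if url in car_urls)


print(exact_sum(documents, user_visits))  # 445
```

The value 445 matches the final result reported later in the walkthrough.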
7
Selection
• Storage manager
• One thread for each table scan
• Project out unused columns
8
Selection
• Tuples are pipelined into join

Documents after selection (URL): A, B, C, E, F, G, I, J

UserVisits after selection (URL, Duration): (A,45) (B,60) (J,30) (D,90) (F,15) (G,10) (E,20) (E,35) (B,25) (J,35) (I,25) (D,40) (C,50) (H,75) (G,90) (F,5)
9
Sort-Merge Join – Sort Phase
• Sort tuples on join attribute
• Write sorted runs to disk
• Buffer space: UV(8)

Documents (URL): A, B, C, E, F, G, I, J

UserVisits buffer 1 (URL, Duration): (A,45) (B,60) (J,30) (D,90) (F,15) (G,10) (E,20) (E,35)
Run 1: (A,45) (B,60) (D,90) (E,20) (E,35) (F,15) (G,10) (J,30)

UserVisits buffer 2 (URL, Duration): (B,25) (J,35) (I,25) (D,40) (C,50) (H,75) (G,90) (F,5)
Run 2: (B,25) (C,50) (D,40) (F,5) (G,90) (H,75) (I,25) (J,35)
10
Sort-Merge Join – Merge Phase
Run 1 (remaining): (D,90) (E,20) (E,35) (F,15) (G,10) (J,30)
Run 2 (remaining): (C,50) (D,40) (F,5) (G,90) (H,75) (I,25) (J,35)
Documents run (remaining): B, C, E, F, G, I, J
Merged UserVisits tuples: (B,25) (B,60)
Joined tuples: Documents A with UserVisits (A,45)
Join output (Duration): 45
11
Sort-Merge Join – Merge Phase
Run 1 (remaining): (F,15) (G,10) (J,30)
Run 2 (remaining): (G,90) (H,75) (I,25) (J,35)
Documents run (remaining): F, G, I, J
Merged UserVisits tuples: (E,20) (E,35) (F,5)
Current Documents tuple: E
UserVisits tuples for D (no matching Documents tuple): (D,40) (D,90)
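The sort and merge phases can be sketched as follows. This is a minimal in-memory illustration, assuming the 8-tuple buffer from the slides; real runs would be written to disk, and the names `sorted_runs` and `merge_join_sum` are hypothetical. It relies on Documents having unique URLs, as in the example.

```python
import heapq

# UserVisits tuples (URL, Duration) that pass the date selection, in arrival order.
uv = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
      ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
      ("C", 50), ("H", 75), ("G", 90), ("F", 5)]
# Documents URLs that pass the content selection, already sorted.
docs = ["A", "B", "C", "E", "F", "G", "I", "J"]


def sorted_runs(tuples, buffer_size):
    # Sort phase: fill the buffer, sort it on the join attribute, emit a run.
    return [sorted(tuples[i:i + buffer_size], key=lambda t: t[0])
            for i in range(0, len(tuples), buffer_size)]


def merge_join_sum(docs, runs):
    # Merge phase: merge the runs on URL and advance the Documents cursor in step.
    total = 0
    doc_iter = iter(docs)
    doc = next(doc_iter, None)
    for url, duration in heapq.merge(*runs, key=lambda t: t[0]):
        while doc is not None and doc < url:
            doc = next(doc_iter, None)
        if doc == url:
            total += duration  # aggregate as join results are produced
    return total


runs = sorted_runs(uv, buffer_size=8)
print(runs[0])  # Run 1, sorted on URL
print(runs[1])  # Run 2, sorted on URL
print(merge_join_sum(docs, runs))  # 445
```

Tuples for D and H are skipped during the merge because no Documents tuple with those URLs survives the selection.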
12
Aggregation
• Update the sum as tuples are produced

Running sum (Duration): 0
Join output (Duration): 45
Updated sum (Duration): 45
13
Final Result
Join output (Duration): 45, 25, 60, 50, 20, 35, 15, 5, 10, 90, 25, 30, 35
Sum (Duration): 445
14
Roadmap
• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples
15
What is the problem?
• TPC-H benchmark results (price / performance)
  – 10 TB scale
    • 928 hard disks (90 TB total storage capacity)
    • 16 × quad-core processors
    • 512 GB RAM
    • $1.5 million
  – Load time: 55 hours
  – Q1: linear scan over one table with aggregates on top
    • 1 query: 19 minutes
    • 9 queries: 3 hours (linear scaling)
16
Approximate Query Processing
[Figure: query result over time, comparing traditional query processing with a result estimate and its confidence bounds]

SELECT SUM f(r1 • r2 • … • rn)
FROM R1 as r1, R2 as r2, …, Rn as rn
17
DBO System Architecture [Rusu et al. 2008]
[Figure: DBO system architecture with components DB Engine (query plan Σ( σ(UV) ⋈ σ(D) ), Query Result), Levelwise Step Controller, In-Memory Join (UV' ⋈ D'), and Estimation Module (result, confidence bounds), which returns the approximate answer; steps numbered 1–7]
18
Roadmap
• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples
19
Sampling [Dobra, Jermaine, Rusu & Xu 2009]
• Control, coordinate & schedule data flow between operators
• Embed randomness in each operator
Sampling – Selection
• Data in random order
• Assign random timestamp to tuples
• Controller schedules data flow between operators

Documents (random order; X = rejected by the selection)
URL  Content  Timestamp
J    car      68
F    car      220
C    car      312
D    phone    X
A    car      389
B    car      447
G    car      515
H    PC       X
I    car      695
E    car      799

UserVisits (X = rejected by the selection)
IP  URL  Date      Duration  Timestamp
1   A    05-30-09  45        70
1   B    06-01-09  60        140
1   J    06-01-09  30        185
1   D    05-15-09  90        252
1   I    04-28-09  35        X
2   A    04-30-09  60        X
2   F    06-15-09  15        358
2   G    06-13-09  10        409
2   E    06-01-09  20        476
2   E    07-10-09  35        495
3   C    04-28-09  25        X
3   B    05-23-09  25        722
3   J    05-29-09  35        739
3   I    06-13-09  25        745
3   D    06-09-09  40        791
4   C    07-30-09  50        798
4   H    05-14-09  75        837
4   H    08-02-09  65        X
4   G    07-23-09  90        953
4   F    06-16-09  5         973

Documents buffer (URL, timestamp): (J,68) (F,220) (C,312) (A,389)
UserVisits buffer (URL, Duration, timestamp): (A,45,70) (B,60,140) (J,30,185) (D,90,252)
Documents tuple read: J
In-Memory Join, Documents side: J
Documents buffer after the read: (F,220) (C,312) (A,389) (B,447)
Sampling – Selection
Documents buffer (URL, timestamp): (F,220) (C,312) (A,389) (B,447)
UserVisits buffer (URL, Duration, timestamp): (B,60,140) (J,30,185) (D,90,252) (F,15,358)
UserVisits tuple read (URL, Duration): (A,45)
In-Memory Join, Documents side: J
Sampling – Selection
Documents buffer (URL, timestamp): (F,220) (C,312) (A,389) (B,447)
UserVisits buffer (URL, Duration, timestamp): (D,90,252) (F,15,358) (G,10,409) (E,20,476)
UserVisits tuple read (URL, Duration): (J,30)
In-Memory Join, Documents side: J
In-Memory Join, UserVisits side: (J,30)
Sampling – Selection
Documents buffer (URL, timestamp): (G,515) (I,695) (E,799)
UserVisits buffer (URL, Duration, timestamp): (B,25,722) (J,35,739) (I,25,745) (D,40,791)
In-Memory Join, Documents side: J, F, C, A, B
In-Memory Join, UserVisits side: (J,30) (F,15)
Estimate at 50% input: 360; [-328, 1048] with 95% probability
Sampling – Selection
Documents buffer (URL, timestamp): (E,799)
UserVisits buffer (URL, Duration, timestamp): (I,25,745) (D,40,791) (C,50,798) (H,75,837)
In-Memory Join, Documents side: J, F, C, A, B, G, I
In-Memory Join, UserVisits side: (J,30) (F,15) (B,25) (J,35)
In-Memory Join capacity (10 tuples) exceeded: eliminate tuples such that variance is minimized.
Sampling – Selection
Documents buffer (URL, timestamp): (E,799)
UserVisits buffer (URL, Duration, timestamp): (I,25,745) (D,40,791) (C,50,798) (H,75,837)
In-Memory Join, Documents side: J, A, B, G
In-Memory Join, UserVisits side: (J,30) (B,25) (J,35)
Estimate at 74% input: 258; [-293, 808] with 95% probability
Sampling – Selection
Documents buffer: empty
UserVisits buffer: empty
In-Memory Join, Documents side: J, A, B, G, E
In-Memory Join, UserVisits side: (J,30) (B,25) (J,35) (G,90)
Estimate with all input: 448; [3, 892] with 95% probability
27
Sampling Estimation – Intermediate Levels
• Query result estimator & variance estimator computed from result tuples found by In-Memory Join
• Confidence bounds derived with Central Limit Theorem
• Solve optimization problem to keep bounds stable when tuples are deleted from In-Memory Join

[Equations: the estimator, defined as a sum of f(t_1 • … • t_n) over tuples t_1 ∈ R_1, …, t_n ∈ R_n with an indicator on the tuple timestamps TS(t_i); its expectation E; and its variance Var]
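To make the Central Limit Theorem step concrete, here is a minimal single-table sketch of a running SUM estimate with 95% bounds over a random-order scan. It is a simplification and not the DBO multi-relation estimator: it treats each UserVisits row as contributing either its Duration (if it qualifies and joins) or 0, and it assumes simple random sampling without replacement. The function name `running_estimate` and the seed are hypothetical.

```python
import math
import random


def running_estimate(values, seen, z=1.96):
    """Estimate SUM(values) from the first `seen` values of a random-order scan.

    Returns (estimate, half_width) of a CLT-based interval; z=1.96 gives ~95%.
    """
    n_total = len(values)
    sample = values[:seen]
    mean = sum(sample) / seen
    estimate = n_total * mean  # scale the sample mean up to the full table
    if seen < 2:
        return estimate, math.inf  # no variance estimate from a single value
    variance = sum((v - mean) ** 2 for v in sample) / (seen - 1)
    # Finite population correction: the interval closes once all data is seen.
    fpc = 1 - seen / n_total
    half_width = z * n_total * math.sqrt(variance / seen * fpc)
    return estimate, half_width


# Per-row contribution of each UserVisits tuple to the example query
# (Duration if the row satisfies the predicates and joins, else 0).
contributions = [45, 60, 30, 0, 0, 0, 15, 10, 20, 35,
                 0, 25, 35, 25, 0, 50, 0, 0, 90, 5]

scan = contributions[:]
random.Random(0).shuffle(scan)  # hypothetical seed; any random order works
for seen in (5, 10, 15, 20):
    est, hw = running_estimate(scan, seen)
    print(f"{seen:2d} tuples: {est:7.1f} in [{est - hw:7.1f}, {est + hw:7.1f}]")
```

With all 20 rows seen, the estimate equals the exact answer 445 and the interval has zero width; with only 20 rows the normal approximation is rough, so the intermediate bounds are illustrative only.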
28
Sampling – Join (Sort)
• Sort tuples on random function of join attribute

Documents (URL, random value): (J,888) (F,67) (C,489) (A,227) (B,987) (G,51) (I,342) (E,739)
Documents Run 1: (F,67) (A,227) (C,489) (J,888)
Documents Run 2: (G,51) (I,342) (E,739) (B,987)

UserVisits (URL, Duration, random value): (A,45,227) (B,60,987) (J,30,888) (D,90,43) (F,15,67) (G,10,51) (E,20,739) (E,35,739) (B,25,987) (J,35,888) (I,25,342) (D,40,43) (C,50,489) (H,75,150) (G,90,51) (F,5,67)
UserVisits Run 1: (D,90,43) (G,10,51) (F,15,67) (A,45,227) (E,20,739) (E,35,739) (J,30,888) (B,60,987)
UserVisits Run 2: (D,40,43) (G,90,51) (F,5,67) (H,75,150) (I,25,342) (C,50,489) (J,35,888) (B,25,987)
29
Sampling – Join (Merge)
Running sum (Duration, random value): (0, 0)
UserVisits Run 1 (remaining): (G,10,51) (F,15,67) (A,45,227) (E,20,739) (E,35,739) (J,30,888) (B,60,987)
UserVisits Run 2 (remaining): (G,90,51) (F,5,67) (H,75,150) (I,25,342) (C,50,489) (J,35,888) (B,25,987)
Documents Run 1: (F,67) (A,227) (C,489) (J,888)
Documents Run 2: (G,51) (I,342) (E,739) (B,987)
Merged UserVisits tuples: (G,10,51) (G,90,51)
Merged Documents tuples: (G,51) (F,67)
In-Memory Join input: Documents (G,51); UserVisits (G,10,51) (G,90,51)
In-Memory Join output (Duration, random value): (10,51) (90,51)
Running sum (Duration, random value): (100, 51)
30
Sampling – Join (Merge)
UserVisits Run 1 (remaining): (E,20,739) (E,35,739) (J,30,888) (B,60,987)
UserVisits Run 2 (remaining): (C,50,489) (J,35,888) (B,25,987)
Documents Run 1 (remaining): (C,489) (J,888)
Documents Run 2 (remaining): (E,739) (B,987)
Merged UserVisits tuples: (C,50,489) (E,20,739) (E,35,739)
Merged Documents tuples: (C,489) (E,739)
In-Memory Join input: Documents (C,489); UserVisits (C,50,489)
In-Memory Join output (Duration, random value): (50,489)
Running sum (Duration, random value): (240, 489)
Estimate at 50% input: 468; [194, 741] with 95% probability
31
Sampling – Join (Merge)
UserVisits Run 1 (remaining): (B,60,987)
UserVisits Run 2 (remaining): (B,25,987)
Documents Run 1 (remaining): empty
Documents Run 2 (remaining): (B,987)
Merged UserVisits tuples: (B,25,987) (B,60,987)
Merged Documents tuples: (B,987)
In-Memory Join input: Documents (B,987); UserVisits (B,25,987) (B,60,987)
In-Memory Join output (Duration, random value): (25,987) (60,987)
Running sum (Duration, random value): (445, 987)
32
Sampling Estimation – Upper Level
• Bernoulli sampling with probability given by domain fraction seen so far
• Consolidate tuples generated by same join key
• Solve optimization problem to minimize variance across levels
  – Keep confidence bounds stable
33
Contributions
• Design & implement DBO, first online analytical processing engine
  – Provide estimates & confidence bounds throughout entire query execution
  – SELECT-PROJECT-JOIN (SPJ) & GROUP BY queries over any number of relations
• Design & analyze fastest convergent estimation method for online aggregation
  – Statistics & optimization techniques
34
Roadmap
• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples
35
Sketches
• Build sketches on join attribute while data is read from disk
• Use attributes in aggregate
36
Sketches
Sign function (URL → +/-)
    A B C D E F G H I J
S1  + - - - - + + + - -

Bucket function (URL → bucket 1..3)
    A B C D E F G H I J
S1  1 2 3 1 1 2 2 3 3 3

Documents tuple read: A (S1: sign +, bucket 1)

Sketch of Documents (buckets 1 2 3)
S1 before update: 0 0 0
S1 after update:  1 0 0

Sketch of UserVisits (buckets 1 2 3)
S1: 0 0 0
37
Sketches
UserVisits tuple read (URL, Duration): (A,45) (S1: sign +, bucket 1)

Sketch of Documents (buckets 1 2 3)
S1: 1 0 0

Sketch of UserVisits (buckets 1 2 3)
S1 before update: 0 0 0
S1 after update:  45 0 0
38
Sketches
After all tuples are read:

Sketch of Documents (buckets 1 2 3)
S1: 0 1 -3

Sketch of UserVisits (buckets 1 2 3)
S1: -140 35 -65

Estimate from S1: 0·(-140) + 1·35 + (-3)·(-65) = 230
39
Sketches
Sign function (URL → +/-)
    A B C D E F G H I J
S1  + - - - - + + + - -
S2  + - + - + - + - + -
S3  - - - + + - + + - +

Bucket function (URL → bucket 1..3)
    A B C D E F G H I J
S1  1 2 3 1 1 2 2 3 3 3
S2  3 3 2 1 2 1 2 1 3 2
S3  1 1 2 1 3 1 3 2 3 2

Sketches of Documents
      1    2    3
S1    0    1   -3
S2   -1    2    1
S3   -3    0    1

Sketches of UserVisits
      1    2    3
S1 -140   35  -65
S2 -225  140  -15
S3  -20   90  130

Estimates
S1  230
S2  490
S3  190

Result: 230; [-416, 876] with 95% probability
40
Sketches Estimation
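• Two random processes
  – Bucket selection: $h : D \to H$
  – Sign: $\xi : D \to \{-1, +1\}$
• Sketch update
  $\mathrm{Sk}(R_i)[h(t_i.\mathrm{join})] = \mathrm{Sk}(R_i)[h(t_i.\mathrm{join})] + \xi(t_i.\mathrm{join}) \cdot f(t_i), \quad t_i \in R_i, \; i = 1, 2$
• Estimator
  $E\left[ \sum_{h \in H} \mathrm{Sk}(R_1)[h] \cdot \mathrm{Sk}(R_2)[h] \right] = \sum_{t_1 \in R_1} \sum_{t_2 \in R_2} f(t_1, t_2)$
• Confidence bounds
  – Multiple independent sketches
  – Chebyshev & Chernoff inequalities (worst-case)
  – Median Central Limit Theorem, Student-t distribution (statistics)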
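The sketch update and estimator can be checked against the worked example with a short Python sketch. The sign and bucket tables are copied from the slides; in practice they would come from pseudo-random hash functions. The helper names `build_sketch` and `estimate` are hypothetical, and combining the three estimates by the median is an assumption that happens to agree with the value 230 reported on the slide.

```python
from statistics import median

URLS = "ABCDEFGHIJ"
# Sign and bucket functions of sketches S1..S3, copied from the slides.
SIGNS = ["+----+++--", "+-+-+-+-+-", "---++-++-+"]
BUCKETS = ["1231122333", "3321212132", "1121313232"]

# Documents URLs with Content containing 'car' (each contributes the value 1).
docs = [(url, 1) for url in "ABCEFGIJ"]
# UserVisits tuples (URL, Duration) inside the date range.
uv = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
      ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
      ("C", 50), ("H", 75), ("G", 90), ("F", 5)]


def build_sketch(tuples, k, n_buckets=3):
    # One pass: add sign(key) * value to the bucket selected by the key.
    sketch = [0] * n_buckets
    for url, value in tuples:
        i = URLS.index(url)
        sign = 1 if SIGNS[k][i] == "+" else -1
        sketch[int(BUCKETS[k][i]) - 1] += sign * value
    return sketch


def estimate(k):
    # Estimator: bucket-wise product of the two sketches, summed over buckets.
    sk_d, sk_uv = build_sketch(docs, k), build_sketch(uv, k)
    return sum(a * b for a, b in zip(sk_d, sk_uv))


estimates = [estimate(k) for k in range(3)]
print(build_sketch(uv, 0))  # [-140, 35, -65]
print(estimates)            # [230, 490, 190]
print(median(estimates))    # 230
```

The exact answer is 445, so with only three buckets and three sketches the individual estimates are far off, which is consistent with the wide bounds on the slide.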
41
Pseudo-Random Number Generators [Rusu & Dobra 2006, 2007b]
• Detailed comparison of generating schemes
  – Abstract algebra (orthogonal arrays, vector spaces, prime & extension fields)
    • Degree of independence as function of seed size
    • Fast range-summable
  – Empirical evaluation
    • Generating time is a few processor cycles
• Identify EH3 as generator for sketches
  – Lowest possible degree of independence
  – 7.3 ns to generate single number
42
Statistical Analysis [Rusu & Dobra 2007a, 2008]
• Detailed comparison of sketch estimators
  – Same accuracy (worst-case analysis)
  – Statistical analysis
    • Distribution (probability density function)
    • Higher frequency moments (kurtosis)
    • Confidence bounds
  – Empirical evaluation
    • Data skew, correlation, memory usage, update time
• Identify Fast-AGMS as most reliable scheme
  – Accurate over entire range of data
  – Small memory footprint, fast update time
43
Roadmap
• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples
44
Sketches over Samples [Rusu & Dobra 2009]
• Data is stored in random order on disk
• Build sketches on join attribute while data is read from disk
• Use attributes in aggregate
• Provide estimates at any point
URL Content
J car
F car
C car
D phone
A car
B car
G car
H PC
I car
E car
IP URL Date Duration
1 A 05-30-09 45
1 B 06-01-09 60
1 J 06-01-09 30
1 D 05-15-09 90
1 I 04-28-09 35
2 A 04-30-09 60
2 F 06-15-09 15
2 G 06-13-09 10
2 E 06-01-09 20
2 E 07-10-09 35
3 C 04-28-09 25
3 B 05-23-09 25
3 J 05-29-09 35
3 I 06-13-09 25
3 D 06-09-09 40
4 C 07-30-09 50
4 H 05-14-09 75
4 H 08-02-09 65
4 G 07-23-09 90
4 F 06-16-09 5
45
Sketches over Samples
Sketches after reading 50% of the input:

Sketches of Documents
      1    2    3
S1    1    1   -2
S2   -1    0    1
S3   -2    0    0

Sketches of UserVisits
      1    2    3
S1 -100  -35  -30
S2 -105   35  -15
S3  -30   30   65

Estimates
S1 -300
S2  360
S3  240

50% input: 100; [-2382, 2582] with 95% probability
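The numbers on this slide can be reproduced with a short Python sketch: build the sketches from the first half of each relation in its on-disk order, then scale each sketch estimate by the inverse of the product of the two sampled fractions. The scaling rule and the averaging of the three estimates are inferences that match the slide's values (-300, 360, 240 and 100), not a statement of the authors' exact estimator; the helper names are hypothetical.

```python
URLS = "ABCDEFGHIJ"
# Sign and bucket functions of sketches S1..S3, copied from the slides.
SIGNS = ["+----+++--", "+-+-+-+-+-", "---++-++-+"]
BUCKETS = ["1231122333", "3321212132", "1121313232"]

# Both relations in the on-disk order shown on the slides.
documents = [("J", "car"), ("F", "car"), ("C", "car"), ("D", "phone"),
             ("A", "car"), ("B", "car"), ("G", "car"), ("H", "PC"),
             ("I", "car"), ("E", "car")]
user_visits = [  # (URL, (month, day), Duration); all dates are in 2009
    ("A", (5, 30), 45), ("B", (6, 1), 60), ("J", (6, 1), 30), ("D", (5, 15), 90),
    ("I", (4, 28), 35), ("A", (4, 30), 60), ("F", (6, 15), 15), ("G", (6, 13), 10),
    ("E", (6, 1), 20), ("E", (7, 10), 35), ("C", (4, 28), 25), ("B", (5, 23), 25),
    ("J", (5, 29), 35), ("I", (6, 13), 25), ("D", (6, 9), 40), ("C", (7, 30), 50),
    ("H", (5, 14), 75), ("H", (8, 2), 65), ("G", (7, 23), 90), ("F", (6, 16), 5),
]


def build_sketch(tuples, k, n_buckets=3):
    # One pass: add sign(key) * value to the bucket selected by the key.
    sketch = [0] * n_buckets
    for url, value in tuples:
        i = URLS.index(url)
        sign = 1 if SIGNS[k][i] == "+" else -1
        sketch[int(BUCKETS[k][i]) - 1] += sign * value
    return sketch


def sampled_estimates(fraction=0.5):
    # Sample = prefix of each relation, valid because the data is in random order.
    d_seen = documents[:int(len(documents) * fraction)]
    uv_seen = user_visits[:int(len(user_visits) * fraction)]
    d_in = [(url, 1) for url, content in d_seen if "car" in content]
    uv_in = [(url, dur) for url, day, dur in uv_seen if (5, 1) <= day <= (7, 31)]
    # Assumed scaling: inverse of the product of the two sampling fractions.
    scale = (len(documents) * len(user_visits)) / (len(d_seen) * len(uv_seen))
    results = []
    for k in range(len(SIGNS)):
        sk_d, sk_uv = build_sketch(d_in, k), build_sketch(uv_in, k)
        results.append(scale * sum(a * b for a, b in zip(sk_d, sk_uv)))
    return results


estimates = sampled_estimates(0.5)
print(estimates)                        # [-300.0, 360.0, 240.0]
print(sum(estimates) / len(estimates))  # 100.0
```

Calling `sampled_estimates(1.0)` falls back to the plain sketch estimates over the full input (230, 490, 190).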
46
Sketches over Samples – Estimation
• Define estimator over two completely different random processes & analyze statistically
  – Sampling – random partition, tuple domain
  – Sketches – random projection, frequency domain
  – Consider correlation between multiple sketches that share same sample
  – Moment generating functions
• Generic analysis independent of sampling process
  – Bernoulli sampling
  – Sampling without replacement
  – Sampling with replacement
47
Sketches over Samples – Analysis
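$X = C \sum_{i \in D} f'_i \xi_i \sum_{j \in D} g'_j \xi_j$

$E[X] = C \sum_{i \in D} E[f'_i] \, E[g'_i]$

$\mathrm{Var}[X] = C^2 \left[ \sum_{i \in D} E[f'^2_i] \sum_{j \in D} E[g'^2_j] + 2 \sum_{i \in D} \sum_{j \in D, j \neq i} E[f'_i f'_j] \, E[g'_i g'_j] - \sum_{i \in D} \sum_{j \in D} E[f'_i] \, E[g'_i] \, E[f'_j] \, E[g'_j] \right]$

Var[sketch over samples] = Var[samples] + Var[sketch] + Var[interaction]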
48
Conclusions
• Data explosion
  – Cheap, high-capacity storage
  – Current processing technology is too expensive for the performance it provides
• Framework for online analytical processing
  – DBO system architecture
    • Embed randomization into data processing
    • Provide estimates and bounds at any time
  – Approximation methods
    • Sampling – most flexible
    • Sketches – single pass
    • Sketches over samples – fastest
49
Future Work
• Short term
  – Define & design query optimization for DBO
  – Extend DBO to other types of queries and with other approximation techniques (end-biased samples, histograms, …)
  – Generalize sketches to multiple relations
  – Find optimal amount of data to sketch
  – Fully integrate sketches into DBO system
• Medium term
  – Develop data aggregation & approximation techniques for other types of architectures
    • Multicore processors, GPUs
    • Distributed processing (Map-Reduce, Hadoop, …)
• Long term
  – Design & build scalable analytic processing system
    • Aggregation & approximation
50
Publications
• A. Dobra, C. Jermaine, F. Rusu, F. Xu – Turbo-Charging Estimate Convergence in DBO. In VLDB 2009.
• F. Rusu and A. Dobra – Sketching Sampled Data Streams. In ICDE 2009.
• F. Rusu et al. – The DBO Database System. In SIGMOD 2008 (demo).
• F. Rusu and A. Dobra – Sketches for Size of Join Estimation. In TODS, vol. 33, no. 3, 2008.
• F. Rusu and A. Dobra – Pseudo-Random Number Generation for Sketch-Based Estimations. In TODS, vol. 32, no. 2, 2007.
• F. Rusu and A. Dobra – Statistical Analysis of Sketch Estimators. In SIGMOD 2007.
• F. Rusu and A. Dobra – Fast Range-Summable Random Variables for Efficient Aggregate Estimation. In SIGMOD 2006.