The Impact of Duality on Data Synopsis Problems
Panagiotis Karras
KDD, San Jose, August 13th, 2007
work with Dimitris Sacharidis and Nikos Mamoulis
Introduction
• Data synopsis problems require the optimization of error under a bound on space.
• Classical approaches treat them in a direct manner, producing complicated solutions, and sometimes resorting to heuristics.
• Parameters involved have a monotonic relationship.
• Hence, an alternative approach is possible, based on the dual, error-bounded problems.
Outline
• Histograms.
• Restricted Haar Wavelet Synopses.
• Unrestricted Haar and Haar+ Synopses.
• Experiments.
• Conclusions.
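Histograms
• Approximate a data set $[d_1, d_2, \ldots, d_n]$ with B buckets $s_i = [b_i, e_i, v_i]$, so that a maximum-error metric is minimized.
• Classical solution: Jagadish et al. VLDB 1998; Guha et al. VLDB 2004; Guha VLDB 2005
  $E(i, b) = \min_{1 \le j < i} \max\{\, E(j, b-1),\ \mathrm{err}(j+1, i) \,\}$
  where $\mathrm{err}(j+1, i)$ is the error of a single bucket covering $d_{j+1}, \ldots, d_i$.
  Complexity: $O(n B \log^2 n)$
• Recent solutions: Buragohain et al. ICDE 2007: $O(n \log n \log U)$
  Guha and Shim TKDE 19(7) 2007: $O(n + B^2 \log^3 n)$
  For weighted error: $O(n \log n + B^2 \log^6 n)$
  Linear for $B \le \sqrt{n / \log^3 n}$; e.g., $n = 2^{30} = 1{,}073{,}741{,}824 \Rightarrow B \le 199$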
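To make the classical space-bounded recurrence concrete, here is a minimal Python sketch for the unweighted maximum absolute error. It is a naive evaluation of the recurrence, not the optimized algorithms cited on the slide; the names `bucket_error` and `dp_histogram_error` are hypothetical, and prefixes are indexed by length rather than by the slide's 1-based positions.

```python
from functools import lru_cache
from math import inf

def bucket_error(data, lo, hi):
    """Best maximum absolute error of ONE bucket over data[lo..hi] (inclusive)."""
    seg = data[lo:hi + 1]
    return (max(seg) - min(seg)) / 2

def dp_histogram_error(data, B):
    """Minimum maximum-absolute error of a histogram with at most B buckets.

    Naive evaluation of E(i, b) = min_j max(E(j, b-1), err(j+1, i)).
    """
    n = len(data)

    @lru_cache(maxsize=None)
    def E(i, b):
        # E(i, b): optimal error for the first i values using at most b buckets.
        if i == 0:
            return 0.0
        if b == 0:
            return inf
        # The last bucket covers data[j..i-1]; the prefix of length j gets b-1 buckets.
        return min(max(E(j, b - 1), bucket_error(data, j, i - 1))
                   for j in range(i))

    return E(n, B)

if __name__ == "__main__":
    print(dp_histogram_error([4, 5, 6, 2, 15, 17, 3, 6], 3))
```

Note how the table is indexed by the budget b: this dependence on B is what the dual, error-bounded formulation avoids.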
Histograms
• Solve the error-bounded problem.
  Example with maximum absolute error bound ε = 2:
  data:    4 5 6 2 | 15 17 | 3 6 | 9 12 …
  buckets: [ 4 ] [ 16 ] [ 4.5 ] [ …
• Generalized to any weighted maximum-error metric.
  Each value $d_i$ defines a tolerance interval $\left[ d_i - \frac{\varepsilon}{w_i},\ d_i + \frac{\varepsilon}{w_i} \right]$.
  A bucket is closed when the running intersection of the intervals becomes empty.
  Complexity: $O(n)$
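The single-pass, error-bounded construction can be sketched as follows. This is an illustrative Python version under the assumption that the weighted error of value d_i is w_i·|d_i − v|; the function name `error_bounded_histogram` and the choice of the interval midpoint as bucket value are assumptions of the sketch (any value in the running intersection would satisfy the bound).

```python
from math import inf

def error_bounded_histogram(data, eps, weights=None):
    """Greedy minimum-bucket histogram with weighted max error <= eps.

    Returns a list of (start_index, end_index, value) triples.
    """
    if weights is None:
        weights = [1.0] * len(data)
    buckets = []
    start = 0
    lo, hi = -inf, inf                         # running intersection of tolerance intervals
    for i, (d, w) in enumerate(zip(data, weights)):
        t_lo, t_hi = d - eps / w, d + eps / w  # tolerance interval of d_i
        new_lo, new_hi = max(lo, t_lo), min(hi, t_hi)
        if new_lo > new_hi:
            # Intersection became empty: close the current bucket, open a new one.
            buckets.append((start, i - 1, (lo + hi) / 2))
            start, lo, hi = i, t_lo, t_hi
        else:
            lo, hi = new_lo, new_hi
    if data:
        buckets.append((start, len(data) - 1, (lo + hi) / 2))
    return buckets

if __name__ == "__main__":
    for bucket in error_bounded_histogram([4, 5, 6, 2, 15, 17, 3, 6], eps=2):
        print(bucket)
```

On the slide's example prefix 4 5 6 2 15 17 3 6 with ε = 2, the sketch produces bucket values 4, 16 and 4.5.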
Histograms
• Apply to the space-bounded problem.
  Perform binary search in the domain of the error bound ε.
  Complexity: $O(n \log \varepsilon^*)$, independent of the number of buckets B.
  For an error bound ε requiring space $B' \le B$, with actual error $\varepsilon' \le \varepsilon$, run an optimality test:
  run the error-bounded algorithm under the constraint error $< \varepsilon'$ instead of error $\le \varepsilon$.
  If this requires space $\tilde{B} > B$, then the optimal solution has been reached.
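A minimal Python sketch of the binary search over the error bound, using the greedy error-bounded pass as a subroutine. To keep the search exact it assumes integer data and unweighted maximum absolute error, so that every candidate optimum is a multiple of 0.5; this replaces the slide's optimality test with a search over an integer grid, which is a simplification of this sketch. The function names are hypothetical.

```python
def count_buckets(data, k):
    """Buckets the greedy error-bounded pass needs for eps = k / 2.

    A bucket fits iff (max - min) of its values is <= 2 * eps = k.
    """
    count, mn, mx = 0, None, None
    for d in data:
        if mn is None:
            count, mn, mx = count + 1, d, d
            continue
        new_mn, new_mx = min(mn, d), max(mx, d)
        if new_mx - new_mn > k:
            count, mn, mx = count + 1, d, d     # close bucket, start a new one at d
        else:
            mn, mx = new_mn, new_mx
    return count

def space_bounded_error(data, B):
    """Smallest max-abs error achievable with at most B >= 1 buckets (integer data)."""
    lo, hi = 0, max(data) - min(data)           # k = max - min always fits in one bucket
    while lo < hi:
        mid = (lo + hi) // 2
        if count_buckets(data, mid) <= B:
            hi = mid                            # feasible: try a smaller error bound
        else:
            lo = mid + 1                        # infeasible: the error bound must grow
    return lo / 2

if __name__ == "__main__":
    print(space_bounded_error([4, 5, 6, 2, 15, 17, 3, 6], B=3))
```

The search is valid because the number of buckets required never increases as the error bound grows, which is the monotonic relationship the talk relies on.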
Restricted Haar Wavelet Synopses
• Select a subset of the Haar wavelet decomposition coefficients, so that a maximum-error metric is minimized.
[Figure: Haar error tree for the data 34 16 2 20 20 0 36 16, with coefficients 18, 0, 7, -8, 9, -9, 10, 10 and pairwise averages 25, 11, 10, 26 and 18, 18]
• Classical solution: Garofalakis and Kumar PODS 2004; Guha VLDB 2005
  $E(i, v, b) = \min_{b'} \left\{ \max\{ E(i_L, v, b'),\ E(i_R, v, b - b') \},\ \max\{ E(i_L, v + z_i, b'),\ E(i_R, v - z_i, b - b' - 1) \} \right\}$
  Complexity: $O(n^2)$
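As a concrete illustration of the decomposition these synopses draw their coefficients from, here is a short Python sketch of the unnormalized Haar transform (pairwise averages and half-differences) and its inverse. The error-tree array layout (index 0 for the overall average, node i with children 2i and 2i+1) and the function names are assumptions of this sketch.

```python
def haar_decompose(data):
    """Unnormalized Haar decomposition; len(data) must be a power of two."""
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    coeffs = [0.0] * n
    avgs = list(data)
    while len(avgs) > 1:
        half = len(avgs) // 2
        # Detail coefficient: half the difference of each left/right pair.
        coeffs[half:2 * half] = [(avgs[2 * k] - avgs[2 * k + 1]) / 2 for k in range(half)]
        # Pairwise averages feed the next (coarser) level.
        avgs = [(avgs[2 * k] + avgs[2 * k + 1]) / 2 for k in range(half)]
    coeffs[0] = avgs[0]                  # overall average
    return coeffs

def haar_reconstruct(coeffs):
    """Inverse transform: left child gets v + z, right child gets v - z."""
    vals = [coeffs[0]]
    size = 1
    while size < len(coeffs):
        details = coeffs[size:2 * size]
        vals = [x for v, z in zip(vals, details) for x in (v + z, v - z)]
        size *= 2
    return vals

if __name__ == "__main__":
    c = haar_decompose([34, 16, 2, 20, 20, 0, 36, 16])
    print(c)
    print(haar_reconstruct(c))
```

For the data 34 16 2 20 20 0 36 16 this yields the coefficients 18, 0, 7, -8, 9, -9, 10, 10.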
Restricted Haar Wavelet Synopses
• Solve the error-bounded problem. Muthukrishnan FSTTCS 2005
  $S(i, v) = \min\{\, S(i_L, v) + S(i_R, v),\ S(i_L, v + z_i) + S(i_R, v - z_i) + 1 \,\}$
  Local search within each of the $\frac{n}{\log n}$ subtrees in the bottom $\log\log n$ Haar tree levels.
  Complexity: $O\left(\frac{n^2}{\log n}\right)$
• Apply to the space-bounded problem.
  Complexity: $O\left(\frac{n^2}{\log n} \log \varepsilon^*\right)$
  No significant advantage.
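The error-bounded recurrence S(i, v) for restricted synopses can be sketched as a memoized recursion. This is a plain evaluation of the recurrence for maximum absolute error, not the subquadratic local-search algorithm cited on the slide; the error-tree layout (node i with children 2i and 2i+1, data values stored at positions n..2n-1) and the function names are assumptions of this sketch.

```python
from functools import lru_cache
from math import inf

def haar_decompose(data):
    """Unnormalized Haar coefficients in error-tree order (power-of-two length)."""
    n = len(data)
    coeffs, avgs = [0.0] * n, list(data)
    while len(avgs) > 1:
        half = len(avgs) // 2
        coeffs[half:2 * half] = [(avgs[2 * k] - avgs[2 * k + 1]) / 2 for k in range(half)]
        avgs = [(avgs[2 * k] + avgs[2 * k + 1]) / 2 for k in range(half)]
    coeffs[0] = avgs[0]
    return coeffs

def min_coeffs_restricted(data, eps):
    """Fewest retained Haar coefficients with max absolute error <= eps."""
    n = len(data)
    c = haar_decompose(data)

    @lru_cache(maxsize=None)
    def S(i, v):
        # v is the value reconstructed so far on the path from the root to node i.
        if i >= n:                                    # position of data value d[i - n]
            return 0 if abs(data[i - n] - v) <= eps else inf
        drop = S(2 * i, v) + S(2 * i + 1, v)          # coefficient z_i discarded
        keep = 1 + S(2 * i, v + c[i]) + S(2 * i + 1, v - c[i])  # z_i retained
        return min(drop, keep)

    # The overall average c[0] is itself either dropped or retained.
    return min(S(1, 0.0), 1 + S(1, c[0]))

if __name__ == "__main__":
    print(min_coeffs_restricted([34, 16, 2, 20, 20, 0, 36, 16], eps=18))
```

The table is indexed only by node and incoming value, with no budget parameter; the space-bounded answer is then recovered by searching over the error bound.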
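Unrestricted Haar and Haar+ Synopses
• Assign arbitrary values to Haar/Haar+ coefficients, so that a maximum-error metric is minimized.
• Classical solutions: Guha and Harb KDD 2005, SODA 2006
  $E(i, v, b) = \min_{z \in S_{i,v},\ 0 \le b' \le b - (z \ne 0)} \max\{\, E(i_L, v + z, b'),\ E(i_R, v - z, b - b' - (z \ne 0)) \,\}$
  Haar+ (Karras and Mamoulis ICDE 2007):
  [Figure: Haar+ tree over data d0, d1, d2, d3, with root coefficient c0 and coefficient triads C1 = (c1, c2, c3), C2 = (c4, c5, c6), C3 = (c7, c8, c9); edges labelled + and -]
  $E(i, v, b) = \min \begin{cases}
    \min_{z_h \in S^H_{i,v},\ 0 \le b' \le b - (z_h \ne 0)} \max\{\, E(i_L, v + z_h, b'),\ E(i_R, v - z_h, b - b' - (z_h \ne 0)) \,\} \\
    \min_{z_l \in S^L_{i,v},\ 0 \le b' \le b - (z_l \ne 0)} \max\{\, E(i_L, v + z_l, b'),\ E(i_R, v, b - b' - (z_l \ne 0)) \,\} \\
    \min_{z_r \in S^R_{i,v},\ 0 \le b' \le b - (z_r \ne 0)} \max\{\, E(i_L, v, b'),\ E(i_R, v + z_r, b - b' - (z_r \ne 0)) \,\}
  \end{cases}$
  Complexity: time $O(R^2 n \log n \log^2 B)$, space $O\left(R B \log\frac{n}{B} + n\right)$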
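Unrestricted Haar and Haar+ Synopses
• Solve the error-bounded problem.
  Unrestricted Haar:
  $S(i, v) = \min_{z \in S_{i,v}} \{\, S(i_L, v + z) + S(i_R, v - z) + (z \ne 0) \,\}$
  Haar+:
  $S(i, v) = \min \begin{cases}
    \min_{z_h \in S^H_{i,v}} \{\, S(i_L, v + z_h) + S(i_R, v - z_h) + (z_h \ne 0) \,\} \\
    \min_{z_l \in S^L_{i,v}} \{\, S(i_L, v + z_l) + S(i_R, v) + (z_l \ne 0) \,\} \\
    \min_{z_r \in S^R_{i,v}} \{\, S(i_L, v) + S(i_R, v + z_r) + (z_r \ne 0) \,\}
  \end{cases}$
  Complexity: time $O(R^2 n \log n)$, space $O(R \log n + n)$
• Apply to the space-bounded problem.
  Complexity: $O(R^2 n \log n \log \varepsilon^*)$
  Significant time & space advantage.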
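The unrestricted error-bounded recurrence can be illustrated with a small bottom-up Python sketch. It assumes that the data and the error bound are integers expressed in units of a quantization step δ, so candidate values lie on an integer grid, and it restricts incoming values to [min d − ε, max d + ε]; that restriction loses nothing here, because the incoming value of a subtree equals the average of its reconstructed values. The sketch scans the whole grid at every node and does not implement the search-space delimitation of the cited algorithms; the function name is hypothetical.

```python
from math import inf

def min_coeffs_unrestricted(data, eps):
    """Fewest nonzero unrestricted Haar coefficients with max abs error <= eps.

    data: integers (units of delta), length a power of two; eps: non-negative integer.
    """
    lo, hi = min(data) - eps, max(data) + eps
    size = hi - lo + 1                       # grid index k represents value lo + k
    # Leaf tables: an incoming value v is acceptable iff |d - v| <= eps.
    level = [[0 if abs(d - (lo + k)) <= eps else inf for k in range(size)]
             for d in data]
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level), 2):
            L, R = level[j], level[j + 1]
            table = []
            for k in range(size):
                best = L[k] + R[k]           # z = 0: no coefficient stored at this node
                for z in range(1, min(k, size - 1 - k) + 1):
                    # z != 0 costs one coefficient; try both signs of z.
                    best = min(best, 1 + L[k + z] + R[k - z], 1 + L[k - z] + R[k + z])
                table.append(best)
            nxt.append(table)
        level = nxt
    root = level[0]
    # The overall-average coefficient z0 sets the root's incoming value (free if z0 = 0).
    return min(root[k] + (lo + k != 0) for k in range(size))

if __name__ == "__main__":
    print(min_coeffs_unrestricted([34, 16, 2, 20, 20, 0, 36, 16], eps=18))
```

Only one table per tree level is alive at a time in a true post-order traversal; this sketch keeps a whole level in memory for brevity.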
Experiments: Histograms, Time vs. n
Experiments: Histograms, Time vs. B
Experiments: Haar Wavelets, Time vs. n
Experiments: Haar Wavelets, Time vs. B
Experiments: Haar+, Time vs. n
Experiments: Haar+, Time vs. B
Conclusions
• Offline space-bounded data synopsis problems are more easily solvable through their error-bounded counterparts.
• Complexities lower & independent of synopsis space.
• Dual-problem-based algorithms are simpler, more scalable, more general, more elegant, and more memory-parsimonious than the direct ones.
• Future: application on other data representation models, multi-measure, multi-dimensional data.
Related Work
• H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998
• S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004
• M. Garofalakis and A. Kumar. Wavelet synopses for general error metrics. TODS, 30(4):888–928, 2005 (also PODS 2004).
• S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005
• S. Guha and B. Harb. Wavelet Synopses for Data Streams: Minimizing Non-Euclidean Error. KDD 2005
• S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006
• S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005
• P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. ICDE 2007
Thank you! Questions?
More discussion at Board 17 this evening
Compact Hierarchical Histograms
• Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.
• Heuristic solutions: Reiss et al. VLDB 2006
BnnBO loglog2
[Figure: CHH tree with nodes c0; c1, c2; c3, c4, c5, c6 over data d0, d1, d2, d3]
nnBO 2log
time
space
The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node.
[Reiss et al. VLDB 2006]
Compact Hierarchical Histograms
• Solve the error-bounded problem. Next-to-bottom level case
dcbavdcba
dcbavdcbadcbavdcba
dcbavdcba
viS
,,,,
,,,,,,,,
,,,,
,2
,1
,0
,
1,,, ** ii ssviSv
[Figure: next-to-bottom node c_i with children c_2i and c_2i+1 over data values a, b, c, d; cases for the incoming value v and assigned values z, where a child value may be 0]
Compact Hierarchical Histograms
• Solve the error-bounded problem. General, recursive case
0000
00000000
0000
**
**
**
,2
,1
,
,
RLRL
RLRLRLRL
RLRL
v
vv
v
ss
ss
ss
viS
RL
RL
RL
ii
ii
ii
*0
*0 ,,
RL iRiL sviSvsviSv RL
Complexity: nnOn
On 2log
0 1log
22
time
space
• Apply to the space-bounded problem.
Complexity: Polynomially Tractable
nOOn
log
02
nnnO logloglog *2