Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes
Xiaohui Yu (1), Calisto Zuzarte (2), Ken Sevcik (1)
(1) University of Toronto  (2) IBM Toronto
November 3, 2005 CIKM 2005 2
Distinct value combinations

Country  City       Hotel Name
Germany  Bremen     Hilton
Germany  Bremen     Best Western
Germany  Frankfurt  InterCity
Canada   Toronto    Four Seasons
Canada   Toronto    Intercontinental

- 3 distinct (Country, City) value combinations: (Germany, Bremen), (Germany, Frankfurt), (Canada, Toronto)
- COLSCARD (COlumn Set CARDinality) = 3
- The problem: estimating COLSCARD for a given set of attributes
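To make the definition concrete, the following minimal sketch counts COLSCARD exactly on the example table. The `HOTELS` list and the `colscard` helper are illustrative names, not part of the talk; an estimator is needed precisely because a real system cannot afford this full scan.

```python
# Hypothetical in-memory copy of the example table: (Country, City, Hotel Name).
HOTELS = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def colscard(rows, columns):
    """Exact COLSCARD: number of distinct value combinations over `columns`."""
    return len({tuple(row[c] for c in columns) for row in rows})


country_city = colscard(HOTELS, (0, 1))
print(country_city)  # 3
```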
Motivation
- Cardinality estimation for query optimization, e.g.,
  - Estimating the size of the projection $\pi_{country, city}(Hotel)$
  - Estimating the size of the aggregation
      SELECT sales_date, sales_person, SUM(sales_quantity) AS unit_sold
      FROM sales
      GROUP BY sales_date, sales_person
- Approximate query answering, e.g., COUNT queries
Roadmap
- Related work
- Estimation with known marginal distributions
  - Upper/lower bounds
  - An estimator
- Estimation with histograms
- Experimental results
- Conclusions
Related work
- Previous work has focused on the case of a single attribute: [HÖT88], [HÖT89], [HNSS'95], [HS'98], [CCMN'00]
  - A sampling approach is used.
    - Estimation through sampling is difficult [CCMN'00]
  - No existing statistical information is exploited.
Our solution
- Considering multiple attributes
- Utilizing existing statistics on individual attributes
  - Readily available in most database systems
  - Does not require access to the data
- Granularity of statistics
  - Exact marginal frequency distributions
  - Approximate distributions: histograms etc.
Estimation with known marginals
- Number of distinct values in attribute $A_i$: $d_i$ $(i = 1, 2, \ldots, m)$
- Frequency vector: $\mathbf{f}_i = (f_{i1}, f_{i2}, \ldots, f_{id_i})$, with $\sum_{j=1}^{d_i} f_{ij} = 1$

Country  City       Hotel Name
Germany  Bremen     Hilton
Germany  Bremen     Best Western
Germany  Frankfurt  InterCity
Canada   Toronto    Four Seasons
Canada   Toronto    Intercontinental

$\mathbf{f}_1 = (0.6, 0.4)$   $\mathbf{f}_2 = (0.4, 0.2, 0.4)$   $\mathbf{f}_3 = (0.2, 0.2, 0.2, 0.2, 0.2)$
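The marginal frequency vectors quoted for the hotel table can be recomputed with a short sketch like the one below. `ROWS` and `marginal_frequencies` are hypothetical names introduced here; a database system would read these statistics from its catalog rather than scan the data.

```python
from collections import Counter

# Hypothetical in-memory copy of the example table: (Country, City, Hotel Name).
ROWS = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def marginal_frequencies(rows, column):
    """Relative frequency of each distinct value in one column, in first-seen order."""
    counts = Counter(row[column] for row in rows)
    return [count / len(rows) for count in counts.values()]


f1 = marginal_frequencies(ROWS, 0)  # Country
f2 = marginal_frequencies(ROWS, 1)  # City
f3 = marginal_frequencies(ROWS, 2)  # Hotel Name
print(f1, f2, f3)
```

Each vector sums to 1, and its length is the distinct count $d_i$ of that attribute.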
The naïve estimator
- COLSCARD $= \min\left(\prod_{i=1}^{m} d_i,\ N\right)$
  - $\prod_{i=1}^{m} d_i$: number of possible value combinations
  - $d_i$: the number of distinct values in attribute $A_i$
  - $N$: sanity bound, since COLSCARD cannot be greater than the table size
- The problem: some value combinations with low occurrence probabilities may not appear in the table!
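As a small illustration of the overestimate, the sketch below applies the naïve formula to the (Country, City) columns of the five-row hotel table, where $d = (2, 3)$ and $N = 5$. The function name `naive_colscard` is an assumption made for this example.

```python
from math import prod


def naive_colscard(distinct_counts, n_tuples):
    """Naive estimate: product of per-attribute distinct counts, capped at the table size."""
    return min(prod(distinct_counts), n_tuples)


# (Country, City) in the hotel table: 2 countries, 3 cities, 5 tuples.
estimate = naive_colscard([2, 3], 5)
print(estimate)  # 5, although only 3 distinct (Country, City) pairs occur
```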
Upper/Lower bounds
- Trivial bounds
  - Upper bound: $\min\left(\prod_{i=1}^{m} d_i,\ N\right)$ (the naïve estimator)
  - Lower bound: $\max(d_1, d_2, \ldots, d_m)$
- Tighter bounds?
  - In the case of two attributes, tighter bounds are available.
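The trivial bounds translate directly into code. In this sketch the function name `trivial_bounds` and the input values (two attributes with 3 distinct values each, 10 tuples) are chosen for illustration only.

```python
from math import prod


def trivial_bounds(distinct_counts, n_tuples):
    """Return (lower, upper) = (max d_i, min(prod d_i, N))."""
    lower = max(distinct_counts)
    upper = min(prod(distinct_counts), n_tuples)
    return lower, upper


print(trivial_bounds([3, 3], 10))  # (3, 9)
```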
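Tighter bounds
- N = 10
- A1 (value: freq): a: 1, b: 1, c: 8
- A2 (value: freq): d: 4, e: 4, f: 2
- Naïve bounds: [3, 9]
- The A1 value with frequency 8 pairs with 2 to 3 distinct A2 values ([2, 3]), since no A2 value occurs more than 4 times; each A1 value with frequency 1 pairs with exactly 1
- Lower bound = 2+1+1 = 4
- Upper bound = 3+1+1 = 5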
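The counting argument behind this example can be sketched as follows. This is one reading of the slide, not necessarily the exact bounds derived in the paper: for each value of one attribute, count the fewest and the most distinct partner values it could have, then sum. The function names are hypothetical, and both count lists are assumed to sum to the same $N$.

```python
from itertools import accumulate


def one_sided_bounds(counts_a, counts_b):
    """Sum, over each value of A, the fewest and most distinct B values it can pair with."""
    # prefix[k-1] is the largest number of tuples that k distinct B values can cover.
    prefix = list(accumulate(sorted(counts_b, reverse=True)))
    lower = upper = 0
    for c in counts_a:
        # Fewest partners: smallest k whose k largest B counts cover all c tuples.
        lower += next(k for k, covered in enumerate(prefix, start=1) if covered >= c)
        # Most partners: a new B value per tuple, capped by the number of B values.
        upper += min(c, len(counts_b))
    return lower, upper


def two_attribute_bounds(counts_a, counts_b):
    """Combine the bounds obtained from each attribute's point of view."""
    lo_ab, hi_ab = one_sided_bounds(counts_a, counts_b)
    lo_ba, hi_ba = one_sided_bounds(counts_b, counts_a)
    return max(lo_ab, lo_ba), min(hi_ab, hi_ba)


bounds = two_attribute_bounds([1, 1, 8], [4, 4, 2])
print(bounds)  # (4, 5)
```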
Expected number of combinations
- Assumptions
  1. The data distributions of individual columns are independent
  2. The occurrence of each combination in the table is independent
- $\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \cdots \otimes \mathbf{f}_m$
  - Each element of $\mathbf{f}$ represents the frequency of a specific value combination.
  - An estimate of the probability of occurrence
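As a worked instance, take the Country and City marginals of the hotel table, $(0.6, 0.4)$ and $(0.4, 0.2, 0.4)$. Assuming the product of frequency vectors is the Kronecker product (the twelve-element vector in the later example is consistent with that reading), each combination's frequency is the product of its marginal frequencies:

```latex
\mathbf{f}_1 \otimes \mathbf{f}_2
  = (0.6 \cdot 0.4,\ 0.6 \cdot 0.2,\ 0.6 \cdot 0.4,\ 0.4 \cdot 0.4,\ 0.4 \cdot 0.2,\ 0.4 \cdot 0.4)
  = (0.24,\ 0.12,\ 0.24,\ 0.16,\ 0.08,\ 0.16)
```

The six entries sum to 1, one per possible (Country, City) combination.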
Estimator
- The probability of the i-th combination not appearing in a particular tuple is $(1 - f_i)$
- The probability of the i-th combination not appearing in the table (of size $N$) is $(1 - f_i)^N$
- The expected number of value combinations is
  $E[\mathrm{COLSCARD}] = M - \sum_{i=1}^{M} (1 - f_i)^N$, where $M = \prod_{j=1}^{m} d_j$
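The step from the per-combination probabilities to the expectation can be spelled out with indicator variables. The symbol $X_i$ is notation introduced here, not taken from the slides: $X_i = 1$ if the $i$-th combination appears at least once in the table and $0$ otherwise.

```latex
E[\mathrm{COLSCARD}]
  = E\Big[\sum_{i=1}^{M} X_i\Big]
  = \sum_{i=1}^{M} \Pr(X_i = 1)
  = \sum_{i=1}^{M} \big(1 - (1 - f_i)^N\big)
  = M - \sum_{i=1}^{M} (1 - f_i)^N
```

Linearity of expectation is all that is needed for the second equality; the independence assumptions enter through $\Pr(X_i = 0) = (1 - f_i)^N$.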
Example revisited
- Estimate the COLSCARD for attribute set (A1, A2, A3), given
  $\mathbf{f}_1 = (0.1, 0.3, 0.6)$, $\mathbf{f}_2 = (0.01, 0.99)$, $\mathbf{f}_3 = (0.05, 0.95)$, $N = 100$
- $\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \mathbf{f}_3$ = (0.00005, 0.00095, 0.00495, 0.09405, 0.00015, 0.00285, 0.01485, 0.28215, 0.00030, 0.00570, 0.02970, 0.56430)
- Naïve estimate: 3*2*2 = 12
- New estimate: 5.94
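The example's numbers can be checked with a minimal sketch of the estimator. The function name `expected_colscard` is an assumption; the joint frequencies are formed as products of marginal frequencies, following the independence assumption stated earlier.

```python
from itertools import product
from math import prod


def expected_colscard(marginals, n_tuples):
    """E[COLSCARD] = M - sum_i (1 - f_i)^N, with f_i built from the marginals."""
    # Joint frequency of each value combination: product of its marginal frequencies.
    joint = [prod(combo) for combo in product(*marginals)]
    # len(joint) is M; subtract the expected number of combinations that never appear.
    return len(joint) - sum((1.0 - f) ** n_tuples for f in joint)


marginals = [(0.1, 0.3, 0.6), (0.01, 0.99), (0.05, 0.95)]
estimate = expected_colscard(marginals, 100)
print(round(estimate, 2))  # 5.94
```

Enumerating all $M$ combinations is fine for a sketch, but $M$ grows multiplicatively with the number of attributes.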
Estimation with histograms
- Histograms exist on individual attributes
- Two classes of histograms
  - Partition-based
  - End-biased
- Marginals can be (approximately) reconstructed from histograms
- Optimal histograms in each class?
Optimal histograms
- Minimizing the error incurred by histograms
  - $ERR = |EST_{hist} - EST_{exact}|$
- Partition-based histograms
  - A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used.
Optimal end-biased histograms
- An end-biased histogram with B buckets stores
  - The exact frequencies of B-1 attribute values
  - The average of the remaining values
- Which B-1 values to store exactly?
- Most widely used end-biased histograms store the frequencies of the most frequent values
  - Not always optimal for COLSCARD estimation!
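A minimal sketch of the data structure described above follows. The tuple layout returned by `build_end_biased` and both function names are assumptions for illustration; real systems store histograms in their own catalog formats.

```python
def build_end_biased(freqs, stored_indices):
    """Keep the frequencies at `stored_indices` exactly; summarize the rest by their average."""
    stored = {i: freqs[i] for i in stored_indices}
    rest = [f for i, f in enumerate(freqs) if i not in stored]
    average = sum(rest) / len(rest) if rest else 0.0
    return stored, average, len(freqs)


def reconstruct(histogram):
    """Approximate marginal: exact where stored, the bucket average elsewhere."""
    stored, average, n_values = histogram
    return [stored.get(i, average) for i in range(n_values)]


freqs = [0.01, 0.02, 0.03, 0.94]
# B = 2 buckets: store B - 1 = 1 value exactly, here the most frequent one (index 3).
approx = reconstruct(build_end_biased(freqs, [3]))
print(approx)
```

The reconstructed vector still sums to 1, because the averaged bucket preserves the total frequency of the values it replaces.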
Example
- Attributes (A1, A2), N = 10
- $\mathbf{f}_1 = (0.1, 0.9)$
- Case 1: $\mathbf{f}_2 = (0.01, 0.02, 0.03, 0.94)$
- Case 2: $\mathbf{f}'_2 = (0.01, 0.29, 0.31, 0.39)$
- Choose 1 frequency to store exactly

Error table
Index of the frequency stored   1     2     3     4
$\mathbf{f}_2$                  1.68  2.01  2.17  0.15
$\mathbf{f}'_2$                 0.01  1.10  1.09  1.02
Optimal end-biased histograms
- Exhaustive search takes time proportional to $C_{d_j}^{B-1}$
- We prove that the optimal choice is one of the following
  - Most frequent values
  - Least frequent values
  - A combination of most frequent and least frequent values
- Only need to search both ends
  - Cost is linear in B, independent of $d_j$!
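The both-ends search can be sketched as below, under two stated assumptions: the histogram is reconstructed by giving every unstored value the average of the unstored frequencies, and the error is the absolute difference between the estimates from the approximate and exact marginals. All function names are hypothetical. Only the B candidate sets are tried; each one is evaluated from scratch here, which is a shortcut rather than the paper's algorithm, and the absolute error values depend on reconstruction details the slides do not give.

```python
from itertools import product
from math import prod


def expected_colscard(marginals, n_tuples):
    """E[COLSCARD] = M - sum_i (1 - f_i)^N over all value combinations."""
    joint = [prod(combo) for combo in product(*marginals)]
    return len(joint) - sum((1.0 - f) ** n_tuples for f in joint)


def end_biased(freqs, stored):
    """Exact frequency for stored indices, average of the rest elsewhere."""
    rest = [f for i, f in enumerate(freqs) if i not in stored]
    avg = sum(rest) / len(rest) if rest else 0.0
    return [f if i in stored else avg for i, f in enumerate(freqs)]


def best_both_ends(target, others, n_buckets, n_tuples):
    """Try k most frequent plus (B-1-k) least frequent values, for k = 0..B-1."""
    exact = expected_colscard([target] + others, n_tuples)
    order = sorted(range(len(target)), key=lambda i: target[i])  # ascending frequency
    keep = n_buckets - 1  # assumed to be at most len(target)
    best = None
    for k in range(keep + 1):
        stored = set(order[len(order) - k:]) | set(order[:keep - k])
        approx = end_biased(target, stored)
        err = abs(expected_colscard([approx] + others, n_tuples) - exact)
        if best is None or err < best[0]:
            best = (err, sorted(stored))
    return best


f1 = [0.1, 0.9]
case1 = best_both_ends([0.01, 0.02, 0.03, 0.94], [f1], n_buckets=2, n_tuples=10)
case2 = best_both_ends([0.01, 0.29, 0.31, 0.39], [f1], n_buckets=2, n_tuples=10)
print(case1[1], case2[1])  # 0-based indices of the values stored exactly
```

With one stored value, the sketch keeps the most frequent value in case 1 and the least frequent value in case 2, which agrees with where the smallest entries sit in the error table.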
Experiments – Data sets
- Synthetic data
  - Skew: Zipfian parameter z = 0 (uniform) to 4 (highly skewed)
  - Number of tuples: 10K to 1M
- Real data
  - Cover Type: 581,012 tuples, 10 attributes
  - Census Income: 32,561 tuples, 14 attributes
- Error measure: ratio error
  - ERR = max{true/est - 1, est/true - 1}
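The error measure and the synthetic skew can be sketched as follows. `ratio_error` follows the formula on the slide; `zipf_frequencies` assumes the common parameterization in which the frequency of the value at rank $k$ is proportional to $1/k^z$, which the slides do not spell out.

```python
def ratio_error(true_value, estimate):
    """ERR = max(true/est - 1, est/true - 1); 0 means a perfect estimate."""
    return max(true_value / estimate - 1.0, estimate / true_value - 1.0)


def zipf_frequencies(n_values, z):
    """Frequencies proportional to 1 / rank**z, normalized to sum to 1 (z = 0 is uniform)."""
    weights = [1.0 / rank ** z for rank in range(1, n_values + 1)]
    total = sum(weights)
    return [w / total for w in weights]


print(ratio_error(100, 50))     # 1.0: the estimate is off by a factor of 2
print(zipf_frequencies(4, 0))   # [0.25, 0.25, 0.25, 0.25]
```

The ratio error is symmetric, so overestimating and underestimating by the same factor are penalized equally.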
Effect of data skew
[Bar chart: ERR (axis 0 to 9) for each skew setting; N = 100K, d_i = 1K]

                     z1=0, z2=0   z1=0, z2=2   z1=0, z2=4   z1=4, z2=4
Proposed estimator   0.000237     0.000933     0.000982     0.0654
Naive estimator      0.0516       6.5171       5.9423       8.4921
Effect of number of tuples
[Line chart: ERR (axis 0 to 0.14) versus number of tuples N (1,000 to 1,000,000); one series each for z = 0, z = 2, z = 4]
Results on real data
[Pie charts: number of attribute pairs per ERR range; legend: ERR≤0.05, 0.05<ERR≤0.1, 0.1<ERR≤0.5, 0.5<ERR≤1, ERR>1]
(a) Cover Type, 45 pairs; slice sizes: 31, 4, 3, 5, 2
(b) Census Income, 91 pairs; slice sizes: 59, 19, 10, 2, 1
Accuracy of end-biased histograms
[Chart: ERR (axis 0 to 0.3) versus number of buckets (10, 20, 30, 50)]
Results on the "capital-gain" attribute of the Census Income data set
Conclusions
- Utilizing existing knowledge maintained in database systems
- Proposed upper/lower bounds as well as an estimator
- Considered two cases
  - Exact marginal frequencies
  - Histograms: optimal histograms
- Experimental results show the effectiveness of the proposed method