Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes

Xiaohui Yu (1), Calisto Zuzarte (2), Ken Sevcik (1)
(1) University of Toronto
(2) IBM Toronto Lab
xhyu@cs.toronto.edu

November 3, 2005, CIKM 2005

Distinct value combinations

  Country   City       Hotel Name
  Germany   Bremen     Hilton
  Germany   Bremen     Best Western
  Germany   Frankfurt  InterCity
  Canada    Toronto    Four Seasons
  Canada    Toronto    Intercontinental

3 distinct value combinations of (Country, City)

COLSCARD (COlumn Set CARDinality) = 3
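As a point of reference, the exact COLSCARD can be obtained by scanning the table and collecting the distinct tuples of the chosen attributes. The sketch below does this for the hotel example; the names `hotels`, `COLUMNS`, and `colscard` are illustrative and not from the talk.

```python
# Rows of the hotel example table.
hotels = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]
COLUMNS = ("Country", "City", "Hotel Name")


def colscard(rows, attrs):
    """Exact number of distinct value combinations for the given attributes."""
    idx = [COLUMNS.index(a) for a in attrs]
    return len({tuple(row[i] for i in idx) for row in rows})


print(colscard(hotels, ("Country", "City")))  # 3
```

This exact count needs a full pass over the data, which is what the estimation approach in the rest of the talk tries to avoid.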

The problem: estimating COLSCARD for a given set of attributes

Motivation
  Cardinality estimation for query optimization, e.g.,
    Estimating the size of $\pi_{country, city}(Hotel)$
    Estimating the size of the aggregation
      SELECT sales_date, sales_person, SUM(sales_quantity) AS unit_sold
      FROM sales
      GROUP BY sales_date, sales_person
  Approximate query answering, e.g., COUNT queries

Roadmap
  Related work
  Estimation with known marginal distributions
    Upper/lower bounds
    An estimator
  Estimation with histograms
  Experimental results
  Conclusions

Related work
  Previous work has focused on the case of a single attribute: [HÖT88], [HÖT89], [HNSS’95], [HS’98], [CCMN’00]
    A sampling approach is used.
    Estimation through sampling is difficult [CCMN’00].
    No existing statistical information is exploited.

Our solution
  Considering multiple attributes
  Utilizing existing statistics on individual attributes
    Readily available in most database systems
    Does not require access to the data
  Granularity of statistics
    Exact marginal frequency distributions
    Approximate distributions: histograms etc.

Estimation with known marginals
  Number of distinct values in attribute $A_i$: $d_i$ $(i = 1, 2, \ldots, m)$
  Frequency vector: $\mathbf{f}_i = (f_{i1}, f_{i2}, \ldots, f_{id_i})$, with $\sum_{j=1}^{d_i} f_{ij} = 1$

  Country   City       Hotel Name
  Germany   Bremen     Hilton
  Germany   Bremen     Best Western
  Germany   Frankfurt  InterCity
  Canada    Toronto    Four Seasons
  Canada    Toronto    Intercontinental

  $\mathbf{f}_1 = (0.6, 0.4)$
  $\mathbf{f}_2 = (0.4, 0.2, 0.4)$
  $\mathbf{f}_3 = (0.2, 0.2, 0.2, 0.2, 0.2)$
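The marginal frequency vectors can be derived from per-column value counts. The following is a minimal sketch for the hotel example; `marginal` is a hypothetical helper, and a real system would read these statistics from its catalog rather than scanning the rows.

```python
from collections import Counter

hotels = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def marginal(rows, col):
    """Relative frequency of each distinct value in column `col`."""
    counts = Counter(row[col] for row in rows)
    n = len(rows)
    return {value: count / n for value, count in counts.items()}


for col, name in enumerate(("Country", "City", "Hotel Name")):
    print(name, list(marginal(hotels, col).values()))
```

The three printed vectors correspond to f_1, f_2, and f_3 for Country, City, and Hotel Name.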

The naïve estimator
  $\text{COLSCARD} = \min\left(\prod_{i=1}^{m} d_i,\; N\right)$
    $\prod_{i=1}^{m} d_i$: number of possible value combinations
    $d_i$: the number of distinct values in attribute $A_i$
    Sanity bound: COLSCARD cannot be greater than the table size $N$
  The problem: Some value combinations with low occurrence probabilities may not appear in the table!
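A short sketch of the naïve estimator, applied to the (Country, City) attributes of the hotel table, shows the overestimate. The function name `naive_colscard` is an assumption made for illustration.

```python
from math import prod


def naive_colscard(distinct_counts, n_rows):
    """Product of per-attribute distinct counts, capped at the table size."""
    return min(prod(distinct_counts), n_rows)


# Hotel table: 2 countries, 3 cities, 5 rows.
estimate = naive_colscard((2, 3), n_rows=5)
true_value = 3  # (Germany, Bremen), (Germany, Frankfurt), (Canada, Toronto)
print(estimate, true_value)  # 5 3
```

Combinations such as (Canada, Bremen) are counted by the product although they never occur in the table.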

Upper/Lower bounds
  Trivial bounds
    Upper bound: $\min\left(\prod_{i=1}^{m} d_i,\; N\right)$ (the naïve estimator)
    Lower bound: $\max(d_1, d_2, \ldots, d_m)$
  Tighter bounds?
    In the case of two attributes, tighter bounds are available.
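The trivial bounds need only the per-attribute distinct counts and the table size. A minimal sketch follows, with `trivial_bounds` as a hypothetical helper name.

```python
from math import prod


def trivial_bounds(distinct_counts, n_rows):
    """Return (lower, upper) bounds on COLSCARD.

    lower: every value of the attribute with the most distinct values
           must appear in at least one combination.
    upper: the naive estimator, i.e. all combinations, capped at n_rows.
    """
    lower = max(distinct_counts)
    upper = min(prod(distinct_counts), n_rows)
    return lower, upper


# Two attributes with three distinct values each, in a 10-row table.
print(trivial_bounds((3, 3), n_rows=10))  # (3, 9)
```

The tighter two-attribute bounds mentioned on the slide also use the frequencies and are not covered by this sketch.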

Tighter bounds
  N = 10
  [Figure: two value/frequency tables, one per attribute, with annotation [2, 3]]
  Naïve bounds: 3, 9
  Lower bound = 2+1+1 = 4
  Upper bound = 3+1+1 = 5

Expected number of combinations
  Assumptions
    1. The data distributions of individual columns are independent
    2. The occurrence of each combination in the table is independent
  $\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \cdots \otimes \mathbf{f}_m$
    Each element of $\mathbf{f}$ represents the frequency of a specific value combination.
    An estimate of the probability of occurrence

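Estimator
  The probability of the i-th combination not appearing in a particular tuple is $(1 - f_i)$, where $f_i$ is the i-th element of $\mathbf{f}$
  The probability of the i-th combination not appearing in the table (of size $N$) is $(1 - f_i)^N$
  The expected number of value combinations is
    $E[\text{COLSCARD}] = M - \sum_{i=1}^{M} (1 - f_i)^N$
    where $M = \prod_{i=1}^{m} d_i$ is the number of possible value combinations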
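One way to see where the expectation comes from is to use indicator variables and linearity of expectation. The indicator $X_i$ is notation introduced here for illustration and does not appear on the slides; $M$ denotes the number of possible value combinations and $f_i$ the estimated occurrence probability of the i-th combination.

```latex
% X_i = 1 if the i-th value combination occurs in at least one of the N tuples.
\begin{align*}
\Pr[X_i = 0] &= (1 - f_i)^N \\
E[\text{COLSCARD}] &= E\Big[\sum_{i=1}^{M} X_i\Big] = \sum_{i=1}^{M} \Pr[X_i = 1] \\
  &= \sum_{i=1}^{M} \big(1 - (1 - f_i)^N\big) = M - \sum_{i=1}^{M} (1 - f_i)^N
\end{align*}
```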

Example revisited
  Estimate the COLSCARD for attribute set $(A_1, A_2, A_3)$, given
    $\mathbf{f}_1 = (0.1, 0.3, 0.6)$, $\mathbf{f}_2 = (0.01, 0.99)$, $\mathbf{f}_3 = (0.05, 0.95)$, $N = 100$
  $\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \mathbf{f}_3$
    = (0.00005, 0.00095, 0.00495, 0.09405,
       0.00015, 0.00285, 0.01485, 0.28215,
       0.00030, 0.00570, 0.02970, 0.56430)
  Naïve estimate: 3*2*2 = 12
  New estimate: 5.94
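The example can be checked numerically. The sketch below builds the joint frequency vector as the product of the marginals (the independence assumption) and then sums the appearance probabilities. The function names are illustrative and not taken from the talk.

```python
from itertools import product
from math import prod


def joint_frequencies(marginals):
    """Kronecker product of the marginal vectors (last attribute varies fastest)."""
    return [prod(combo) for combo in product(*marginals)]


def expected_colscard(marginals, n_rows):
    """M minus the summed probabilities that a combination never appears."""
    f = joint_frequencies(marginals)
    return len(f) - sum((1.0 - fi) ** n_rows for fi in f)


marginals = [(0.1, 0.3, 0.6), (0.01, 0.99), (0.05, 0.95)]
n_rows = 100

naive = min(prod(len(m) for m in marginals), n_rows)
estimate = expected_colscard(marginals, n_rows)
print(naive)               # 12
print(round(estimate, 2))  # 5.94
```

Materializing the joint vector costs time and memory proportional to the product of the distinct counts, so this direct form is only practical when that product is moderate.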

Estimation with histograms
  Histograms exist on individual attributes
  Two classes of histograms
    Partition-based
    End-biased
  Marginals can be (approximately) reconstructed from histograms
  Optimal histograms in each class?

Optimal histograms
  Minimizing the error incurred by histograms
    $ERR = |EST_{hist} - EST_{exact}|$
  Partition-based histograms
    A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used.

Optimal end-biased histograms
  An end-biased histogram with B buckets stores
    The exact frequencies of B-1 attribute values
    The average frequency of the remaining values
  Which B-1 values to store exactly?
  Most widely used end-biased histograms store the frequencies of the most frequent values
    Not always optimal for COLSCARD estimation!

Example
  Attributes $(A_1, A_2)$
    $\mathbf{f}_1 = (0.1, 0.9)$
    Case 1: $\mathbf{f}_2 = (0.01, 0.02, 0.03, 0.94)$
    Case 2: $\mathbf{f}_2' = (0.01, 0.29, 0.31, 0.39)$
  Choose 1 frequency to store exactly

  Error table

    Index of the frequency stored   1      2      3      4
    Case 1                          1.68   2.01   2.17   0.15
    Case 2                          0.01   1.10   1.09   1.02
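The error table can be approximated with a short script: for each choice of stored index, rebuild the second marginal from the two-bucket end-biased histogram, re-run the expected-value estimator, and compare against the estimate from the exact marginals. The table size is not stated on the slide; `N_ROWS = 100` is an assumption, chosen because it yields errors close to the listed values. All function names are illustrative.

```python
from itertools import product
from math import prod

N_ROWS = 100  # assumption: the slide does not give the table size


def expected_colscard(marginals, n_rows):
    """Expected number of distinct combinations under independence."""
    joint = [prod(combo) for combo in product(*marginals)]
    return len(joint) - sum((1.0 - f) ** n_rows for f in joint)


def end_biased(freqs, stored):
    """Keep the frequencies at the `stored` indices, average all the others."""
    rest = [f for j, f in enumerate(freqs) if j not in stored]
    avg = sum(rest) / len(rest) if rest else 0.0
    return [f if j in stored else avg for j, f in enumerate(freqs)]


def error_row(f1, f2, n_rows=N_ROWS):
    """|EST_hist - EST_exact| for each single index of f2 stored exactly."""
    exact = expected_colscard([f1, f2], n_rows)
    return [
        abs(expected_colscard([f1, end_biased(f2, {j})], n_rows) - exact)
        for j in range(len(f2))
    ]


f1 = (0.1, 0.9)
case1 = error_row(f1, (0.01, 0.02, 0.03, 0.94))
case2 = error_row(f1, (0.01, 0.29, 0.31, 0.39))
print([round(e, 2) for e in case1])
print([round(e, 2) for e in case2])
```

Under this assumption, storing the most frequent value gives the smallest error in case 1, while storing the least frequent value gives the smallest error in case 2.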

Optimal end-biased histograms
  Exhaustive search takes time proportional to the number of ways to choose B-1 of the $d_j$ values
  We prove that the optimal choice is one of the following
    Most frequent values
    Least frequent values
    A combination of most frequent and least frequent values
  Only need to search both ends
    Cost is linear in B, independent of $d_j$!
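The restricted search space can be enumerated directly: each candidate takes some number of the most frequent values and fills the remaining exact slots with the least frequent ones, giving B candidates in total. The slides do not spell out the search procedure, so the sketch below is one plausible reading; `both_ends_candidates` is a hypothetical name, and scoring each candidate with the estimation error is left to the caller.

```python
def both_ends_candidates(freqs, n_buckets):
    """Index sets that store B-1 values taken only from the two ends.

    Candidate k stores the k most frequent values and the (B-1-k)
    least frequent values, for k = 0 .. B-1.
    """
    n_exact = n_buckets - 1
    if not 0 <= n_exact <= len(freqs):
        raise ValueError("need 0 <= B-1 <= number of distinct values")
    order = sorted(range(len(freqs)), key=lambda j: freqs[j])  # ascending
    candidates = []
    for n_top in range(n_exact + 1):
        low = order[:n_exact - n_top]
        top = order[len(order) - n_top:]
        candidates.append(tuple(sorted(low + top)))
    return candidates


print(both_ends_candidates((0.01, 0.02, 0.03, 0.94), n_buckets=3))
# [(0, 1), (0, 3), (2, 3)]
```

The number of candidates depends only on B, whereas exhaustive search would have to score every subset of B-1 values.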

Experiments – Data sets
  Synthetic data
    Skew: Zipfian parameter z=0 (uniform) to 4 (highly skewed)
    Number of tuples: 10K to 1M
  Real data
    Cover Type: 581,012 tuples, 10 attributes
    Census Income: 32,561 tuples, 14 attributes
  Error measure: ratio error
    ERR = max{true/est - 1, est/true - 1}
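The ratio error treats over- and underestimates symmetrically and is zero for a perfect estimate. A minimal sketch, assuming both values are positive:

```python
def ratio_error(true_value, estimate):
    """max{true/est - 1, est/true - 1}; both arguments must be positive."""
    return max(true_value / estimate - 1.0, estimate / true_value - 1.0)


print(ratio_error(100, 50))   # 1.0 (underestimate by a factor of 2)
print(ratio_error(100, 200))  # 1.0 (overestimate by a factor of 2)
print(ratio_error(100, 100))  # 0.0
```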

Effect of data skew (N=100K)

                       z1=0, z2=0   z1=0, z2=2   z1=0, z2=4   z1=4, z2=4
  Proposed estimator   0.000237     0.000933     0.000982     0.0654
  Naive estimator      0.0516       6.5171       5.9423       8.4921

Effect of number of tuples
  [Figure: x-axis: number of tuples (1000, 10000, 100000, 1000000); series: z=0, z=2, z=4]

Results on real data
  [Figure: attribute pairs grouped by error range: ERR≤0.05, 0.05<ERR≤0.1, 0.1<ERR≤0.5, 0.5<ERR≤1, ERR>1]
  (a) Cover Type: 45 pairs
  (b) Census Income: 91 pairs

Accuracy of end-biased histograms
  [Figure: x-axis: number of buckets (10, 20, 30, 50)]
  Results on the “capital-gain” attribute of Census Income data set

Conclusions
  Utilizing existing knowledge maintained in database systems
  Proposed upper/lower bounds as well as an estimator
  Considered two cases
    Exact marginal frequencies
    Histograms: optimal histograms
  Experimental results show the effectiveness of the proposed method
