new sampling-based summary statisticsfor …adobra/approxqp/talk5.pdf“approximate but fast answers...


CAP 6930: Approximate Query Processing

New Sampling-Based Summary Statistics for Improving Approximate Query Answers

By

Gibbons and Matias

Presented by: Abhijit Pol



Outline …

- A Framework For Approximate Query Answering

- Approximate Answers Using Samples

- Concise Samples
  - Definition
  - Algorithms
  - Evaluation

- Counting Samples
  - Definition
  - Algorithms
  - Evaluation

- Hot List Queries




A Framework for AQA

[Figure: two diagrams. "A Traditional System": New Data, Queries and Responses flow to and from the Data Warehouse. "System Set-up for AQA": the same components plus an Approx. Answer Engine.]




What are the goals?

- Approximate but fast answers to queries, instead of accurate but slow answers.
  - The avg student grade with 95% confidence is 2.85 ± 0.2 (10 s)
  - The avg student grade is 2.9154444 (105 s)

- Orders of magnitude less time than it takes to compute an exact answer.

- Save on costly disk I/Os




Where does it fit?

- Scenarios in which an exact answer may not be required.
  - Drill-down query sequences in ad hoc data mining

- Getting an idea of the answer before committing to the full query load.
  - Provides feedback on how well-posed the query is

- In the query optimizer, to estimate plan costs.
  - The traditional role




How does it work?

- The engine maintains various summary statistics, referred to as synopsis data structures

- Synopses can be maintained by:
  - Observing the new data as it is loaded into the DW
  - Periodically returning to the DW to update information
  - Returning to the DW at query time

- The synopses are used to report an approximate answer, in either a continuous or a discrete manner.




How to evaluate it?

- Coverage: the range of queries for which approximate answers can be provided

- Response time: the time to provide an approximate answer

- Accuracy: the accuracy of the answer and the confidence in that accuracy

- Update time: the overhead of keeping the synopses up to date

- Footprint: the storage requirement for the synopses




Approximate Answers Using Samples

- A synopsis can be:
  - Sample based
  - Histogram based
  - Sketch based

- A sample-based synopsis can be maintained as follows:
  - Keep a counter for the number of tuples in the relation
  - If the relation has at most M tuples, store all of them
  - If the relation has more than M tuples, store a random sample of size M
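
The maintenance rule above can be sketched with classic reservoir sampling. This is a minimal illustration rather than code from the paper; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Keep a uniform random sample of at most M items from a stream.

    Hypothetical helper for illustration; returns (sample, tuple_count).
    """
    rng = rng or random.Random()
    sample = []
    count = 0  # counter for the number of tuples seen so far
    for item in stream:
        count += 1
        if count <= M:
            # The relation still fits within M tuples: store everything.
            sample.append(item)
        else:
            # Keep the new item with probability M / count by letting it
            # overwrite a uniformly chosen slot among the first `count` slots.
            slot = rng.randrange(count)
            if slot < M:
                sample[slot] = item
    return sample, count
```

Here the sample size and the footprint are both M once the relation has more than M tuples.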




What is the big idea in the paper?

- The goal is to develop effective synopses that capture important information about the data in a concise representation

- The paper introduces two new sampling-based summary statistics: concise samples and counting samples.

- It also presents new techniques for their fast incremental maintenance.




Quick definitions

- Sample size: if S is a sample drawn from N data points, the sample size |S| is the number of data points in S.

- Footprint: the footprint of a sample S is its storage requirement, measured in number of data points.

- For traditional sampling, e.g. reservoir sampling, Sample Size == Footprint

- For concise and counting sampling, Sample Size >= Footprint




Concise Samples

- Definition #1: A concise sample is a uniform random sample of the data set in which values appearing more than once in the sample are represented as a <value, count> pair.

- Represent C copies of the same value V as a <V, C> pair, thereby freeing up space for C - 2 additional sample points!




An Example

                    Traditional Sample         Concise Sample
Contents            1 3 6 3 7 7 1 7 3 1 7 7    <1,3> <3,3> 6 <7,5>
Sample size (SS)    12                         12
Footprint (FS)      12                         7

A concise sample can hold more sample points for the same footprint!




Definition #2

- Let $S = \{(V_1, C_1), \ldots, (V_j, C_j), V_{j+1}, \ldots, V_l\}$ be a concise sample. Then sample-size$(S) = l - j + \sum_{i=1}^{j} C_i$ and footprint$(S) = l + j$.

- A concise sample with at most m/2 distinct values has footprint at most m.

- Lemma #1: For any footprint m >= 2, there exist data sets for which the sample-size of a concise sample is n/m times larger than its footprint, where n is the size of the data set.
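
As a small worked check of these two formulas, the sketch below converts a traditional sample into concise form and computes its sample-size and footprint. The helper names (`to_concise`, `sample_size`, `footprint`) are made up for this illustration, and the input is the 12-value sample from the earlier example slide.

```python
from collections import Counter

def to_concise(sample):
    """Split sampled values into <value, count> pairs and singletons."""
    counts = Counter(sample)
    pairs = {v: c for v, c in counts.items() if c > 1}
    singletons = [v for v, c in counts.items() if c == 1]
    return pairs, singletons

def sample_size(pairs, singletons):
    # (l - j) singletons plus the sum of the j pair counts
    return len(singletons) + sum(pairs.values())

def footprint(pairs, singletons):
    # one slot per singleton, two slots (value and count) per pair: l + j
    return len(singletons) + 2 * len(pairs)

traditional = [1, 3, 6, 3, 7, 7, 1, 7, 3, 1, 7, 7]
pairs, singletons = to_concise(traditional)
print(pairs, singletons)                                   # pairs for 1, 3, 7; singleton 6
print(sample_size(pairs, singletons), footprint(pairs, singletons))  # 12 7
```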




Algorithm: Obtaining a sample

- Need a concise sample of footprint m from a relation R with n tuples residing on disk

For m times do: {
    Select a random tuple and extract R.A
}
Semi-sort the set of values to produce (value, count) pairs.
Sample until (all n data points are seen || footprint == m) {
    For each new value sampled, look it up in the current concise sample
    If present as a pair:
        increment the count of the pair
    If present as a singleton:
        convert the singleton to a (value, 2) pair
    Else:
        add a new singleton value
}
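
A minimal in-memory sketch of this offline procedure follows. It assumes the relation's attribute values fit in a Python list, simulates random tuple selection by shuffling indices, and folds the initial m draws into the main loop; the name `offline_concise_sample` is hypothetical.

```python
import random

def offline_concise_sample(values, m, rng=None):
    """Draw random tuples until all are seen or the footprint reaches m.

    Returns (pairs, singletons): a dict value -> count and a set of values.
    """
    rng = rng or random.Random()
    order = list(range(len(values)))
    rng.shuffle(order)  # visiting a random permutation = sampling without replacement
    pairs, singletons = {}, set()
    fp = 0  # current footprint
    for idx in order:
        if fp == m:
            break
        v = values[idx]
        if v in pairs:
            pairs[v] += 1            # present as a pair: footprint unchanged
        elif v in singletons:
            singletons.remove(v)     # singleton becomes a (value, 2) pair
            pairs[v] = 2
            fp += 1
        else:
            singletons.add(v)        # new singleton
            fp += 1
    return pairs, singletons
```

On skewed data most draws hit an existing pair, so the loop can absorb many sample points before the footprint reaches m.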




Algorithm: Obtaining a sample

- Refer to it as the offline/static algorithm

- Complexity of the algorithm: Θ(sample-size) disk accesses

- In contrast, the incremental (online) maintenance algorithm needs no disk accesses

- More generally, the online algorithm can be used to build a concise sample in one sequential pass over a relation




Algorithm: Incremental maintenance

- Goal: maintain a concise sample within a given footprint bound as new data is inserted

- Can Vitter's idea be used to insert new data?

- The insertion problem is more difficult here:
  - There, the sample size is known in advance
  - In a concise sample, the sample size depends on the data distribution
  - Any change in the data distribution must be reflected in the sampling frequency





Set entry threshold T = 1; let S be the current concise sample
For each new tuple ‘t’ do: {
    With Pr(1/T) do: {   // t is selected for the sample
        Based on a look-up of t.A in S, either:
            create a singleton && footprint++ ||
            create a pair && footprint++ ||
            increment the counter of a current pair
    }
    If footprint == pre-specified footprint then do: {
        Evict()
    }
}





Evict() {
    Choose a new threshold T’   // T’ > T
    For each sample point in S do: {
        With Pr(1 - T/T’) do: {   // i.e. the point is retained with Pr(T/T’)
            If the point is a singleton: evict it && footprint-- ||
            If the point is in a <v,2> pair: make it a singleton && footprint-- ||
            If the point is in a larger pair: decrement the counter
        }
    }
    // Note: E(|S|) after eviction = |S| * (T/T’)
    T = T’
    If footprint == pre-specified footprint then do: {
        Evict()
    }
    // Note: subsequent inserts to S are done with 1/T’, not 1/T
}
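
The insert and evict steps can be combined into one small class. The sketch below is an illustration under stated assumptions, not the paper's code: the class name `ConciseSample`, the 10% default raise, and the choice to evict once the footprint exceeds the bound (rather than when it equals it) are all choices made for this example. Each sample point is retained with probability T/T' when the threshold is raised.

```python
import random

class ConciseSample:
    """Hypothetical incremental concise sample with a footprint bound."""

    def __init__(self, max_footprint, raise_factor=1.1, rng=None):
        self.m = max_footprint
        self.raise_factor = raise_factor   # T' = raise_factor * T (assumed policy)
        self.rng = rng or random.Random()
        self.T = 1.0                       # entry threshold
        self.counts = {}                   # value -> count (count 1 = singleton)
        self.footprint = 0

    def insert(self, value):
        # Select the new tuple for the sample with probability 1/T.
        if self.rng.random() < 1.0 / self.T:
            c = self.counts.get(value, 0)
            self.counts[value] = c + 1
            if c < 2:
                # New singleton, or singleton turned into a pair.
                self.footprint += 1
        while self.footprint > self.m:
            self._evict()

    def _evict(self):
        new_T = self.T * self.raise_factor
        keep = self.T / new_T              # retention probability T/T'
        for value in list(self.counts):
            c = self.counts[value]
            kept = sum(self.rng.random() < keep for _ in range(c))
            # An entry occupies 0, 1 or 2 slots depending on its count.
            self.footprint -= min(c, 2) - min(kept, 2)
            if kept:
                self.counts[value] = kept
            else:
                del self.counts[value]
        self.T = new_T                     # later inserts use 1/T'

    def sample_size(self):
        return sum(self.counts.values())
```

A caller would create `ConciseSample(m)` once and call `insert` for every value loaded into the warehouse; no access to the base data is needed.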




Theorem #2

- For any sequence of insertions, the above algorithm maintains a concise sample.

- Proof:
  - Let T be the current threshold
  - We maintain the invariant that each tuple in R has been selected for the sample with probability 1/T
  - The crux of the proof is to show that this invariant is maintained when T is raised to T'
  - When we raise it, the for loop subjects every sample point in S to a coin flip and retains it with Pr(T/T')




Theorem #2

- Proof (continued): now for each tuple ‘t’ in R

  - If it was NOT in S before eviction:
    - Then a coin with heads probability 1/T was flipped and failed to come up heads for this tuple ‘t’
    - Thus the same probabilistic event would fail to come up heads with the new, stricter coin (1/T' < 1/T)

  - If it was in S before eviction:
    - Then a coin with heads probability 1/T was flipped and showed heads
    - The tuple is retained with probability T/T', and 1/T * T/T' = 1/T'. The result is that the tuple is in the sample with probability 1/T'. Thus the inductive invariant is indeed maintained.




How much to raise the threshold?

- Large raise:
  - Evicts more than is needed, giving a smaller sample-size
  - Evict() runs less frequently

- Small raise:
  - No problem with sample-size
  - The footprint is likely not to decrease, so Evict() runs more frequently because T has to be raised repeatedly

- Typical (experimental) value: 10%




Plugging in Vitter’s idea

- Don’t flip a coin for each insert; flip once to determine how many inserts can be skipped before the next insert into the sample

- Let X be the event of skipping exactly i elements:

  $\Pr[X] = \Pr[\text{skip } i \text{ elements in a row}] \cdot \Pr[\text{pick the } (i+1)\text{th}] = (1 - \Pr[\text{pick an element}])^{i} \cdot \Pr[\text{pick an element}] = (1 - 1/T)^{i} \cdot 1/T$

- As T gets large we save on the number of coin flips and hence on the update time.

- Also, the probability of evicting a sample point (1 - T/T’) is typically small, so the same trick saves coin flips and decreases the update time for evictions.
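
One way to draw the skip count with a single random number is inverse-transform sampling of the geometric distribution above. This is a sketch of that idea, not necessarily how the paper implements it; `next_skip` is a name invented for this example.

```python
import math
import random

def next_skip(T, rng):
    """Number of inserts to skip before the next one enters the sample.

    Satisfies Pr[skip == i] = (1 - 1/T)**i * (1/T) for i = 0, 1, 2, ...
    """
    if T <= 1.0:
        return 0                      # every insert is selected
    u = 1.0 - rng.random()            # uniform in (0, 1], avoids log(0)
    # Pr[skip >= i] = Pr[u <= (1 - 1/T)**i] = (1 - 1/T)**i
    return int(math.log(u) / math.log(1.0 - 1.0 / T))
```

The expected skip is T - 1, so for large T one random draw replaces roughly T coin flips.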




Online Algorithm: Analysis

- Before T is first raised, the cost is O(sample-size) in terms of coin flips

- After that, T is raised by a constant factor each time, so for each sample point evicted we expect a constant number of coin tosses that result in a sample point being retained.

- Thus we have an O(1) amortized expected update time per insert, regardless of the data distribution.




Quantifying the SS advantage

- We know that for concise sampling, SS >= Footprint

- The expected SS increases with the skew of the data

- For exponential distributions the advantage is exponential!

- Theorem #3: Consider the family of exponential distributions: for $i = 1, 2, \ldots$, $\Pr(v = i) = \alpha^{-i}(\alpha - 1)$, for $\alpha > 1$. For any footprint m >= 2, the expected sample-size of a concise sample with footprint m is at least $\alpha^{m/2}$.





- Proof: "at least $\alpha^{m/2}$" means we need a lower bound.

- A footprint of m always has room for the m/2 values 1, …, m/2, so the expected SS can be lower bounded by the expected number of randomly selected tuples before the first tuple with a value v > m/2 is selected.

- The probability of selecting a value > m/2 is:

  $\sum_{i=m/2+1}^{\infty} \alpha^{-i}(\alpha - 1) = \alpha^{-m/2}$

- So the expected number of tuples selected before such an event occurs is $\alpha^{m/2}$.
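
For reference, the tail probability in this proof is a geometric series, and the expected waiting time follows from the mean of a geometric random variable. The derivation below only spells out those two standard steps, using α for the distribution parameter of Theorem #3.

```latex
\Pr[v > m/2]
  = \sum_{i=m/2+1}^{\infty} \alpha^{-i}(\alpha - 1)
  = (\alpha - 1)\,\frac{\alpha^{-(m/2+1)}}{1 - \alpha^{-1}}
  = \alpha^{-m/2},
\qquad
\mathbb{E}\bigl[\text{draws up to the first } v > m/2\bigr]
  = \frac{1}{\alpha^{-m/2}}
  = \alpha^{m/2}.
```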





- Expected gain over a traditional sample for arbitrary data sets.

- Frequency moment: Let $A = (a_1, a_2, \ldots, a_n)$ be a sequence in which each $a_i$ is a member of $N = \{1, 2, \ldots, n\}$. Let $m_i = |\{j : a_j = i\}|$ denote the number of occurrences of i in the sequence. Then

  $F_k = \sum_{i=1}^{n} m_i^{k}$

  - $F_0$ = number of distinct elements in the sequence
  - $F_1$ = length of the sequence
  - $F_2$ = self-join size of the relation!
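
The definition translates directly into a few lines of code. The function name `frequency_moment` is an assumption for this sketch, and the sample sequence is the one from the earlier concise-sample example.

```python
from collections import Counter

def frequency_moment(seq, k):
    """F_k = sum over distinct values i of m_i ** k, with m_i the count of i."""
    counts = Counter(seq)  # only values that actually occur, so every m_i >= 1
    return sum(m ** k for m in counts.values())

seq = [1, 3, 6, 3, 7, 7, 1, 7, 3, 1, 7, 7]
print(frequency_moment(seq, 0))  # 4 distinct values
print(frequency_moment(seq, 1))  # 12, the sequence length
print(frequency_moment(seq, 2))  # 44 = 3*3 + 3*3 + 1*1 + 5*5, the self-join size
```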





- Theorem #4: For any data set, when using a concise sample S with sample-size s, the expected gain is:

  $E[s - \#\text{distinct values in } S] = \sum_{k=2}^{s} (-1)^{k} \binom{s}{k} \frac{F_k}{n^{k}}$

- All we need to find is the expected number of distinct values in S.





- Proof: Let's define $X_i$ to be an indicator random variable

- $X_i = 1$ if the i-th item selected for the traditional sample has a value not yet represented in the sample; $X_i = 0$ otherwise

- $X = \sum_{i=1}^{s} X_i$ = number of distinct values, so $E[X] = \sum_{i=1}^{s} E[X_i]$

- $\Pr(X_i = 1) = \sum_{j} p_j (1 - p_j)^{i-1}$, where $p_j = n_j / n$ is the probability that an item selected at random from the set has value j ($n_j$ is the number of occurrences of value j).
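
The theorem and the indicator-variable argument can be checked against each other numerically. The sketch below assumes the s sample points are drawn independently (with replacement), which is what the formula for Pr(X_i = 1) implies; both function names are invented for this example, and exact fractions are used so the two results can be compared for equality.

```python
from collections import Counter
from fractions import Fraction
from math import comb

def expected_gain_moments(data, s):
    """Expected gain from Theorem #4: sum_{k=2..s} (-1)^k C(s,k) F_k / n^k."""
    n = len(data)
    counts = list(Counter(data).values())
    total = Fraction(0)
    for k in range(2, s + 1):
        F_k = sum(c ** k for c in counts)
        total += (-1) ** k * comb(s, k) * Fraction(F_k, n ** k)
    return total

def expected_gain_direct(data, s):
    """s minus E[X], with E[X] = sum_i sum_j p_j (1 - p_j)^(i-1)."""
    n = len(data)
    expected_distinct = Fraction(0)
    for c in Counter(data).values():
        p = Fraction(c, n)
        for i in range(1, s + 1):
            expected_distinct += p * (1 - p) ** (i - 1)
    return s - expected_distinct

data = [1, 3, 6, 3, 7, 7, 1, 7, 3, 1, 7, 7]
assert expected_gain_moments(data, 5) == expected_gain_direct(data, 5)
```

The two agree because the inner sum over i equals 1 - (1 - p_j)^s, and expanding that power with the binomial theorem yields the frequency moments.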




Experimental Evaluation

- Experiments evaluate the gain in the sample-size of concise samples over traditional samples

- 500K new values were inserted, with D potential distinct values (D varied from 500 to 50K)

- Footprints of m = 100 and m = 1000 were used.

- Three algorithms were compared: traditional reservoir sampling, the concise online algorithm, and the concise offline/static algorithm

- A large variety of Zipf data distributions were used, with the Zipf parameter ranging from 0 to 3 in increments of 0.25




What is the Zipf distribution?

- Some applications of power-law distributions are known under the name Zipf’s Law

- It is a discrete distribution, similar in spirit to the 80-20 rule.

- $P(x) \propto x^{-\alpha}$, where $\alpha$ is the Zipf parameter.

- $\alpha = 0$ gives the uniform distribution; the larger the value of $\alpha$, the more skewed the data
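
A small generator for such data is sketched below, assuming values 1..D with probability proportional to x^(-α) and normalised to sum to 1. The function names and the example parameter values are illustrative only and do not reproduce the paper's experimental setup.

```python
import random

def zipf_weights(D, alpha):
    """Probabilities proportional to x**(-alpha) for x = 1..D."""
    raw = [x ** -alpha for x in range(1, D + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def zipf_sample(D, alpha, size, rng=None):
    """Draw `size` values from 1..D with Zipf(alpha) probabilities."""
    rng = rng or random.Random()
    return rng.choices(range(1, D + 1), weights=zipf_weights(D, alpha), k=size)

# alpha = 0 gives equal weights; larger alpha concentrates mass on small x.
uniform = zipf_weights(4, 0)
skewed = zipf_weights(4, 2)
```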





Experiment (a) m = 100 and (b) m = 1000





Experiment (c) D/m = 50 and (d) D/m = 5





- Update time overheads:
  - Coin flips for inserts and evicts
  - Look-ups into the current concise sample




Counting Samples

- Counting samples are a variation on concise samples.

- They keep track of all occurrences of a value inserted into the relation since the value was selected for the sample.

- Definition 3: A counting sample for R.A with threshold T is any subset of R.A obtained as follows:

1. For each value v occurring c > 0 times in R, we flip a coin with probability 1/T of heads until the first heads, up to at most c coin tosses in all; if the i-th coin toss is heads then v occurs c - i + 1 times in the subset, else v is not in the subset.




Counting Samples

2. Each value v occurring c > 1 times in the subset is represented as a pair (v, c), and each value v occurring exactly once is represented as a singleton v.

- Although counting samples are not uniform random samples of the base data, they can be used to obtain such a sample without any further access to the base data.

- A concise sample can be obtained from a counting sample by considering each pair (v, c) in the counting sample in turn, flipping a coin with probability 1/T of heads c - 1 times, and reducing the count by the number of tails.
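
Definition 3 and the conversion step can be sketched as follows. This illustration assumes a fixed threshold T (no footprint bound and no threshold raising), and the function names are hypothetical.

```python
import random

def counting_sample(stream, T, rng=None):
    """Build a counting sample with fixed threshold T from a stream of values."""
    rng = rng or random.Random()
    counts = {}
    for v in stream:
        if v in counts:
            counts[v] += 1        # already selected: count every later occurrence
        elif rng.random() < 1.0 / T:
            counts[v] = 1         # first heads for v: v enters the sample
    return counts

def counting_to_concise(counts, T, rng=None):
    """Turn a counting sample into a concise sample without touching the data."""
    rng = rng or random.Random()
    concise = {}
    for v, c in counts.items():
        # Flip c - 1 coins with heads probability 1/T; each tails removes one copy.
        tails = sum(rng.random() >= 1.0 / T for _ in range(c - 1))
        concise[v] = c - tails
    return concise
```

The first occurrence recorded for a value always survives the conversion, so every value in the counting sample stays in the resulting concise sample with a count of at least 1.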
