Harikrishnan Karunakaran Sulabha Balan ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan CSE 6339

Post on 16-Dec-2015


Page 1: Harikrishnan Karunakaran Sulabha Balan CSE 6339.  Introduction  Icicles  Icicle Maintenance  Icicle-Based Estimators  Quality & Performance  Conclusion

Harikrishnan Karunakaran Sulabha Balan

ICICLES: Self-tuning Samples for Approximate Query Answering

By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan

CSE 6339

Page 2: Outline

Introduction

Icicles

Icicle Maintenance

Icicle-Based Estimators

Quality & Performance

Conclusion

Page 3: Introduction

Analysis of data in data warehouses is useful in decision support

Users of decision support systems want interactive systems

OLAP: Online Analytical Processing

Aggregate Query Answering (AQUA) systems were developed to reduce response time to desirable levels

Applications are tolerant of approximate results

Page 4: Approximate Querying

Various Approaches

Sampling-based

Histogram-based

Clustering

Probabilistic

Wavelet-based

Page 5: Uniform Random Sampling

Sales

Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales (50% sample)

Branch  State  Sales
2       TX     42K
4       CA     42K
6       CA     48K
8       TX     38K
10      CA     41K

SELECT SUM(sales) * 2 AS cnt
FROM s_sales
WHERE state = 'TX'

The factor 2 is the scale factor for the 50% sample
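The scaled SUM query on this slide can be checked with a few lines of Python. This is only a sketch: the rows are copied from the two tables above, while the helper names `sum_sales` and `estimate_sum` are made up for illustration.

```python
# Rows copied from the Sales table on this slide: (branch, state, sales in K).
SALES = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]

# The 50% sample shown on the slide keeps branches 2, 4, 6, 8 and 10.
S_SALES = [row for row in SALES if row[0] % 2 == 0]


def sum_sales(rows, state):
    """Exact SUM(sales) over the given rows for one state."""
    return sum(sales for _, st, sales in rows if st == state)


def estimate_sum(sample, state, sampling_fraction):
    """Scale the sample's SUM by 1 / sampling_fraction (the scale factor)."""
    scale_factor = 1 / sampling_fraction
    return sum_sales(sample, state) * scale_factor


exact = sum_sales(SALES, "TX")              # 42 + 75 + 55 + 38
approx = estimate_sum(S_SALES, "TX", 0.5)   # (42 + 38) * 2
print(exact, approx)
```

On this data the exact answer is 210K and the sample-based estimate is 160K, which shows how much a small uniform sample can miss when the query touches only part of the relation.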

Page 6: Biased Sampling

Sample relation for aggregation query workload regarding Texas branches

Sales

Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales

Branch  State  Sales
2       TX     42K
4       CA     42K
5       TX     75K
7       TX     55K
8       TX     38K
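To see why the biased sample helps a Texas-heavy workload, the sketch below scales each state's partial sum by its own factor (rows in the relation divided by rows kept in the sample). The slide does not show how the biased sample is scaled, so the per-state scale factor and the helper names are assumptions made for illustration only.

```python
from collections import Counter

# Rows copied from the Sales table on this slide: (branch, state, sales in K).
SALES = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]

# The biased sample on the slide keeps branches 2, 4, 5, 7 and 8.
BIASED = [row for row in SALES if row[0] in {2, 4, 5, 7, 8}]


def state_scale_factors(relation, sample):
    """Assumed scaling: rows per state in the relation / rows kept per state."""
    total = Counter(st for _, st, _ in relation)
    kept = Counter(st for _, st, _ in sample)
    return {st: total[st] / kept[st] for st in kept}


def estimate_sum(sample, state, scale):
    """Scaled SUM(sales) for one state, using that state's own factor."""
    partial = sum(sales for _, st, sales in sample if st == state)
    return partial * scale.get(state, 0)


scale = state_scale_factors(SALES, BIASED)
tx_estimate = estimate_sum(BIASED, "TX", scale)   # every TX row is in the sample
ca_estimate = estimate_sum(BIASED, "CA", scale)   # one CA row stands for six
print(scale, tx_estimate, ca_estimate)
```

Under this assumption the Texas estimate matches the exact total of 210K, while the California estimate (252K against an exact 291K) pays for the bias.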

Page 7: Methodology

All tuples in a uniform random sample are treated as equally important for answering queries

The sample needs to be tuned to contain tuples that are more relevant to answering the queries in a workload

Need for a dynamic algorithm that changes the sample to suit the queries being executed in the workload

Page 8: Join Synopsis

Join of a uniform random sample of a fact table with a set of accompanying dimension tables

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey AND LI_Suppkey = S_Suppkey
  AND C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey
  AND R_Name = 'North America'
  AND O_Orderdate >= '01-01-1998' AND O_Orderdate <= '12-31-1998';
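The definition of a join synopsis above can be sketched in a few lines: sample the fact table uniformly, then follow each sampled row's foreign keys into the dimension tables. The toy schema, the column names, and the `join_synopsis` helper below are hypothetical and exist only to illustrate the idea.

```python
import random

# Hypothetical toy schema (not from the slides): line items are the fact
# table; orders and customers are dimension tables reached via foreign keys.
ORDERS = {1: {"o_custkey": 10}, 2: {"o_custkey": 11}, 3: {"o_custkey": 10}}
CUSTOMERS = {10: {"c_nation": "US"}, 11: {"c_nation": "CA"}}
LINEITEMS = [
    {"li_orderkey": 1 + i % 3, "li_extendedprice": 100.0 + i} for i in range(12)
]


def join_synopsis(fact_rows, sample_size, rng):
    """Uniformly sample the fact table, then join each row with its dimensions."""
    sampled = rng.sample(fact_rows, sample_size)
    synopsis = []
    for li in sampled:
        order = ORDERS[li["li_orderkey"]]          # exactly one matching order
        customer = CUSTOMERS[order["o_custkey"]]   # exactly one matching customer
        synopsis.append({**li, **order, **customer})
    return synopsis


synopsis = join_synopsis(LINEITEMS, 4, random.Random(0))
for row in synopsis:
    print(row)
```

Because every fact row matches exactly one row in each dimension table, the joined sample has the same number of rows as the fact-table sample.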

Page 9: Need for Icicles

Any aggregate query on the fact table can be answered approximately using exactly one of a small number of synopses

A uniform random sample of the relation wastes memory

OLAP queries exhibit locality in their data access

Page 10: Icicles

A class of samples that captures the data locality of aggregate queries on foreign key joins

Identify the focus of a query workload and sample accordingly

An icicle is a uniform random sample of a multiset of tuples L, which is the union of R and all sets of tuples that were required to answer queries in the workload (an extension of R)

An icicle is a non-uniform sample of the original relation R
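The definition above (a uniform sample of L that is a non-uniform sample of R) can be illustrated with a small sketch. The tuple ids, the two-query workload, and the helper names are hypothetical; the sketch builds L explicitly for clarity, whereas the maintenance slides describe an incremental approach.

```python
import random
from collections import Counter


def build_extension(relation, workload):
    """L = R plus, for each query, the tuples of R needed to answer it."""
    extension = list(relation)
    for predicate in workload:
        extension.extend(t for t in relation if predicate(t))
    return extension


def draw_icicle(extension, size, rng):
    """Uniform random sample of the multiset L."""
    return rng.sample(extension, size)


R = list(range(1, 11))                              # hypothetical tuple ids
workload = [lambda t: t <= 3, lambda t: t <= 3]     # two queries on tuples 1..3
L = build_extension(R, workload)
frequency = Counter(L)                              # copies of each tuple in L
icicle = draw_icicle(L, 5, random.Random(1))
print(len(L), frequency[1], frequency[10], icicle)
```

Tuples 1 to 3 each appear three times in L and the others once, so a uniform draw from L is three times as likely to pick a frequently used tuple as an unused one.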

Page 11: Icicle L

Page 12: Icicle Maintenance Algorithm

Page 13: Icicle Maintenance Algorithm

The algorithm is efficient because:

A uniform random sample of L ensures that a tuple's selection into the icicle is proportional to its frequency

Incremental maintenance of the icicle requires only the segment of R that satisfies the new query from the workload

Reservoir sampling algorithm
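The points above suggest maintaining the icicle as a reservoir over the growing multiset L. The sketch below uses textbook reservoir sampling (Algorithm R); the slides do not spell out the paper's exact procedure, so the `Icicle` class and its `offer` method are assumptions made for illustration.

```python
import random


class Icicle:
    """Fixed-size uniform sample of a stream of tuples (textbook Algorithm R)."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.sample = []
        self.seen = 0   # number of tuples offered so far, i.e. the size of L

    def offer(self, tuples):
        """Feed the tuples that one query touched; only these are needed."""
        for t in tuples:
            self.seen += 1
            if len(self.sample) < self.capacity:
                self.sample.append(t)
            else:
                # Keep the new tuple with probability capacity / seen.
                slot = self.rng.randrange(self.seen)
                if slot < self.capacity:
                    self.sample[slot] = t


R = list(range(100))                       # hypothetical relation of tuple ids
icicle = Icicle(10, random.Random(7))
icicle.offer(R)                            # start from a uniform sample of R
icicle.offer([t for t in R if t < 5])      # a new query touches tuples 0..4
print(icicle.seen, icicle.sample)
```

Each call to `offer` needs only the tuples the new query touched, which matches the incremental-maintenance point on this slide.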

Page 14: Icicle Maintenance Example

SELECT average(*)
FROM widget-tuners
WHERE date.month = 'April'

Page 15: Icicle-Based Estimators

• Although uniform sampling is used (over L), the result is a biased sample of R

• A frequency relation is maintained over all tuples in the relation

• Different estimation mechanisms for Average, Count, and Sum

Page 16: Estimators

Average: The average taken over the set of distinct sample tuples that satisfy the query predicate is a pretty good estimate of the average

Count: Sum of the expected contributions of all tuples in the sample that satisfy the given query

Sum: The estimate is given by the product of the average and count estimates
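The three estimators can be sketched as follows. The slide does not give the formula for a tuple's "expected contribution", so the weight used here, |L| / (n * f(t)) for an icicle of n tuples and a tuple t with frequency f(t) in L, is an assumption chosen so that the count is unbiased when sampling with replacement. The function name and the toy data are also hypothetical.

```python
def icicle_estimates(icicle, frequency, extension_size, predicate, value):
    """Return (average, count, sum) estimates for one aggregate query.

    icicle          sample tuples drawn from L (duplicates allowed)
    frequency       frequency[t] = number of copies of t in L
    extension_size  |L|
    """
    hits = [t for t in icicle if predicate(t)]
    if not hits:
        return None
    # Average: taken over the distinct sample tuples satisfying the predicate.
    distinct = set(hits)
    average = sum(value(t) for t in distinct) / len(distinct)
    # Count: each sampled copy contributes |L| / (n * f(t)) (assumed weight).
    n = len(icicle)
    count = sum(extension_size / (n * frequency[t]) for t in hits)
    # Sum: product of the average and count estimates.
    return average, count, average * count


# Toy check where the icicle holds all of L, so the estimates are exact.
a, b, c = ("a", 10), ("b", 20), ("c", 30)
frequency = {a: 2, b: 1, c: 1}
icicle = [a, a, b, c]
average, count, total = icicle_estimates(
    icicle, frequency, extension_size=4,
    predicate=lambda t: True, value=lambda t: t[1],
)
print(average, count, total)
```

In the toy check the icicle contains every copy in L, so the estimates reduce to the exact average (20), count (3), and sum (60) over R.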

Page 17: Maintaining Frequency Relation

A frequency attribute is added to the relation

The starting frequency is set to 1 for all tuples

The frequency is incremented each time a tuple is used to answer a query

Frequencies of relevant tuples are updated only when the icicle is updated with a new query
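A minimal sketch of the bookkeeping described above, assuming the frequency relation is held as a dictionary keyed by tuple (which requires the tuples to be distinct and hashable). The names `init_frequencies` and `record_query` are illustrative, not taken from the paper.

```python
def init_frequencies(relation):
    """Every tuple starts with frequency 1."""
    return {t: 1 for t in relation}


def record_query(frequency, relation, predicate):
    """Increment the frequency of each tuple used to answer the query."""
    used = [t for t in relation if predicate(t)]
    for t in used:
        frequency[t] += 1
    return used


R = [("TX", 42), ("CA", 80), ("TX", 75)]            # hypothetical tuples
frequency = init_frequencies(R)
used = record_query(frequency, R, lambda t: t[0] == "TX")
print(frequency)
```

The list returned by `record_query` is the segment of R that the new query touched, which is also what an incremental icicle update would consume.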

Page 18: Quality Guarantees

When queries exhibit data locality, the icicle contains more tuples from frequently accessed subsets of the relation

The accuracy of an answer improves as the number of tuples used to compute it increases

The class of queries that are 'focused' with respect to the workload will obtain more accurate approximate answers from the icicle

Page 19: Quality Guarantees contd...

Page 20: Performance Evaluation

Qworkload: Template for generating workloads

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey AND LI_Suppkey = S_Suppkey AND
C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey AND
R_Name = [region] AND O_Orderdate >= Date[startdate] AND O_Orderdate <= 12-31-1998

Template for obtaining approximate answers

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LICOS-icicle, N, R
WHERE C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey AND
R_Name = [region] AND O_Orderdate >= Date[startdate] AND O_Orderdate <= 12-31-1998

Page 21: Performance Evaluation contd...

Page 22: Performance Evaluation contd...

The Error Plots for Comparison

Static uniform random sample on Join Synopsis

Icicle as it evolves with the workload

Icicle-Complete, which is formed after the entire workload has been executed once

Page 23: Performance Evaluation contd...

Focused Queries

Page 24: Performance Evaluation contd...

Mixed Workload

Page 25: Observations (focused)

Rapid decrease in the relative error of query answers from icicles when queries are focused on a set of core tuples

The Icicle plot converges to the Icicle-Complete plot

Quick convergence of the Icicle plot towards Icicle-Complete means the icicle adapts fast

Page 26: Observations (mixed)

The improvement due to the use of icicles is not significant

It can be concluded that icicles are, at worst, as good as the static samples

Page 27: Conclusion

Icicles provide a class of samples that adapt to the characteristics of the workload

An icicle can never be worse than static sampling

An icicle focuses on relatively small subsets of the relation

Page 28: Inferences

There are no significant gains in the case of a uniform workload

There is a trade-off between accuracy and cost

The benefits are restricted to scenarios where the queries in the workload are increasingly focused

Page 29: References

V. Ganti, M. Lee, and R. Ramakrishnan. ICICLES: Self-tuning Samples for Approximate Query Answering. VLDB Conference, 2000.

S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. ACM SIGMOD Record, 1999.

Page 30: Thank You

Questions?