statistics profile for query optimization

05/01/04, Spring 2004, CSE8330 Presentation

Statistics Profile for Query Optimization

WENYI NI


Introduction

What is a statistics profile?

• Every object has its own status.

• To know its status, we need statistics.

• The relation between a statistics profile and statistics.


When does a DBMS use a statistics profile?

From M. Tamer Özsu

Cost Model


What does a statistics profile collect?

• The central tendency of the data
• The range of the data
• The size of the data
• The distribution of the data


Common types of statistics profile

• Table profile
• Attribute profile
• Index profile


Typical profiles

Table profile
• Cardinality: 500
• Row size: 30
• Pages: 100
• Number of attributes: 6

Attribute profile
• Values: 100
• Max value: 100
• Min value: 0
• Size: 5
• Data distribution: skew

Index profile
• Pages: 50
• Size: 5
• Distinct values: 50


Three ways to collect statistics

• Exhaustive accumulation
• Sampling
• Piggyback


Exhaustive accumulation

Calculate every statistical descriptor by scanning the related object exhaustively.

Advantage: most accurate
Disadvantage: heavy system load
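As a rough illustration of exhaustive accumulation, the sketch below scans every value of one attribute and derives a few descriptors from the full data. The function name `exhaustive_attribute_profile`, the chosen descriptors, and the sample scores are assumptions made for this example, not part of the presentation.

```python
from collections import Counter


def exhaustive_attribute_profile(values):
    """Scan every value and return basic descriptors for one attribute."""
    values = list(values)
    if not values:
        raise ValueError("cannot profile an empty attribute")
    counts = Counter(values)  # frequency of each distinct value
    return {
        "cardinality": len(values),
        "distinct_values": len(counts),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# Made-up scores, used only to exercise the function.
scores = [55, 60, 60, 72, 88, 88, 88, 100]
profile = exhaustive_attribute_profile(scores)
print(profile)
```

Every value is touched, which is why the result is exact and why the cost grows with the size of the object.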


Sampling

Scan part of the related object and estimate the statistics from the sample data.

Advantage: low system overhead
Disadvantage: still has some overhead, and the statistics are not 100% accurate.
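A minimal sketch of the sampling approach, assuming simple random sampling without replacement; the presentation does not say which sampling scheme is used. The function name `sampled_profile`, the fixed seed, and the test data are hypothetical.

```python
import random


def sampled_profile(values, sample_size, seed=0):
    """Estimate descriptors of one attribute from a random sample."""
    values = list(values)
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    # Never ask for more items than the attribute holds.
    sample = rng.sample(values, min(sample_size, len(values)))
    return {
        "sample_size": len(sample),
        "min": min(sample),   # can only overestimate the true minimum
        "max": max(sample),   # can only underestimate the true maximum
        "mean": sum(sample) / len(sample),
    }


population = list(range(101)) * 5  # made-up scores from 0 to 100
estimate = sampled_profile(population, sample_size=50)
print(estimate)
```

Only 50 of the 505 values are read here, so the cost is lower than a full scan, but the minimum, maximum, and mean are estimates rather than exact values.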


Piggyback

Collect statistics from data that is already in memory. Slightly change the SQL statement to make full use of this data.

Types of piggyback

1. Vertical piggyback

2. Horizontal piggyback

3. Mixed piggyback


Vertical piggyback

Include extra columns during query processing.

Example:
    SELECT student.name FROM student;

Rewrite to:
    SELECT student.name, student.age FROM student;


No extra I/O, but extra CPU load. Solution: set a piggyback level.

1. AC1 = { x | x is a column in table Ri referenced by query Q }
2. AC2 = { x | x is an index column in table Ri } - AC1
3. AC3 = { x | x is a column in table Ri and x is part of the primary key or a foreign key, or is referenced by a foreign key } - AC2
4. AC4 = { x | x is a column in table Ri } - AC3

Advantage: choose the piggyback level according to the CPU load.
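The four column sets can be sketched with ordinary set operations. The code below follows the definitions literally and then assumes that level k piggybacks the union of AC1 through ACk; that reading, the function names, and the column names of the student table are assumptions for illustration only.

```python
def piggyback_sets(all_columns, query_columns, index_columns, key_columns):
    """Build AC1..AC4 for one table, following the set definitions literally."""
    all_columns = set(all_columns)
    ac1 = set(query_columns) & all_columns          # referenced by the query
    ac2 = (set(index_columns) & all_columns) - ac1  # index columns
    ac3 = (set(key_columns) & all_columns) - ac2    # key-related columns
    ac4 = all_columns - ac3                         # everything else
    return [ac1, ac2, ac3, ac4]


def columns_for_level(sets, level):
    """Assumed reading: level k collects the union of AC1..ACk."""
    chosen = set()
    for column_set in sets[:level]:
        chosen |= column_set
    return chosen


# Hypothetical student table with six attributes.
sets = piggyback_sets(
    all_columns={"student_id", "name", "age", "score", "dept_id", "advisor"},
    query_columns={"name"},
    index_columns={"name", "score"},
    key_columns={"student_id", "dept_id"},
)
for level in range(1, 5):
    print(level, sorted(columns_for_level(sets, level)))
```

A lower level adds fewer columns to the rewritten query and so costs less CPU; level 4 covers every column of the table.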


Horizontal piggyback

Include extra rows during query processing.

Example:
    SELECT student.name, student.score
    FROM student
    WHERE score > 60;

Rewrite to:
    SELECT student.name, student.score
    FROM student
    WHERE score > 60
       OR student.pid IN (SELECT student.pid FROM student WHERE score > 60);

Advantage


Mixed piggyback

Use both the vertical and the horizontal piggyback methods.

Advantage


Value distribution

Why do we need it?

Example:
    SELECT * FROM Student
    WHERE score > 60;

Size??

Attribute profile: score
• Max: 100
• Min: 0
• Size: 10
• Values: 101

Distribution table
• 0~9: 1%
• 10~19: 1%
• 20~29: 1%
• 30~39: 3%
• 40~49: 6%
• 50~59: 10%
• 60~69: 10%
• 70~79: 31%
• 80~89: 30%
• 90~100: 10%


Answer:

Size = 500 * 0.81 * 30 = 12,150

where 500 is the cardinality of the student table, 0.81 is the fraction of records in the buckets from 60 up (10% + 31% + 30% + 10%), and 30 is the size of each record.
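The size estimate can be checked with a short sketch that sums the distribution-table buckets from 60 up and multiplies by the cardinality and record size. Percentages are kept as integers to avoid floating-point rounding; the function names are hypothetical, and counting the whole 60~69 bucket for the predicate score > 60 is the same approximation the slide makes.

```python
# (low, high, percent) buckets from the distribution table for score.
DISTRIBUTION = [
    (0, 9, 1), (10, 19, 1), (20, 29, 1), (30, 39, 3), (40, 49, 6),
    (50, 59, 10), (60, 69, 10), (70, 79, 31), (80, 89, 30), (90, 100, 10),
]


def percent_at_or_above(threshold, buckets):
    """Sum the percentages of all buckets that start at or above threshold."""
    return sum(percent for low, _high, percent in buckets if low >= threshold)


def estimated_result_size(cardinality, record_size, percent):
    """Estimated result size = cardinality * selectivity * record size."""
    return cardinality * percent * record_size / 100


percent = percent_at_or_above(60, DISTRIBUTION)
size = estimated_result_size(cardinality=500, record_size=30, percent=percent)
print(percent, size)
```

With the table above this gives a selectivity of 81% and an estimated 405 qualifying records of size 30 each.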


How to get the distribution table?

Histogram
1. Equal width
2. Equal height

[Figure: equal-width histogram. x-axis: Score (10 to 100 in steps of 10); y-axis: Percentage (0 to 35)]

[Figure: equal-height histogram. x-axis: Score (bucket boundaries 45, 56, 63, 68, 73, 76, 78, 85, 90, 100); y-axis: Percentage (0 to 12)]


Bucket number

1 + log2(n) [Sturges' rule, 1926]

Example: student table (500 records)
1 + log2(500) ≈ 10

For equal width, put each value into the bucket whose range contains it.

For equal height, sort the values; if the sample size is m, set the height k = m / (number of buckets) and fill the buckets in sorted order, k values per bucket.
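The bucket-count rule and the two histogram constructions can be sketched as follows. This is an illustrative version under stated assumptions: the logarithm is taken base 2 and rounded up, which reproduces the slide's value of 10 for 500 records, and the function names and the handling of leftover values are choices made here, not taken from the presentation.

```python
import math


def sturges_bucket_count(n):
    """Sturges' rule: 1 + log2(n), rounded up to a whole bucket."""
    return 1 + math.ceil(math.log2(n))


def equal_width_counts(values, bucket_count):
    """Count values per bucket when all buckets span the same value range."""
    low, high = min(values), max(values)
    width = (high - low) / bucket_count or 1  # avoid a zero width
    counts = [0] * bucket_count
    for value in values:
        # The maximum value falls on the upper edge; clamp it into the last bucket.
        index = min(int((value - low) / width), bucket_count - 1)
        counts[index] += 1
    return counts


def equal_height_buckets(values, bucket_count):
    """Sort the values and cut them into runs of k = m / bucket_count values."""
    ordered = sorted(values)
    height = math.ceil(len(ordered) / bucket_count)
    # The last bucket may hold fewer values when m is not a multiple of k.
    return [ordered[i:i + height] for i in range(0, len(ordered), height)]


print(sturges_bucket_count(500))
print(equal_width_counts([0, 1, 2, 3], 2))
print(equal_height_buckets([5, 1, 4, 2, 3, 6], 3))
```

Equal-width buckets are cheap to build but can be very uneven on skewed data; equal-height buckets require a sort but give each bucket roughly the same share of the values.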


Sampling

How many samples do we need?

A sample size of 1064 can give an error rate of less than 10% with 99% probability (Mannino 1988).

To keep the same error rate across tables of different sizes, the sample rate can drop as the table grows.

Drop rate: log(n)/n

Example: 20 samples give a 2% error rate on a table with 100 records (a sample rate of 0.2). To reach a 2% error rate on a table with 1000 records, we need 1000 * 0.2 * (1 - log(1000)/1000) samples.


Summary & Future work

• Low overhead
• Low error rate, with room to improve
• Estimating the size of project and join operations from statistics still needs to be improved.


The end
