Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department of Computer Science North Dakota State University


Aggregate Function Computation and Iceberg Querying in Vertical Databases

Yue (Jenny) Cui
Advisor: Dr. William Perrizo

Master Thesis Oral Defense
Department of Computer Science
North Dakota State University


Introduction

An aggregate on a set T is a functional on the power set $2^T$ (i.e., a map $F: 2^T \to R$, where R is the set of real numbers). Common aggregates include COUNT, SUM, AVERAGE, MIN, MAX, MEDIAN, RANK, and TOP-K.

There are 3 types of aggregate functions. Let T be a set, let G be a numeric aggregate (i.e., G aggregates a set of numbers into one number), and let $S = \{S_i\}_{i=1..n}$ be a partition of T (i.e., collectively exhaustive and mutually exclusive: $\bigcup_{i=1..n} S_i = T$ and $S_i \cap S_j = \emptyset$ for $i \neq j$).

1. Distributive Aggregates: An aggregate, F, of T is G-distributive if, for every partition S of T, G-aggregating the F-aggregates of the parts of S gives the same result as F-aggregating T (i.e., $F(T) = G(\{F(S_i)\})$ for every partition $S = \{S_i\}$).

– SUM and COUNT are SUM-distributive (F=SUM or F=COUNT, G=SUM)
– MIN is MIN-distributive
– MAX is MAX-distributive

• An aggregate, F, is self-distributive iff it is F-distributive
– e.g., SUM, MIN, MAX, but not COUNT
– What about AVG, MEDIAN, RANK, TOP-K?

2. Algebraic Aggregates: An aggregate, F, of T is algebraic if there is an M-tuple valued function K and a function H such that $F(T) = H(\{K(S_i)\}_{i=1..n})$. Average, Standard Deviation, MaxN, MinN, and Center_of_Mass are all algebraic.

3. Holistic Aggregates: An aggregate function F is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. Median, MostFrequent (also called the Mode), and Rank are common examples of holistic functions.
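The distributive and algebraic definitions can be checked on a small example. The sketch below is only an illustration: the numbers are the # Product values of the Sales relation used later in the slides, while the particular partition and the names T, S, K, and H (in code form) are chosen here for the example.

```python
# The "# Product" values of the Sales relation and one arbitrary partition of
# them. The partition itself is an assumption made for this illustration.
T = [10, 5, 6, 7, 11, 9, 3]
S = [[10, 5], [6, 7, 11], [9, 3]]

# Distributive: F(T) = G({F(Si)}).
assert sum(T) == sum(sum(part) for part in S)   # SUM is SUM-distributive
assert len(T) == sum(len(part) for part in S)   # COUNT is SUM-distributive
assert max(T) == max(max(part) for part in S)   # MAX is MAX-distributive
assert min(T) == min(min(part) for part in S)   # MIN is MIN-distributive

# COUNT is not self-distributive: counting the per-part counts yields the
# number of parts (3), not the number of tuples (7).
assert len([len(part) for part in S]) != len(T)


def K(part):
    """Sub-aggregate for AVERAGE: the 2-tuple (sum, count) of one part."""
    return (sum(part), len(part))


def H(sub_aggregates):
    """Combine the (sum, count) tuples of all parts into the overall average."""
    total = sum(s for s, _ in sub_aggregates)
    count = sum(c for _, c in sub_aggregates)
    return total / count


# Algebraic: AVERAGE(T) = H({K(Si)}), using a fixed-size tuple per part.
average = H([K(part) for part in S])
print(average)
```

A holistic aggregate such as MEDIAN has no comparable fixed-size sub-aggregate: in general each part would have to hand over all of its values.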


Review of Iceberg Queries

• Iceberg queries compute aggregate functions over groups of attribute values and then eliminate the aggregate values that fall below a specified threshold. We use the following example:

    SELECT Location, Product Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Location, Product Type
    HAVING Sum (# Product) >= T

We illustrate the computation in three steps.

Step One: Generate the Location-list.

    SELECT Location, Sum (# Product)
    FROM Relation Sales
    GROUP BY Location
    HAVING Sum (# Product) >= T

Step Two: Generate the Product Type-list.

    SELECT Product Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Product Type
    HAVING Sum (# Product) >= T

Step Three: Generate Location & Product Type pair groups.

From the Location-list and the Product Type-list generated in the first two steps, we can eliminate many of the Location & Product Type pair groups according to the threshold T. Because product counts are non-negative, a pair can reach T only if its Location and its Product Type each reach T on their own.
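The three steps can be sketched on ordinary (horizontal) rows as follows. This is a minimal illustration, not the P-tree method of the thesis: the rows are copied from the Sales relation (Table 1), the threshold 15 is the one used in the later worked example, and the names SALES, group_sums, and iceberg are made up for the sketch. The pruning assumes # Product is never negative.

```python
from collections import defaultdict

# Rows of the Sales relation as (Loc, Type, # Product), copied from Table 1.
SALES = [
    ("New York", "Notebook", 10),
    ("Minneapolis", "Desktop", 5),
    ("New York", "Printer", 6),
    ("New York", "Notebook", 7),
    ("Minneapolis", "Notebook", 11),
    ("Chicago", "Desktop", 9),
    ("Minneapolis", "Fax", 3),
]


def group_sums(rows, key):
    """Sum # Product per group, where key(row) names the group."""
    sums = defaultdict(int)
    for row in rows:
        sums[key(row)] += row[2]
    return dict(sums)


def iceberg(rows, threshold):
    # Steps one and two: single-attribute groups that pass the threshold.
    locations = {k for k, v in group_sums(rows, lambda r: r[0]).items()
                 if v >= threshold}
    types = {k for k, v in group_sums(rows, lambda r: r[1]).items()
             if v >= threshold}
    # Step three: aggregate only pairs whose location and type both survived.
    candidates = [r for r in rows if r[0] in locations and r[1] in types]
    pair_sums = group_sums(candidates, lambda r: (r[0], r[1]))
    return {k: v for k, v in pair_sums.items() if v >= threshold}


result = iceberg(SALES, 15)
print(result)  # {('New York', 'Notebook'): 17}
```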


Algorithms of Aggregate Function Computation Using P-trees

We use the data in relation Sales to illustrate the aggregate function algorithms.

Id | Mon | Loc         | Type     | On line | # Product
1  | Jan | New York    | Notebook | Y       | 10
2  | Jan | Minneapolis | Desktop  | N       | 5
3  | Feb | New York    | Printer  | Y       | 6
4  | Mar | New York    | Notebook | Y       | 7
5  | Mar | Minneapolis | Notebook | Y       | 11
6  | Mar | Chicago     | Desktop  | Y       | 9
7  | Apr | Minneapolis | Fax      | N       | 3

Table 1. Relation Sales, the dataset used in our example.


Algorithms of Aggregate Function Computation Using P-trees (Cont.)

Table 2 shows the binary representation of the data in relation Sales.

Id | Mon                 | Loc                      | Type           | On line | # Product
   | P0,3 P0,2 P0,1 P0,0 | P1,4 P1,3 P1,2 P1,1 P1,0 | P2,2 P2,1 P2,0 | P3,0    | P4,3 P4,2 P4,1 P4,0
1  | 0001                | 00001                    | 001            | 1       | 1010
2  | 0001                | 00101                    | 010            | 0       | 0101
3  | 0010                | 00001                    | 100            | 1       | 0110
4  | 0011                | 00001                    | 001            | 1       | 0111
5  | 0011                | 00101                    | 001            | 1       | 1011
6  | 0011                | 00110                    | 010            | 1       | 1001
7  | 0100                | 00101                    | 101            | 0       | 0011

Table 2. Binary Form of Sales.
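The vertical decomposition behind Table 2 can be sketched as follows. This is a simplified stand-in, assuming each basic P-tree can be treated as a flat bit vector packed into a Python integer with tuple 1 as the leftmost bit; any tree structure or compression of a real P-tree is ignored. The names PRODUCT, bit_slices, and show are made up for the sketch.

```python
# "# Product" column of Table 1, in tuple order (Id 1..7).
PRODUCT = [10, 5, 6, 7, 11, 9, 3]


def bit_slices(values, width):
    """Return {i: int}; slice i packs bit i of every value, first tuple leftmost."""
    slices = {}
    for i in range(width):
        packed = 0
        for v in values:
            packed = (packed << 1) | ((v >> i) & 1)
        slices[i] = packed
    return slices


def show(bits, n_tuples):
    """Render a packed slice as a string of n_tuples binary digits."""
    return format(bits, f"0{n_tuples}b")


slices = bit_slices(PRODUCT, 4)
for i in reversed(range(4)):
    print(f"P4,{i} = {show(slices[i], len(PRODUCT))}")
# P4,3 = 1000110
# P4,2 = 0111000
# P4,1 = 1011101
# P4,0 = 0101111
```

Read top to bottom, each printed vector is one column of the # Product part of Table 2.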


Algorithm of Aggregate Function COUNT

• COUNT function: No special function is needed for COUNT, because the P-tree RootCount function already provides the mechanism to implement it. Given a P-tree Pi, RootCount(Pi) returns the number of 1s in Pi.
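As a rough illustration, assuming a P-tree is modeled as a flat bit vector held in a Python integer (leftmost bit = tuple 1), RootCount reduces to a population count. The function name root_count and the all-ones vector P_ALL are assumptions of this sketch; P3_0 is the On line column of Table 2.

```python
def root_count(ptree: int) -> int:
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(ptree).count("1")


P3_0 = 0b1011110   # "On line" column of Table 2, tuple 1 is the leftmost bit
P_ALL = 0b1111111  # all-ones vector: every one of the 7 tuples is selected

count_all = root_count(P_ALL)     # COUNT over the whole relation
count_online = root_count(P3_0)   # COUNT of tuples with On line = Y
print(count_all, count_online)    # 7 5
```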


Algorithm of Aggregate Function SUM

• SUM function: The SUM function totals a field of numerical values.

Algorithm 4.1. Evaluating sum () with P-trees.

    total = 0.00;
    For i = 0 to n {
        total = total + 2^i * RootCount (Pi);
    }
    Return total

Algorithm 4.1. Sum Aggregate
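Algorithm 4.1 can be sketched in a few lines, assuming each bit slice is a flat bit vector packed into a Python integer (tuple 1 is the leftmost bit). The slice values are those of the # Product column in Table 2; root_count, P4, and ptree_sum are names chosen for this sketch.

```python
def root_count(ptree: int) -> int:
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(ptree).count("1")


# Bit slices P4,3 .. P4,0 of "# Product" from Table 2, keyed by bit position.
P4 = {3: 0b1000110, 2: 0b0111000, 1: 0b1011101, 0: 0b0101111}


def ptree_sum(slices):
    """Sum a column by weighting each slice's root count with 2^i."""
    total = 0
    for i, ptree in slices.items():
        total += (1 << i) * root_count(ptree)
    return total


total = ptree_sum(P4)
print(total)  # 51, which equals 10 + 5 + 6 + 7 + 11 + 9 + 3
```

The loop touches one slice per bit of the attribute, so its length depends on the attribute's bit width rather than on the number of tuples.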


Algorithm of Aggregate Function SUM

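For example, to find the total number of products sold in relation Sales, we take the root count of each bit slice of # Product and weight it by its bit position:

# Product | P4,3 | P4,2 | P4,1 | P4,0
10        | 1    | 0    | 1    | 0
5         | 0    | 1    | 0    | 1
6         | 0    | 1    | 1    | 0
7         | 0    | 1    | 1    | 1
11        | 1    | 0    | 1    | 1
9         | 1    | 0    | 0    | 1
3         | 0    | 0    | 1    | 1
RootCount | {3}  | {3}  | {5}  | {5}

$2^3 \times 3 + 2^2 \times 3 + 2^1 \times 5 + 2^0 \times 5 = 51$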


Algorithm of Aggregate Function AVERAGE

• AVERAGE function: The AVERAGE function returns the average value of a field. It can be calculated from the COUNT and SUM functions.

Average () = Sum () / Count ().
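A minimal sketch of this ratio, again assuming flat bit vectors packed into Python integers in place of real P-trees. The optional selection mask and the names root_count, P4, P_ALL, and ptree_average are assumptions of the sketch; the slice values come from Table 2.

```python
def root_count(ptree: int) -> int:
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(ptree).count("1")


# Bit slices of "# Product" from Table 2 and an all-ones selection vector.
P4 = {3: 0b1000110, 2: 0b0111000, 1: 0b1011101, 0: 0b0101111}
P_ALL = 0b1111111


def ptree_average(slices, mask):
    """Average of the selected tuples: masked SUM divided by masked COUNT."""
    total = sum((1 << i) * root_count(p & mask) for i, p in slices.items())
    count = root_count(mask)
    return total / count  # raises ZeroDivisionError if no tuple is selected


average = ptree_average(P4, P_ALL)
print(average)  # 51 / 7
```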


Algorithm of Aggregate Function MAX

• MAX function: The MAX function returns the largest value in a field.

Algorithm 4.2. Evaluating max () with P-trees.

    max = 0.00;
    c = 0;
    Pc is set to all 1s;
    For i = n to 0 {
        c = RootCount (Pc AND Pi);
        If (c >= 1) {
            Pc = Pc AND Pi;
            max = max + 2^i;
        }
    }
    Return max;

Algorithm 4.2. Max Aggregate.
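Algorithm 4.2 can be sketched as below, assuming flat bit vectors packed into Python integers (tuple 1 leftmost) stand in for P-trees. Returning the final candidate vector Pc alongside the maximum is an addition of this sketch, as are the names root_count, P4, and ptree_max.

```python
def root_count(ptree: int) -> int:
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(ptree).count("1")


# Bit slices of "# Product" from Table 2, keyed by bit position.
P4 = {3: 0b1000110, 2: 0b0111000, 1: 0b1011101, 0: 0b0101111}


def ptree_max(slices, n_tuples):
    """Scan slices from the most significant bit, keeping tuples that have a 1."""
    pc = (1 << n_tuples) - 1          # Pc starts as all 1s
    result = 0
    for i in sorted(slices, reverse=True):
        candidates = pc & slices[i]
        if root_count(candidates) >= 1:
            pc = candidates           # some candidate has bit i set
            result += 1 << i
    return result, pc


maximum, holders = ptree_max(P4, 7)
print(maximum, format(holders, "07b"))  # 11 0000100 (tuple 5 holds the maximum)
```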


Algorithm of Aggregate Function MAX

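# Product | P4,3 | P4,2 | P4,1 | P4,0
10        | 1    | 0    | 1    | 0
5         | 0    | 1    | 0    | 1
6         | 0    | 1    | 1    | 0
7         | 0    | 1    | 1    | 1
11        | 1    | 0    | 1    | 1
9         | 1    | 0    | 0    | 1
3         | 0    | 0    | 1    | 1
Max bit   | {1}  | {0}  | {1}  | {1}

1. Pc = P4,3; RootCount (Pc) = 3 >= 1, so bit 3 of the maximum is 1.

2. RootCount (Pc AND P4,2) = 0 < 1, so bit 2 is 0 and Pc = Pc AND P’4,2.

3. RootCount (Pc AND P4,1) = 2 >= 1, so bit 1 is 1 and Pc = Pc AND P4,1.

4. RootCount (Pc AND P4,0) = 1 >= 1, so bit 0 is 1.

$2^3 \times 1 + 2^2 \times 0 + 2^1 \times 1 + 2^0 \times 1 = 11$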


Algorithm of Aggregate Function MIN

• MIN function: The MIN function returns the smallest value in a field.

Algorithm 4.3. Evaluating min () with P-trees.

    min = 0.00;
    c = 0;
    Pc is set to all 1s;
    For i = n to 0 {
        c = RootCount (Pc AND NOT (Pi));
        If (c >= 1)
            Pc = Pc AND NOT (Pi);
        Else
            min = min + 2^i;
    }
    Return min;

Algorithm 4.3. Min Aggregate.
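Algorithm 4.3 admits the same kind of sketch, assuming flat bit vectors packed into Python integers (tuple 1 leftmost) in place of P-trees. The complement is masked to the number of tuples because Python integers are unbounded; the names root_count, P4, and ptree_min are made up here.

```python
def root_count(ptree: int) -> int:
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(ptree).count("1")


# Bit slices of "# Product" from Table 2, keyed by bit position.
P4 = {3: 0b1000110, 2: 0b0111000, 1: 0b1011101, 0: 0b0101111}


def ptree_min(slices, n_tuples):
    """Scan slices from the most significant bit, keeping tuples that have a 0."""
    full = (1 << n_tuples) - 1
    pc = full                              # Pc starts as all 1s
    result = 0
    for i in sorted(slices, reverse=True):
        zeros = pc & ~slices[i] & full     # Pc AND NOT(Pi), limited to n_tuples bits
        if root_count(zeros) >= 1:
            pc = zeros                     # some candidate has bit i clear
        else:
            result += 1 << i               # every candidate has bit i set
    return result, pc


minimum, holders = ptree_min(P4, 7)
print(minimum, format(holders, "07b"))  # 3 0000001 (tuple 7 holds the minimum)
```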


Algorithm of Aggregate Function MIN

# Product | P4,3 | P4,2 | P4,1 | P4,0
10        | 1    | 0    | 1    | 0
5         | 0    | 1    | 0    | 1
6         | 0    | 1    | 1    | 0
7         | 0    | 1    | 1    | 1
11        | 1    | 0    | 1    | 1
9         | 1    | 0    | 0    | 1
3         | 0    | 0    | 1    | 1
Min bit   | {0}  | {0}  | {1}  | {1}

1. Pc = P’4,3; RootCount (Pc) = 4 >= 1, so bit 3 of the minimum is 0.

2. RootCount (Pc AND P’4,2) = 1 >= 1, so bit 2 is 0 and Pc = Pc AND P’4,2.

3. RootCount (Pc AND P’4,1) = 0 < 1, so bit 1 is 1 and Pc = Pc AND P4,1.

4. RootCount (Pc AND P’4,0) = 0 < 1, so bit 0 is 1.

$2^3 \times 0 + 2^2 \times 0 + 2^1 \times 1 + 2^0 \times 1 = 3$


Algorithms of Aggregate Function MEDIAN and RANK

• MEDIAN function: The MEDIAN function returns the median value in a field.

• RANK function: Rank (K) returns the Kth largest value in a field.

Algorithm 4.4. Evaluating Median () with P-trees.

    median = 0.00;
    pos = N/2;    (for Rank, pos = K)
    c = 0;
    Pc is set to all 1s for single attribute;
    For i = n to 0 {
        c = RootCount (Pc AND Pi);
        If (c >= pos) {
            median = median + 2^i;
            Pc = Pc AND Pi;
        }
        Else {
            pos = pos - c;
            Pc = Pc AND NOT (Pi);
        }
    }
    Return median;

Algorithm 4.4. Median Aggregate.
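Algorithm 4.4 can be sketched as one rank routine with the median as a special case. The sketch assumes flat bit vectors packed into Python integers (tuple 1 leftmost) instead of real P-trees. For an odd number of tuples it rounds N/2 up, which matches the slides' worked example (pos = 4 for seven tuples); that rounding choice and the names root_count, P4, ptree_rank, and ptree_median are assumptions of the sketch.

```python
def root_count(ptree: int) -> int:
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(ptree).count("1")


# Bit slices of "# Product" from Table 2, keyed by bit position.
P4 = {3: 0b1000110, 2: 0b0111000, 1: 0b1011101, 0: 0b0101111}


def ptree_rank(slices, n_tuples, k):
    """Return the k-th largest value (k = 1 gives the maximum)."""
    full = (1 << n_tuples) - 1
    pc = full
    pos = k
    value = 0
    for i in sorted(slices, reverse=True):
        ones = pc & slices[i]
        c = root_count(ones)
        if c >= pos:
            value += 1 << i                 # the answer has bit i set
            pc = ones
        else:
            pos -= c                        # skip the c larger candidates
            pc = pc & ~slices[i] & full     # continue among those with bit i clear
    return value


def ptree_median(slices, n_tuples):
    """Median as the rank at position N/2, rounded up (an assumption here)."""
    return ptree_rank(slices, n_tuples, (n_tuples + 1) // 2)


print(ptree_median(P4, 7))  # 7
```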


Algorithm of Aggregate Function MEDIAN

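# Product  | P4,3 | P4,2 | P4,1 | P4,0
10         | 1    | 0    | 1    | 0
5          | 0    | 1    | 0    | 1
6          | 0    | 1    | 1    | 0
7          | 0    | 1    | 1    | 1
11         | 1    | 0    | 1    | 1
9          | 1    | 0    | 0    | 1
3          | 0    | 0    | 1    | 1
Median bit | {0}  | {1}  | {1}  | {1}

With 7 tuples, pos starts at 4.

1. Pc = P4,3; RootCount (Pc) = 3 < 4, so bit 3 of the median is 0, Pc = P’4,3, and pos = 4 – 3 = 1.

2. RootCount (Pc AND P4,2) = 3 >= 1, so bit 2 is 1 and Pc = Pc AND P4,2.

3. RootCount (Pc AND P4,1) = 2 >= 1, so bit 1 is 1 and Pc = Pc AND P4,1.

4. RootCount (Pc AND P4,0) = 1 >= 1, so bit 0 is 1.

$2^3 \times 0 + 2^2 \times 1 + 2^1 \times 1 + 2^0 \times 1 = 7$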


Algorithm of Aggregate Function TOP-K

• Top-k function: To get the largest k values in a field, we first find the rank-k value Vk using the function Rank (K).

• Second, we find all the tuples whose values are greater than or equal to Vk, using the ENRING technology of P-trees.
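The two steps can be sketched as follows. The slides do not describe ENRING, so the second step here swaps in a generic bit-sliced comparison (a most-significant-bit-first scan that tracks "already greater" and "still equal" tuples); it is not the thesis's ENRING procedure. Flat bit vectors packed into Python integers stand in for P-trees, and all function names are made up for the sketch.

```python
def root_count(ptree: int) -> int:
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(ptree).count("1")


# Bit slices of "# Product" from Table 2, keyed by bit position.
P4 = {3: 0b1000110, 2: 0b0111000, 1: 0b1011101, 0: 0b0101111}


def ptree_rank(slices, n_tuples, k):
    """k-th largest value, following the rank algorithm of the slides."""
    full = (1 << n_tuples) - 1
    pc, pos, value = full, k, 0
    for i in sorted(slices, reverse=True):
        ones = pc & slices[i]
        c = root_count(ones)
        if c >= pos:
            value += 1 << i
            pc = ones
        else:
            pos -= c
            pc = pc & ~slices[i] & full
    return value


def ge_mask(slices, n_tuples, v):
    """Bit vector of tuples whose value is >= v (v must fit the slice width)."""
    full = (1 << n_tuples) - 1
    eq, gt = full, 0                  # eq: equal to v so far; gt: already greater
    for i in sorted(slices, reverse=True):
        if (v >> i) & 1:
            eq &= slices[i]
        else:
            gt |= eq & slices[i]
            eq &= ~slices[i] & full
    return gt | eq


def top_k(slices, n_tuples, k):
    vk = ptree_rank(slices, n_tuples, k)
    return vk, ge_mask(slices, n_tuples, vk)


vk, mask = top_k(P4, 7, 3)
print(vk, format(mask, "07b"))  # 9 1000110 (tuples 1, 5, and 6)
```

If several tuples tie at Vk, the mask selects all of them, so it can contain more than k tuples.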


Iceberg Query Operation Using P-trees

• We demonstrate the computation procedure of iceberg querying with the following example:

    SELECT Loc, Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Loc, Type
    HAVING Sum (# Product) >= 15


Iceberg Query Operation Using P-trees (Step One)

• Step one: We build value P-trees for the 3 values, {Loc | New York, Minneapolis, Chicago}, of attribute Loc.

PNY = 1011000
PMN = 0100101
PCH = 0000010

Figure 4. Value P-trees of Attribute Loc


Iceberg Query Operation Using P-trees (Step One)

Bit of Loc | New York | Basic P-tree | Operand in formula 1
P1,4       | 0        | 0000000      | P’1,4 = 1111111
P1,3       | 0        | 0000000      | P’1,3 = 1111111
P1,2       | 0        | 0100111      | P’1,2 = 1011000
P1,1       | 0        | 0000010      | P’1,1 = 1111101
P1,0       | 1        | 1111101      | P1,0 = 1111101
Result     |          |              | PNY = 1011000

Figure 5. Procedure of Calculating PNY

Figure 5 illustrates the calculation of value P-tree PNY. Because the binary value of New York is 00001, we get formula 1:

PNY = P’1,4 AND P’1,3 AND P’1,2 AND P’1,1 AND P1,0    (1)
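Formula 1 generalizes to any attribute value: AND together each basic P-tree, complemented wherever the value's bit is 0. The sketch below assumes flat bit vectors packed into Python integers (tuple 1 leftmost) in place of real P-trees; the Loc slices and location codes are taken from Table 2, and value_ptree is a name made up here.

```python
# Bit slices P1,4 .. P1,0 of attribute Loc from Table 2, keyed by bit position.
P1 = {4: 0b0000000, 3: 0b0000000, 2: 0b0100111, 1: 0b0000010, 0: 0b1111101}
N_TUPLES = 7


def value_ptree(slices, n_tuples, code):
    """Bit vector of the tuples whose attribute equals the binary value `code`."""
    full = (1 << n_tuples) - 1
    result = full
    for i, ptree in slices.items():
        # Use Pi where the code has a 1 bit and its complement P'i otherwise.
        operand = ptree if (code >> i) & 1 else ~ptree & full
        result &= operand
    return result


PNY = value_ptree(P1, N_TUPLES, 0b00001)  # New York
PMN = value_ptree(P1, N_TUPLES, 0b00101)  # Minneapolis
PCH = value_ptree(P1, N_TUPLES, 0b00110)  # Chicago
for name, p in (("PNY", PNY), ("PMN", PMN), ("PCH", PCH)):
    print(name, format(p, "07b"))
# PNY 1011000
# PMN 0100101
# PCH 0000010
```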


Iceberg Query Operation Using P-trees (Step One)

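• After obtaining the value P-trees for each location, we calculate the total number of products sold in each place. We again use the value New York as our example.

$$
\begin{aligned}
\text{Sum}(\#\,\text{Product} \mid \text{New York})
&= 2^3 \times \text{RootCount}(P_{4,3} \text{ AND } P_{NY}) \\
&+ 2^2 \times \text{RootCount}(P_{4,2} \text{ AND } P_{NY}) \\
&+ 2^1 \times \text{RootCount}(P_{4,1} \text{ AND } P_{NY}) \\
&+ 2^0 \times \text{RootCount}(P_{4,0} \text{ AND } P_{NY}) \\
&= 8 \times 1 + 4 \times 2 + 2 \times 3 + 1 \times 1 = 23 \qquad (2)
\end{aligned}
$$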
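Formula 2 is the SUM algorithm restricted by a value P-tree. A minimal sketch, assuming flat bit vectors packed into Python integers (tuple 1 leftmost) in place of real P-trees; the vectors come from Table 2 and Figure 4, and the names root_count, ptree_sum, LOC, sums, and kept are made up here.

```python
def root_count(ptree: int) -> int:
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(ptree).count("1")


# Bit slices of "# Product" (Table 2) and value P-trees of Loc (Figure 4).
P4 = {3: 0b1000110, 2: 0b0111000, 1: 0b1011101, 0: 0b0101111}
LOC = {"New York": 0b1011000, "Minneapolis": 0b0100101, "Chicago": 0b0000010}


def ptree_sum(slices, mask):
    """SUM of the column over the tuples selected by `mask`."""
    return sum((1 << i) * root_count(p & mask) for i, p in slices.items())


# New York: 10 + 6 + 7; Minneapolis: 5 + 11 + 3; Chicago: 9.
sums = {loc: ptree_sum(P4, mask) for loc, mask in LOC.items()}
kept = {loc for loc, s in sums.items() if s >= 15}  # threshold T = 15
print(sums["New York"], sorted(kept))
```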


Iceberg Query Operation Using P-trees (Step One)

Loc Values  | Sum (# Product) | Passes threshold (T = 15)
New York    | 23              | Y
Minneapolis | 19              | Y
Chicago     | 9               | N

Table 3. Summary Table of Attribute Loc.

Table 3 shows the total number of products sold in each of the three locations. Because our threshold T is 15, we eliminate the city Chicago.


Iceberg Query Operation Using P-trees (Step Two)

• Step two: Similarly, we build value P-trees for every value of attribute Type. Attribute Type has four values, {Type | Notebook, Desktop, Printer, Fax}. Figure 6 shows the value P-trees of these four values.

PNotebook = 1001100
PDesktop = 0100010
PPrinter = 0010000
PFax = 0000001

Figure 6. Value P-trees of Attribute Type.


Iceberg Query Operation Using P-trees (Step Two)

Type Values | Sum (# Product) | Passes threshold (T = 15)
Notebook    | 28              | Y
Desktop     | 14              | N
Fax         | 3               | N
Printer     | 6               | N

Table 4. Summary Table of Attribute Type.

• Similarly, we get the summary table for each value of attribute Type.

• With the threshold T equal to 15, only the value P-tree of Notebook is used in the following step.


Iceberg Query Operation Using P-trees (Step Three)

• Step three: We generate candidate Loc and Type pairs only from the locations and product types that pass the threshold T. By performing the AND operation on PNY and PNotebook, we obtain the value P-tree PNY AND Notebook.

PNY = 1011000
PNotebook = 1001100
PNY AND Notebook = PNY AND PNotebook = 1001000

Figure 7. Procedure of Calculating PNY AND Notebook


Iceberg Query Operation Using P-trees (Step Three)

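• We calculate the total number of notebooks sold in New York by formula 3.

$$
\begin{aligned}
\text{Sum}(\#\,\text{Product} \mid \text{New York AND Notebook})
&= 2^3 \times \text{RootCount}(P_{4,3} \text{ AND } P_{\text{NY AND Notebook}}) \\
&+ 2^2 \times \text{RootCount}(P_{4,2} \text{ AND } P_{\text{NY AND Notebook}}) \\
&+ 2^1 \times \text{RootCount}(P_{4,1} \text{ AND } P_{\text{NY AND Notebook}}) \\
&+ 2^0 \times \text{RootCount}(P_{4,0} \text{ AND } P_{\text{NY AND Notebook}}) \\
&= 8 \times 1 + 4 \times 1 + 2 \times 2 + 1 \times 1 = 17 \qquad (3)
\end{aligned}
$$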


Iceberg Query Operation Using P-trees (Step Three)

• By performing the AND operation on PMN and PNotebook, we obtain the value P-tree PMN AND Notebook.

PMN = 0100101
PNotebook = 1001100
PMN AND Notebook = PMN AND PNotebook = 0000100

Figure 8. Procedure of Calculating PMN AND Notebook


Iceberg Query Operation Using P-trees (Step Three)

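• We calculate the total number of notebooks sold in Minneapolis by formula 4.

$$
\begin{aligned}
\text{Sum}(\#\,\text{Product} \mid \text{Minneapolis AND Notebook})
&= 2^3 \times \text{RootCount}(P_{4,3} \text{ AND } P_{\text{MN AND Notebook}}) \\
&+ 2^2 \times \text{RootCount}(P_{4,2} \text{ AND } P_{\text{MN AND Notebook}}) \\
&+ 2^1 \times \text{RootCount}(P_{4,1} \text{ AND } P_{\text{MN AND Notebook}}) \\
&+ 2^0 \times \text{RootCount}(P_{4,0} \text{ AND } P_{\text{MN AND Notebook}}) \\
&= 8 \times 1 + 4 \times 0 + 2 \times 1 + 1 \times 1 = 11 \qquad (4)
\end{aligned}
$$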


Iceberg Query Operation Using P-trees (Step Three)

• Finally, we obtain summary Table 5. With the threshold T = 15, only the group pair “New York And Notebook” passes. From the value P-tree PNY AND Notebook = 1001000, we can see that tuples 1 and 4 make up the result of our iceberg query example.

Loc And Type Values      | Sum (# Product) | Passes threshold (T = 15)
New York And Notebook    | 17              | Y
Minneapolis And Notebook | 11              | N

Table 5. Summary Table of Our Example.
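The whole three-step procedure can be put together in a short sketch. It assumes flat bit vectors packed into Python integers (tuple 1 leftmost) in place of real P-trees; the vectors are those of Table 2, Figure 4, and Figure 6, and the names root_count, ptree_sum, and iceberg are made up for the illustration.

```python
from itertools import product


def root_count(ptree: int) -> int:
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(ptree).count("1")


# "# Product" bit slices (Table 2) and value P-trees of Loc and Type.
P4 = {3: 0b1000110, 2: 0b0111000, 1: 0b1011101, 0: 0b0101111}
LOC = {"New York": 0b1011000, "Minneapolis": 0b0100101, "Chicago": 0b0000010}
TYPE = {"Notebook": 0b1001100, "Desktop": 0b0100010,
        "Printer": 0b0010000, "Fax": 0b0000001}


def ptree_sum(slices, mask):
    """SUM of the column over the tuples selected by `mask`."""
    return sum((1 << i) * root_count(p & mask) for i, p in slices.items())


def iceberg(slices, attr_a, attr_b, threshold):
    # Steps one and two: keep only single-attribute values that pass T.
    keep_a = {v: m for v, m in attr_a.items() if ptree_sum(slices, m) >= threshold}
    keep_b = {v: m for v, m in attr_b.items() if ptree_sum(slices, m) >= threshold}
    # Step three: AND the surviving value vectors and test each pair.
    result = {}
    for (a, mask_a), (b, mask_b) in product(keep_a.items(), keep_b.items()):
        pair_mask = mask_a & mask_b
        pair_sum = ptree_sum(slices, pair_mask)
        if pair_sum >= threshold:
            result[(a, b)] = (pair_sum, pair_mask)
    return result


answer = iceberg(P4, LOC, TYPE, 15)
for pair, (total, mask) in answer.items():
    print(pair, total, format(mask, "07b"))
# ('New York', 'Notebook') 17 1001000
```

Only two pairs are ever evaluated here (New York and Minneapolis with Notebook), instead of the twelve possible Loc and Type combinations.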


Performance Analysis

[Figure: run time (seconds, axis from 0 to 180) versus number of tuples (k; 100, 200, 400, 500, 600) for the P-tree and Bitmap Index methods.]

Figure 15. Iceberg Query with Multi-attribute Aggregation: Performance Time Comparison


Performance Analysis

• Our experiments are implemented in C++ on a 1 GHz Pentium PC with 1 GB of main memory running Red Hat Linux.

• In Figure 15, we compare the running time of the P-tree method and the bitmap method on a multi-attribute iceberg query. In this case, the P-tree method proves to be substantially faster.


Conclusion

• We believe our study confirms that the P-tree approach is superior to the bitmap approach for aggregation of all types and for multi-attribute iceberg queries.

• It also shows the advantages of basic P-tree representations of files:

– First, there is no need for redundant, auxiliary structures.

– Second, basic P-trees are good at calculating multi-attribute aggregations and aggregations over numeric values, and they are fair to all attributes.


Thank you !