adbms seminar report


PUNE INSTITUTE OF COMPUTER TECHNOLOGY, DEPARTMENT OF COMPUTER ENGINEERING

A Seminar Report

on

EFFICIENT ICEBERG QUERY EVALUATION USING BITMAP INDICES

By

Student Name: Om Pawar

Roll No: 3253

Class: TE

Guided By

Internal Guide Name: Prof. A. Phakatkar

Computer Engineering Department

Academic Year: 2012-2013

P: F-SMR-UG/08/R0


CERTIFICATE

This is to certify that Mr./Miss. Om Dilip Pawar, Roll No. 3253, a student of T.E. (Computer Engineering Department), Batch 2012-13, has satisfactorily completed a seminar report on “Efficient Iceberg Query Evaluation Using Compressed Bitmap Index” under the guidance of Prof. A. Phakatkar towards the partial fulfillment of the Third Year Computer Engineering, Semester II of the Pune University.

------------------                ----------------------
Internal Guide                    Head of Department, Computer Engineering

Date:-

Place:-


Abstract:

Decision support and knowledge discovery systems often compute aggregate values of interesting attributes by processing huge amounts of data in very large databases and/or warehouses. An iceberg query is a special type of aggregation query that computes aggregate values above a user-provided threshold. Most existing iceberg query processing algorithms do not take advantage of the small-result-set property and rely heavily on a tuple-scan-based approach. This incurs intensive disk access and computation, resulting in long processing times, especially when the data size is large.

The bitmap index, which builds one bitmap vector for each attribute value, has been gaining popularity in both column-oriented and row-oriented databases in recent years. It occupies less space than the raw data and offers opportunities for more efficient query processing. Bitmap indices also make it possible to leverage the antimonotone property of iceberg queries for aggressive index pruning. The index-pruning-based approach introduced in this paper eliminates the need to scan and process the entire data set (table) and thus speeds up iceberg query processing significantly. Experiments show that this approach is much more efficient than existing algorithms commonly used in row-oriented and column-oriented databases.

Keywords:

Iceberg query, Bitmap index, Dynamic Pruning


INTRODUCTION

Business insight and knowledge discovery from operational data are powerful weapons for gaining competitive advantage in the modern business world. To discover business insights, analysts often compute aggregate values over one or more attributes in large databases (warehouses). An iceberg query [4] is a special class of aggregation query that computes aggregate values above a given threshold. It is of special interest to users, as high-frequency events or high aggregate values often carry more important information.

The general form of an iceberg query on a relation R(C1, C2, …, Cn) is:

SELECT Ci, Cj, …, Cm, AGG(*)
FROM R
GROUP BY Ci, Cj, …, Cm
HAVING AGG(*) >= T

Queries that compute aggregate values over an attribute (or set of attributes) above a given threshold are called iceberg queries because the number of results above the threshold is often very small (the tip of an iceberg) relative to the large amount of input data (the iceberg). With the threshold constraint, an iceberg query usually returns only a very small percentage of the distinct groups as its output. Because of this small result set, iceberg queries can potentially be answered quickly even over a very large data set. However, current database systems and approaches do not take full advantage of this feature of iceberg queries.

Current relational database systems use general aggregation algorithms to answer iceberg queries: they first aggregate all tuples and then evaluate the HAVING clause to select the iceberg result. For large data sets, multipass aggregation algorithms are used when the full aggregate result cannot fit in memory (even when the final iceberg result is small). Most existing query optimization techniques for processing iceberg queries [4] can be categorized as tuple-scan-based approaches, which require at least one table scan to read data from disk.
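To make the tuple-scan-based approach concrete, here is a minimal sketch of the "aggregate everything, then apply HAVING" strategy for COUNT. The function name `iceberg_scan` is hypothetical, and the sample rows are simply columns A and B of Table R from the bitmap index example in this report.

```python
from collections import Counter

def iceberg_scan(rows, group_cols, threshold):
    """Tuple-scan baseline: count every group, then keep those meeting the threshold."""
    counts = Counter(tuple(row[c] for c in group_cols) for row in rows)
    # The HAVING clause is applied only after the full aggregation.
    return {group: n for group, n in counts.items() if n >= threshold}

# Columns A and B of Table R (12 tuples).
R = [("A2", "B2"), ("A1", "B3"), ("A2", "B1"), ("A2", "B2"),
     ("A1", "B3"), ("A2", "B1"), ("A2", "B2"), ("A2", "B1"),
     ("A1", "B3"), ("A2", "B2"), ("A3", "B1"), ("A3", "B1")]

result = iceberg_scan(R, group_cols=(0, 1), threshold=3)
```

Every tuple is read and every group is counted, including groups that can never reach the threshold; this is the cost that the index-based approach tries to avoid.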


Iceberg queries can be evaluated efficiently using bitmap indices. Bitmap indices provide a vertical organization of a column using bitmap vectors, and they operate on bits rather than on real tuple values. Bitwise operations are very fast to execute and can often be accelerated by hardware.


RELATED WORK

Iceberg query processing was first defined and studied by Fang et al. in 1998 [4]. Fang et al. proposed the Hybrid and Multibuckets algorithms by extending earlier probabilistic techniques. A sampling/bucketing method is used to predict valid groups, with possible false positives and false negatives. Efficient strategies are then applied to correct the false positives and false negatives and retrieve the exact result.

In data warehousing, earlier studies addressed computing the iceberg cube, which computes and materializes the cells of a data cube that satisfy a specified condition. These works focus on selecting a proper order for computing aggregations over all combinations of aggregate attributes, to maximize sharing of the computation. The focus of answering iceberg queries is to speed up the processing of a single iceberg query, whereas the focus of computing iceberg cubes is to maximize the shared computation and so shorten the cube generation time. Developing an efficient iceberg query answering algorithm is therefore still necessary, and such algorithms can in turn be leveraged to generate iceberg cubes more efficiently.

Bitmap indices are known to be efficient, especially for read-mostly or append-only data, and are commonly used in data warehousing applications and column stores. Various compression schemes for bitmap indices have been developed. Word-Aligned Hybrid (WAH) [3] and Byte-aligned Bitmap Code (BBC) are two important compression schemes that can be applied to any column and be used in query processing without decompression.

Model 204 was the first commercial product to make extensive use of the bitmap index. Early bitmap indices were used to implement inverted files. In data warehouse applications, bitmap indices have been shown to perform better than tree-based index schemes, such as the variants of the B-tree or R-tree. Compressed bitmap indices are widely used in column-oriented databases such as C-Store, and they contribute to the performance gain of column databases over row-oriented databases.

The development of bitmap compression methods and encoding strategies has further broadened the applicability of the bitmap index. It can now be applied to all types of attributes (e.g., high-cardinality categorical attributes, numeric attributes and text attributes). However, existing works do not leverage the bitmap index effectively to process iceberg queries. In this paper, a novel iceberg query processing algorithm using bitmap indices is introduced and shown to be highly effective.


PROGRAMMER’S DESIGN

BITMAP INDEX AND ITS COMPRESSION

A bitmap for an attribute (column) of a table can be viewed as a v × r matrix, where v

is the number of distinct values of the column and r is the number of tuples (rows) in the

table. Each value in the column corresponds to a bitmap vector of length r, in which the kth

position of the vector is 1 if this value appears in the kth row and 0 otherwise.

e.g.:

(a) Table R

A    B    C
A2   B2   1.23
A1   B3   2.34
A2   B1   5.36
A2   B2   8.36
A1   B3   3.27
A2   B1   9.45
A2   B2   6.23
A2   B1   1.98
A1   B3   8.23
A2   B2   0.11
A3   B1   3.44
A3   B1   2.08

(b) Bitmap indices of A and B

A1  A2  A3      B1  B2  B3
0   1   0       0   1   0
1   0   0       0   0   1
0   1   0       1   0   0
0   1   0       0   1   0
1   0   0       0   0   1
0   1   0       1   0   0
0   1   0       0   1   0
0   1   0       1   0   0
1   0   0       0   0   1
0   1   0       0   1   0
0   0   1       1   0   0
0   0   1       1   0   0
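As an illustration of the definition, the following sketch builds the (uncompressed) bitmap index of column A of Table R. It assumes each bitmap vector is stored as a Python integer whose bit k stands for row k, counting from 0; the helper names `build_bitmap_index` and `to_bits` are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit k is 1 iff the value occurs in row k."""
    index = {}
    for k, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << k)
    return index

def to_bits(vector, r):
    """Render a vector as a string of r bits, row 0 first."""
    return "".join("1" if (vector >> k) & 1 else "0" for k in range(r))

col_a = ["A2", "A1", "A2", "A2", "A1", "A2",
         "A2", "A2", "A1", "A2", "A3", "A3"]
index_a = build_bitmap_index(col_a)

for value in sorted(index_a):
    print(value, to_bits(index_a[value], len(col_a)))
```

Read top to bottom, each printed bit string is one column of the bitmap index of A in part (b) of the example.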


DYNAMIC PRUNING

With bitmap indices, it is easy to calculate the total number of occurrences of a single value (using its bitmap vector) without accessing other data. The antimonotone property can be leveraged to quickly prune bitmap vectors that cannot produce valid iceberg results.

First, we introduce a new bitwise-AND operation, which carries out the following

three actions in one bitwise-AND operation between vectors X and Y:

Z = X AND Y

X = X XOR Z

Y = Y XOR Z

Besides generating the resulting vector Z of the bitwise-AND operation, the operation also

sets the 1 bit in the original vectors to 0, if the corresponding bit in the resulting vector is 1.

After each bitwise-AND operation, the dynamic pruning strategy adds an extra pruning step that monitors the number of remaining 1 bits in both bitmap vectors involved. If the number of 1 bits of a modified vector becomes smaller than the iceberg threshold, the vector can be pruned; that is, no further AND operation is necessary for it. Because the iceberg threshold is usually large, dynamic pruning can reduce the number of AND operations effectively.
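The modified bitwise-AND and the pruning check can be sketched as follows. This is only an illustration on uncompressed vectors modelled as Python integers (bit k stands for row k); the names `pruning_and` and `can_prune` are made up for this example.

```python
def popcount(v):
    """Number of 1 bits in the vector."""
    return bin(v).count("1")

def pruning_and(x, y):
    """Return (z, x', y'): z = x AND y, with the matched bits cleared from x and y."""
    z = x & y
    return z, x ^ z, y ^ z

def can_prune(v, threshold):
    """A vector with fewer than `threshold` 1 bits left cannot yield an iceberg result."""
    return popcount(v) < threshold

x, y = 0b110111, 0b010101
z, x, y = pruning_and(x, y)
```

Here `z` keeps the three shared bits, `x` is left with two 1 bits and `y` with none, so with a threshold of 3 both input vectors can be pruned after this single operation.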

The dynamic pruning strategy works well for attributes with a relatively small number of unique values. For attributes with many unique values, however, its performance degrades severely because of the empty bitwise-AND results problem: with dynamic index pruning alone, many bitwise-AND operations produce empty results, that is, a resulting bitmap vector with no 1 bits. Such bitwise-AND operations are fruitless in two respects:

1) They do not produce a valid iceberg result.

2) They do not reduce the number of 1 bits in the original vectors for index pruning purposes.
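A small sketch of dynamic pruning without vector alignment, run on columns A and B of Table R, shows where empty AND operations come from. It assumes uncompressed vectors stored as Python integers and a COUNT aggregate; `iceberg_dp` and the counters it returns are names chosen for this illustration only.

```python
def popcount(v):
    return bin(v).count("1")

def build_bitmap_index(column):
    index = {}
    for k, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << k)
    return index

def iceberg_dp(index_a, index_b, threshold):
    """Try all vector pairs, pruning vectors whose remaining count drops below threshold."""
    a_vecs = {k: v for k, v in index_a.items() if popcount(v) >= threshold}
    b_vecs = {k: v for k, v in index_b.items() if popcount(v) >= threshold}
    results, and_ops, empty_ops = {}, 0, 0
    for a_val in a_vecs:
        for b_val in list(b_vecs):
            if popcount(a_vecs[a_val]) < threshold:
                break                      # vector a has been pruned
            z = a_vecs[a_val] & b_vecs[b_val]
            and_ops += 1
            if z == 0:
                empty_ops += 1             # fruitless: no result, no bits removed
                continue
            a_vecs[a_val] ^= z
            b_vecs[b_val] ^= z
            if popcount(z) >= threshold:
                results[(a_val, b_val)] = popcount(z)
            if popcount(b_vecs[b_val]) < threshold:
                del b_vecs[b_val]          # vector b has been pruned
    return results, and_ops, empty_ops

col_a = ["A2", "A1", "A2", "A2", "A1", "A2", "A2", "A2", "A1", "A2", "A3", "A3"]
col_b = ["B2", "B3", "B1", "B2", "B3", "B1", "B2", "B1", "B3", "B2", "B1", "B1"]
results, and_ops, empty_ops = iceberg_dp(
    build_bitmap_index(col_a), build_bitmap_index(col_b), threshold=3)
```

On this tiny table one of the four AND operations (A2 with B3) is already empty; with many distinct values the share of such operations grows quickly.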

VECTOR ALIGNMENT

To overcome the empty bitwise-AND results problem, the vector alignment algorithm was developed. For the dynamic pruning algorithm, the worst-case bound on the number of bitwise-AND operations is the product of the numbers of distinct values of all aggregate attributes, which can be much larger than the number of tuples.

Definition:

First 1-bit position: It refers to the position of the first 1-bit in a bitmap vector.

Definition:

Vector alignment: Two bitmap vectors are aligned if their first 1-bit positions are the same.

If two vectors are aligned, their bitwise-AND result will not be empty, because they

have at least one overlapping position.
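The two definitions translate directly into code. The sketch below again assumes uncompressed vectors held as Python integers, with bit k standing for row k; `first_one_bit` and `aligned` are illustrative names.

```python
def first_one_bit(v):
    """Position of the lowest set bit, or None if the vector is empty."""
    return (v & -v).bit_length() - 1 if v else None

def aligned(x, y):
    """True when both vectors are non-empty and share the same first 1-bit position."""
    p = first_one_bit(x)
    return p is not None and p == first_one_bit(y)

x = 0b101100   # 1 bits at positions 2, 3, 5
y = 0b000100   # 1 bit at position 2
w = 0b010000   # 1 bit at position 4

print(aligned(x, y), x & y != 0)   # aligned, so the AND is non-empty
print(aligned(x, w), x & w != 0)   # not aligned; here the AND happens to be empty
```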

1. For each aggregate attribute, a priority queue of its bitmap vectors is built, prioritized by first 1-bit position. The top bitmap vector of each priority queue is then taken, and the two are checked for alignment. If they are aligned, the result of the bitwise-AND operation between them will not be empty, so the bitwise-AND operation is carried out and the dynamic pruning strategy is applied.

2. The above process is repeated until at least one queue is empty.

3. If the two top bitmap vectors are not aligned (for example, because the matching vector has already been pruned), the vector with the smaller first 1-bit position is selected, and all of its 1 bits at positions smaller than the first 1-bit position of the other vector are reset. These bits can be safely reset because they have no matching bits in the remaining vectors of the other queue. The first 1-bit position of the selected vector is then recomputed.

Let S be the Set representing the System

S= {I, O, P, Sc, Fc}

Where I=input

O=output

P=Processes

Sc=Success Case

Fc=Failure case.


I= {R, Q}

Where R=Relation R(C1,C2,…….,Cn)

Q=Query

O= {Iceberg Results}

P= {Calculate query results according to conditions}

Sc= {Proper Iceberg Results}

Fc= {Improper Iceberg Results}

Algorithm 1: Iceberg Processing with Vector Alignment and Dynamic Pruning

icebergPQ (attribute A, attribute B, threshold T)

Output: iceberg results

PQA.clear, PQB.clear
for each vector a of attribute A do
    a.count = BIT1_COUNT(a)
    if a.count >= T then
        a.next1 = first1BitPosition(a, 0)
        PQA.push(a)
for each vector b of attribute B do
    b.count = BIT1_COUNT(b)
    if b.count >= T then
        b.next1 = first1BitPosition(b, 0)
        PQB.push(b)
R = ∅
a, b = nextAlignedVectors(PQA, PQB, T)
while a ≠ null and b ≠ null do
    PQA.pop
    PQB.pop
    r = BITWISE_AND(a, b)
    if r.count >= T then
        Add iceberg result (a.value, b.value, r.count) into R
    a.count = a.count - r.count
    b.count = b.count - r.count
    if a.count >= T then
        a.next1 = first1BitPosition(a, a.next1 + 1)
        if a.next1 ≠ null then
            PQA.push(a)
    if b.count >= T then
        b.next1 = first1BitPosition(b, b.next1 + 1)
        if b.next1 ≠ null then
            PQB.push(b)
    a, b = nextAlignedVectors(PQA, PQB, T)
return R

Algorithm 2: Computing First 1-Bit Position

first1BitPosition (bitmap vector vec, start position pos)

Output: The position of the first 1 bit in vec, starting from position pos

len = 0
for each word w in vector vec do
    if w is a literal word then
        if len <= pos AND len + 31 > pos then
            for p = pos to len + 30 do
                if position p is 1 then
                    return p
        else if len > pos then
            for p = len to len + 30 do
                if position p is 1 then
                    return p
        len += 31
    else if w is a 0 fill word then
        fillLength = length of this fill word
        len += fillLength * 31
    else
        fillLength = length of this fill word
        len += fillLength * 31
        if len > pos then
            return max(pos, len - fillLength * 31)
return null
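The scan over compressed words can be tried out with a simplified word model. The sketch below is not a real WAH decoder: it assumes a made-up representation in which a literal word is `("lit", bits)` with a 31-character bit string (lowest position first) and a fill word is `("fill", bit, n_words)`. A 1-fill that begins after `pos` returns the start of the fill.

```python
WORD_BITS = 31  # payload bits per word, as in the pseudocode above

def first_one_bit_position(words, pos=0):
    """Return the position of the first 1 bit at or after pos, or None."""
    length = 0  # number of positions covered by the words seen so far
    for word in words:
        if word[0] == "lit":
            bits = word[1]
            # Scan only the part of this literal word at or after pos.
            for p in range(max(pos, length), length + WORD_BITS):
                if bits[p - length] == "1":
                    return p
            length += WORD_BITS
        else:
            _, bit, n_words = word
            start = length
            length += n_words * WORD_BITS
            if bit == 1 and length > pos:
                return max(pos, start)  # first position of the 1-fill at or after pos
    return None

# 62 zeros, then a literal word with a 1 at offset 4, then 31 ones.
words = [("fill", 0, 2), ("lit", "0000100" + "0" * 24), ("fill", 1, 1)]

print(first_one_bit_position(words, 0))
print(first_one_bit_position(words, 67))
```

The 0-fill is skipped in a single step without inspecting individual bits, which is what makes the scan cheap on sparse vectors.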

Algorithm 3: Finding the Next Aligned Vectors

nextAlignedVectors (priority queue PQA, priority queue PQB, threshold T)

Output: Two aligned vectors a ∈ PQA, b ∈ PQB

while PQA is not empty and PQB is not empty do
    a = PQA.top
    b = PQB.top
    if a.next1 = b.next1 then
        return a, b
    if a.next1 > b.next1 then
        PQB.pop
        b.next1, skip = first1BitPositionWithSkip(b, a.next1)
        b.count = b.count - skip
        if b.next1 ≠ null AND b.count >= T then
            PQB.push(b)
    else
        PQA.pop
        a.next1, skip = first1BitPositionWithSkip(a, b.next1)
        a.count = a.count - skip
        if a.next1 ≠ null AND a.count >= T then
            PQA.push(a)
return null, null
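Putting the pieces together, the following is a compact sketch of the priority-queue approach for two attributes and a COUNT aggregate. It is a simplification, not the paper's implementation: vectors are uncompressed Python integers (bit k stands for row k), counts are recomputed with a popcount instead of being maintained incrementally, the alignment search is folded into the main loop, and the threshold is assumed to be at least 1. All function names are illustrative.

```python
import heapq

def popcount(v):
    return bin(v).count("1")

def first_one_from(v, pos):
    """Position of the first 1 bit at or after pos, or None."""
    v >>= pos
    return pos + (v & -v).bit_length() - 1 if v else None

def build_bitmap_index(column):
    index = {}
    for k, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << k)
    return index

def fill_queue(index, threshold):
    """Heap of (first 1-bit position, value) for vectors that meet the threshold."""
    heap = []
    for value, vec in index.items():
        if popcount(vec) >= threshold:
            heapq.heappush(heap, (first_one_from(vec, 0), value))
    return heap

def iceberg_pq(index_a, index_b, threshold):
    vec_a, vec_b = dict(index_a), dict(index_b)
    pq_a, pq_b = fill_queue(vec_a, threshold), fill_queue(vec_b, threshold)
    results = {}
    while pq_a and pq_b:
        (pa, a), (pb, b) = pq_a[0], pq_b[0]
        if pa != pb:
            # Not aligned: take the vector that starts earlier and reset
            # its 1 bits below the other vector's first 1-bit position.
            heap, vecs, target = (pq_a, vec_a, pb) if pa < pb else (pq_b, vec_b, pa)
            _, v = heapq.heappop(heap)
            vecs[v] &= ~((1 << target) - 1)
            if vecs[v] and popcount(vecs[v]) >= threshold:
                heapq.heappush(heap, (first_one_from(vecs[v], target), v))
            continue
        heapq.heappop(pq_a)
        heapq.heappop(pq_b)
        z = vec_a[a] & vec_b[b]            # aligned, so z is never empty
        vec_a[a] ^= z
        vec_b[b] ^= z
        if popcount(z) >= threshold:
            results[(a, b)] = popcount(z)
        for heap, vecs, v in ((pq_a, vec_a, a), (pq_b, vec_b, b)):
            if popcount(vecs[v]) >= threshold:   # dynamic pruning
                heapq.heappush(heap, (first_one_from(vecs[v], 0), v))
    return results

col_a = ["A2", "A1", "A2", "A2", "A1", "A2", "A2", "A2", "A1", "A2", "A3", "A3"]
col_b = ["B2", "B3", "B1", "B2", "B3", "B1", "B2", "B1", "B3", "B2", "B1", "B1"]
results = iceberg_pq(build_bitmap_index(col_a), build_bitmap_index(col_b), 3)
```

On Table R with T = 3 this performs three AND operations, one per reported group, and none of them is empty.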

Generalization

It is easy to extend the icebergPQ algorithm to support iceberg queries on more than two attributes, because iceberg queries have the antimonotone property. When there are multiple aggregate attributes, two attributes can therefore be processed at a time.

The icebergPQ algorithm can also be generalized to support other aggregation functions that have the antimonotone property. For example, to support the SUM function, rather than computing the count of 1 bits for each vector, the sum of the values corresponding to the 1 bits in the resulting bitmap vector is computed. When index pruning is conducted, the vectors are pruned by the sum of all values corresponding to the 1 bits left in the vector, rather than by the number of 1 bits. The other parts of the icebergPQ algorithm are kept the same. Because the antimonotone property of iceberg queries is still valid for SUM, the algorithm remains correct. For the MIN (MAX) functions, the modification is similar, since MIN (MAX) also operates on numeric values as SUM does. The minor difference is that after each bitwise-AND operation, the min (max) value is computed instead of the sum, and this min (max) value is then used for index pruning.
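For SUM, the pruning test can be sketched as below, using column C of Table R as the measure. The helper `masked_sum` is a hypothetical name, vectors are uncompressed Python integers, and the pruning rule is only safe under the assumption that all values are non-negative.

```python
def masked_sum(vector, values):
    """Sum values[k] over the rows k whose bit is 1 in vector."""
    total, k = 0.0, 0
    while vector:
        if vector & 1:
            total += values[k]
        vector >>= 1
        k += 1
    return total

# Column C of Table R, and the bitmap vectors of A1 and A3.
col_c = [1.23, 2.34, 5.36, 8.36, 3.27, 9.45, 6.23, 1.98, 8.23, 0.11, 3.44, 2.08]
a1 = (1 << 1) | (1 << 4) | (1 << 8)    # rows 1, 4, 8
a3 = (1 << 10) | (1 << 11)             # rows 10, 11

T = 10.0
# Assumption: values are non-negative, so a remaining sum below T can never recover.
prune_a1 = masked_sum(a1, col_c) < T
prune_a3 = masked_sum(a3, col_c) < T
```

With T = 10, the vector of A3 (sum 5.52) can be pruned immediately, while A1 (sum 13.84) stays in play.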


PERFORMANCE ANALYSIS OF VECTOR ALIGNMENT

Compared with the dynamic pruning algorithm, icebergPQ is much more efficient. Consider a table R(A, B) with n tuples. Suppose A has s unique values, B has t unique values, and the group-by operation on A, B forms g groups. Here g is the number of valid groups that appear at least once in the relation.

It is clear that

s <= g <= n

t <= g <= n.

In theory, if no dynamic pruning is effective, the dynamic pruning algorithm must compare all pairs of vectors from the two attributes. Its worst case is therefore s × t bitwise-AND operations, which could be much slower than scanning the table itself.

icebergPQ, in contrast, performs AND operations only on aligned vectors, so each AND operation corresponds to a real group on A, B. The worst case of icebergPQ is therefore g AND operations, and g is often much smaller than the table size n. Pruning has a significant effect in icebergPQ, since in practice it makes the number of AND operations much smaller than g. Optimization strategies can further reduce the execution time of the AND operations.
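As a quick check of these bounds, the quantities n, s, t and g can be computed for columns A and B of Table R. This is only a worked example on the report's sample table; the variable names are chosen for illustration.

```python
col_a = ["A2", "A1", "A2", "A2", "A1", "A2", "A2", "A2", "A1", "A2", "A3", "A3"]
col_b = ["B2", "B3", "B1", "B2", "B3", "B1", "B2", "B1", "B3", "B2", "B1", "B1"]

n = len(col_a)                       # number of tuples
s = len(set(col_a))                  # distinct values of A
t = len(set(col_b))                  # distinct values of B
g = len(set(zip(col_a, col_b)))      # groups that actually occur

worst_case_dp = s * t                # all pairs of vectors
worst_case_pq = g                    # one AND per real group

print(n, s, t, g, worst_case_dp, worst_case_pq)
```

Even on this 12-row table the pairwise bound (9) is more than twice the number of real groups (4); the gap widens as s and t grow.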


EXPERIMENTAL EVALUATION

The experiments were conducted on a machine with a 3.6 GHz Pentium 4 single-core processor, 2.0 GB of main memory and a 7,200 rpm IDE hard drive, running Ubuntu 9.10 with the ext4 file system. Experiments were carried out with both a synthetic data set and a real patent data set. The experiments assume that the bitmap indexes of the aggregation attributes have already been built offline. This is a reasonable assumption since, beyond iceberg queries, bitmap indexes are useful for many other tasks, especially in column-oriented databases.

In this suite of experiments, icebergDP (dynamic pruning alone) and icebergPQ (dynamic pruning with vector alignment) were tested on data sets with a zipfian distribution, with the data size varied from 1 to 8 million tuples. icebergPQ is one to two orders of magnitude faster than icebergDP, which demonstrates the severe performance issue triggered by the empty bitwise-AND results problem discussed earlier. With 1 million tuples, icebergPQ needs only 0.404 seconds to finish processing, while icebergDP needs 10.688 seconds. icebergPQ also scales well as the data size increases: it takes only 11.36 seconds with 8 million tuples, while icebergDP takes more than 18 minutes. The performance of icebergDP is unacceptable for practical data sizes.
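The reported timings can be turned into speedup factors with a little arithmetic. The figure for 8 million tuples is only a lower bound, because the text says icebergDP took "more than" 18 minutes.

```python
# Timings quoted in the text, in seconds.
dp_1m, pq_1m = 10.688, 0.404
dp_8m_at_least, pq_8m = 18 * 60, 11.36   # "more than 18 minutes" is a lower bound

speedup_1m = dp_1m / pq_1m
speedup_8m_lower_bound = dp_8m_at_least / pq_8m

print(f"1M tuples: about {speedup_1m:.1f}x")
print(f"8M tuples: more than {speedup_8m_lower_bound:.1f}x")
```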


[Figure: Performance of icebergDP and icebergPQ. Panel (b): Normal distribution.]


CONCLUSION

This paper presents an efficient algorithm for iceberg query processing using

compressed bitmap indices. This algorithm demonstrates superior performance over existing

schemes and it does not depend on any particular compression method. It has been observed

that bitmap index has three attractive advantages:

1) Saving disk access by avoiding tuple-scan on a table with a lot of attributes,

2) Saving computation time by conducting bitwise operations, and

3) Leveraging the antimonotone property of iceberg queries to develop aggressive

pruning strategies.

The problem of massive bitwise-AND operations was solved by vector alignment.

Both analysis and experiments verify the effectiveness of this approach and show that this

algorithm can outperform the state-of-the-art algorithms for iceberg query processing.

This algorithm is not sensitive to the number of distinct values, the number of attributes in the relation or the length of individual attributes. It works well on data sets with a zipfian distribution. The performance of this algorithm is better when the query is more “iceberg-like,” that is, when the threshold of the iceberg query is relatively large (which means the percentage of iceberg results is relatively small). It also works better when the number of aggregation attributes is relatively small.

REFERENCES

1. B. He, H.-I. Hsiao, Z. Liu, Y. Huang, and Y. Chen, “Iceberg Query Evaluation Using Bitmap Index,” 2012.

2. F. Deliège and T.B. Pedersen, “Position List Word Aligned Hybrid: Optimizing Space and Performance for Compressed Bitmaps,” Proc. Int’l Conf. Extending Database Technology (EDBT), pp. 228-239, 2010.

3. A. Ferro, R. Giugno, P.L. Puglisi, and A. Pulvirenti, “BitCube: A Bottom-Up Cubing Engineering,” Proc. Int’l Conf. Data Warehousing and Knowledge Discovery (DaWaK), pp. 189-203, 2009.

4. M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J.D. Ullman, “Computing Iceberg Queries Efficiently,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 299-310, 1998.

5. K. Wu, E.J. Otoo, and A. Shoshani, “Optimizing Bitmap Indices with Efficient Compression,” ACM Trans. Database Systems, vol. 31, no. 1, pp. 1-38, 2006.


STC PROGRESS REPORT

Roll No:        Name:        Class:

Sr. No. | Date | Topic of Discussion | Remarks of Guide | Guide’s sign
