Post on 20-Dec-2015
Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets
Based on the work of Jeffrey Scott Vitter and Min Wang
Guidelines
Overview
Preliminaries
The New Approach
The Construction of the Algorithm
Experiments and Results
Summary
The Problem
Computing multidimensional aggregates in high dimensions is a performance bottleneck for many On-Line Analytical Processing (OLAP) applications.
Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment.
Obviously, it is advantageous to have fast, approximate answers to OLAP aggregation queries.
Processing Methods
There are two classes of methods for processing OLAP queries:
Exact Methods
Focus on how to compute the exact data cube.
Approximate Methods
Becoming attractive in OLAP applications; they have been used in DBMSs for a long time. In choosing a proper approximation technique, there are two concerns:
Efficiency
Accuracy
Histograms and Sampling Methods
Advantages:
Simple and natural
Construction procedure is very efficient
Disadvantages:
Inefficient to construct in high dimensions
Cannot fit in internal memory
Histograms and sampling are used in a variety of important applications where quick approximations of an array of values are needed.
Use of wavelet-based techniques to construct analogs of histograms in databases has shown substantial improvements in accuracy over random sampling and other histogram-based approaches.
The Intended Solution
Traditional histograms: infeasible for massive high dimensional data sets
Previously developed wavelet techniques: efficient only for dense data
Previous approximation techniques: results not accurate enough for typical queries
The proposed method provides approximate answers to high dimensional OLAP aggregation queries over MASSIVE SPARSE DATA SETS in a time efficient and space efficient manner.
The Compact Data Cube
The performance of this method depends on the compact data cube, which is an approximate and space efficient representation of the underlying multidimensional array, based upon multiresolution wavelet decomposition.
In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy.
The Data Set
A particular characteristic of the data sets is that they are MASSIVE AND SPARSE.
D = {D_1, D_2, ..., D_d} denotes the set of dimensions.
S is a d-dimensional array that represents the underlying data; S(i_1, i_2, ..., i_d) contains the value of the measure attribute for the corresponding combination of the functional attributes.
N = ∏_{i=1}^{d} |D_i| denotes the total size of array S, where |D_i| is the size of dimension D_i.
N_z is defined to be the number of populated (nonzero) entries in S.
density(S) = N_z / N
Range Sum Queries
An important class of aggregation queries are the so-called range sum queries, which are defined by applying the sum operation over a selected contiguous range in the domain of some of the attributes.
A range sum query can generally be formulated as follows:
sum(l_1:h_1, ..., l_d:h_d) = Σ_{i_1=l_1}^{h_1} ... Σ_{i_d=l_d}^{h_d} S(i_1, ..., i_d)
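To make the definition concrete, here is a brute-force sketch in Python (our illustration; the dict-of-tuples sparse representation and the function name are ours, not the paper's). The whole point of the compact data cube is to avoid this exhaustive scan:

```python
# Brute-force evaluation of sum(l1:h1, ..., ld:hd) over a sparse
# d-dimensional array, stored as a dict mapping index tuples to values.
from itertools import product

def range_sum(S, ranges):
    """S: dict mapping (i1, ..., id) tuples to measure values.
    ranges: list of (l, h) inclusive bounds, one pair per dimension."""
    return sum(S.get(idx, 0)
               for idx in product(*[range(l, h + 1) for (l, h) in ranges]))

# Sparse 2-D example with three populated cells.
S = {(0, 0): 5, (1, 2): 3, (3, 3): 7}
print(range_sum(S, [(0, 1), (0, 2)]))  # cells (0,0) and (1,2) fall in range
# → 8
```

Note that the cost grows with the volume of the query range rather than with N_z, which is why precomputation is needed.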
The d'-Dimensional Range Sum Queries
An interesting subset of the general range sum queries are the d'-dimensional range sum queries, in which d' << d.
In this case ranges are specified for only d' dimensions, and the ranges for the other d - d' dimensions are implicitly set to be the entire domain all_i = {0, ..., |D_i| - 1}:
sum(l_1:h_1, ..., l_d':h_d') = sum(l_1:h_1, ..., l_d':h_d', all_{d'+1}, ..., all_d)
Traditional vs. New Approach
In traditional approaches to answering range sum queries using the data cube, all the subcubes of the data cube need to be precomputed and stored. When a query is given, a search is conducted in the data cube and the relevant information is fetched.
In the new approach, as usual, some preprocessing work is done on the original array, but instead of computing and storing all the subcubes, only one much smaller compact data cube is stored. The compact data cube usually fits in one or a small number of disk blocks.
Approximation Advantages
This approach is preferable to the traditional approaches in two important respects:
Storage space, both for the precomputation and for storing the precomputed data cube.
Time: even when a huge amount of storage space is available and all the data cubes can be stored comfortably, it may take too long to answer a range sum query, since all cells covered by the range need to be accessed.
I/O Model
The conventional parallel disk model, with the restriction I = 1.
The Method Outline
The method can be divided into three sequential phases:
1. Decomposition
2. Thresholding
3. Reconstruction
Decomposition
• Compute the wavelet decomposition of the multidimensional array S.
• Obtain a set of C' wavelet coefficients (C' ~ N_z), since, as in practice, it is assumed that the array is very sparse.
Thresholding and Ranking
• Keep only C (C << C') wavelet coefficients, corresponding to the desired storage usage and accuracy.
• Rank the C wavelet coefficients according to their importance in the context of accurately answering typical aggregation queries.
• The C ordered coefficients compose the compact data cube.
Reconstruction
• In the on-line phase, an aggregation query is processed by using the K most significant coefficients to reconstruct an approximate answer.
• The choice of K depends upon the time the user is willing to spend.
Notes:
More accurate answers can be provided upon request.
Efficiency is crucial, since it affects the query response time directly.
Wavelet Decomposition
Wavelets are a mathematical tool for the hierarchical decomposition of functions in a space efficient manner.
HAAR Wavelets:
• Conceptually very simple wavelet basis functions
• Fast to compute
• Easy to implement
HAAR Wavelet - Example
Suppose we have a one dimensional signal of N = 8 data items:
S = [2, 2, 0, 2, 3, 5, 4, 4]
Averaging neighboring pairs gives the lower-resolution signal [2, 1, 4, 4], with detail coefficients [0, -1, -1, 0] (half the difference of each pair).
By repeating this process recursively on the averages, we get the full decomposition, the wavelet transform.
Wavelet Transform
The wavelet transform is a single coefficient representing the overall average of the original signal, followed by the detail coefficients in order of increasing resolution:
Ŝ = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
The individual entries are called the wavelet coefficients. Coefficients at the lower resolutions are weighted more than those at the higher resolutions. The decomposition is very efficient:
O(N) CPU time
O(N/B) I/Os
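The averaging/differencing procedure above can be sketched in a few lines of Python (a minimal illustration, assuming the signal length is a power of two):

```python
# 1-D Haar wavelet decomposition: pairwise averaging with detail
# coefficients, applied recursively to the averages.
def haar_decompose(signal):
    """Return the Haar wavelet transform of a signal whose length is a power of 2."""
    coeffs = list(signal)
    details = []
    while len(coeffs) > 1:
        averages = [(a + b) / 2 for a, b in zip(coeffs[::2], coeffs[1::2])]
        # Detail coefficient: half the difference of each pair.
        level_details = [(a - b) / 2 for a, b in zip(coeffs[::2], coeffs[1::2])]
        details = level_details + details  # finer resolutions go last
        coeffs = averages
    return coeffs + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# → [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

This reproduces the transform Ŝ of the example signal S above.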
Building The Compact Data Cube
The goal of this step is to compute the wavelet decomposition of the multidimensional array S, obtaining a set of C' wavelet coefficients.
1. Partition the d dimensions into g groups G_j = {D_{i_{j-1}+1}, ..., D_{i_j}}, for some 1 ≤ g ≤ d, where i_0 = 0 and i_g = d. Each G_j must satisfy |D_{i_{j-1}+1}| × ... × |D_{i_j}| ≤ M/(2B).
2. The algorithm for constructing the compact data cube consists of g passes:
G_j is read into memory
a multidimensional decomposition is performed
the results are written out to be used for the next pass
Eliminating Intermediate Results
One problem is that the density of the intermediate results will increase from pass to pass, since performing wavelet decomposition on sparse data usually results in more nonzero data.
The natural solution is truncation, keeping roughly only N_z entries.
Learning process:
• During each pass, on-line statistics of the wavelet coefficients are kept to maintain a cutoff value.
• Any entry whose absolute value is below the cutoff value is thrown away on the fly.
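One way to sketch this on-the-fly truncation is with a bounded min-heap whose root acts as the running cutoff (the heap-based policy is our illustration, not necessarily the paper's exact statistic):

```python
# Stream intermediate wavelet coefficients, keeping only the n_keep
# largest in absolute value; the heap root is the current cutoff value.
import heapq

def truncate_stream(coefficients, n_keep):
    kept = []  # min-heap of (|value|, value) pairs
    for c in coefficients:
        if len(kept) < n_keep:
            heapq.heappush(kept, (abs(c), c))
        elif abs(c) > kept[0][0]:          # above the cutoff: keep it
            heapq.heapreplace(kept, (abs(c), c))
        # else: the entry is thrown away on the fly
    return [c for _, c in kept]

print(sorted(truncate_stream([0.1, -5.0, 2.5, 0.3, -0.2, 4.0], 3)))
# → [-5.0, 2.5, 4.0]
```

With n_keep ≈ N_z this keeps the intermediate results from growing denser from pass to pass.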
Thresholding and Ranking
Given the storage limitation for the compact data cube, it is possible to keep only a limited number of wavelet coefficients. Let:
C' - the number of wavelet coefficients computed.
C - the number of wavelet coefficients that can be stored.
Since C << C', the goal is to determine which are the best C coefficients to keep, so as to minimize the error of approximation.
P-norm
Once the error measure is decided for individual queries, it is meaningful to choose a norm by which to measure the error of a collection of queries.
Let e = (e_1, e_2, ..., e_Q) be the vector of errors over a sequence of Q queries; its p-norm is ||e||_p = (Σ_{i=1}^{Q} |e_i|^p)^{1/p}.
Choosing the Coefficients
Choosing the C largest (in absolute value) wavelet coefficients after normalization is provably optimal in minimizing the 2-norm.
But if coefficient c_i is more likely to contribute to typical queries than another, its weight w(c_i) should be greater, where w(c_i) counts the nonzero contributions of c_i over a set of k typical queries:
w(c_i) = Σ_{j=1}^{k} [contribution of c_i to query j ≠ 0]
Finally:
1. Pick the C'' (C < C'' < C') largest wavelet coefficients.
2. Among the C'' coefficients, choose the C with the largest weights.
3. Order the C coefficients in decreasing order to get the compact data cube.
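The three-step selection can be sketched as follows; the weight function here (counting contributions to a set of sample queries) is our hypothetical stand-in for the paper's weighting:

```python
# Two-stage coefficient selection: magnitude cutoff, then weight ranking.
def build_compact_cube(coeffs, C, C2, weight):
    """coeffs: list of (index, value) pairs; C2 plays the role of C''.
    1. keep the C'' largest by |value|;
    2. among them, keep the C with the largest weight;
    3. return them ordered by decreasing weight."""
    largest = sorted(coeffs, key=lambda t: abs(t[1]), reverse=True)[:C2]
    by_weight = sorted(largest, key=lambda t: weight(t[0]), reverse=True)
    return by_weight[:C]

coeffs = [(0, 9.0), (1, -7.0), (2, 0.5), (3, 4.0), (4, -3.0)]
weights = {0: 5, 1: 1, 3: 4, 4: 2}  # hypothetical per-coefficient weights
print(build_compact_cube(coeffs, C=2, C2=4, weight=lambda i: weights.get(i, 0)))
# → [(0, 9.0), (3, 4.0)]
```

Note that coefficient 1 survives the magnitude cutoff but loses to coefficient 3 on weight.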
Answering On-Line Queries
The error tree is built based upon the wavelet transform procedure; it mirrors the wavelet transform and is a bottom-up process.
S(l:h) denotes the range sum between S(l) and S(h):
S(l:h) = Σ_{i=l}^{h} S(i)
Constructing The Original Signal
The original signal S can be constructed from the tree nodes by the following formulas:
S(0) = Ŝ(0) + Ŝ(1) + Ŝ(2) + Ŝ(4)
S(1) = Ŝ(0) + Ŝ(1) + Ŝ(2) - Ŝ(4)
S(2) = Ŝ(0) + Ŝ(1) - Ŝ(2) + Ŝ(5)
S(3) = Ŝ(0) + Ŝ(1) - Ŝ(2) - Ŝ(5)
S(4) = Ŝ(0) - Ŝ(1) + Ŝ(3) + Ŝ(6)
S(5) = Ŝ(0) - Ŝ(1) + Ŝ(3) - Ŝ(6)
(and similarly S(6) and S(7), using Ŝ(7))
Not all terms are always evaluated; only the true contributors are quickly evaluated when answering a query.
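The signed-sum structure of these formulas can be checked with a short sketch (our code) that walks the error tree for the N = 8 Haar example, using the example transform Ŝ = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]:

```python
# Reconstruct one original value S(i) from the Haar transform by walking
# the error tree: each detail coefficient adds on its left half and
# subtracts on its right half.
def haar_reconstruct_one(coeffs, i):
    n = len(coeffs)
    value = coeffs[0]          # overall average, Ŝ(0)
    node, lo, hi = 1, 0, n     # walk down toward index i
    while hi - lo > 1:
        mid = (lo + hi) // 2
        value += coeffs[node] if i < mid else -coeffs[node]
        if i < mid:
            node, hi = 2 * node, mid
        else:
            node, lo = 2 * node + 1, mid
    return value

s_hat = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print([haar_reconstruct_one(s_hat, i) for i in range(8)])
# → [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
```

Each value touches only log2(N) + 1 coefficients, the "true contributors" mentioned above.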
Answering A Query
To answer a query of the form sum(l_1:h_1, ..., l_d':h_d') using k coefficients of the compact data cube R, the following algorithm is used:
AnswerQuery(R, k, l_1, h_1, ..., l_d', h_d')
  answer = 0
  for i = 1, 2, ..., k do
    if Contribute(R[i], l_1, h_1, ..., l_d', h_d') then
      answer = answer + Compute_Contribute(R[i], l_1, h_1, ..., l_d', h_d')
  for j = d'+1, ..., d do
    answer = answer × |D_j|
  return answer
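A one-dimensional sketch of this idea (our illustrative code, not the paper's implementation): each kept coefficient contributes its value times the signed overlap of the query range with its error-tree interval:

```python
# Approximate sum(l:h) using only the k largest-magnitude Haar coefficients.
def approx_range_sum(coeffs, l, h, k):
    n = len(coeffs)
    # Keep the k most significant coefficient indices by absolute value.
    kept = set(sorted(range(n), key=lambda j: abs(coeffs[j]), reverse=True)[:k])
    total = (h - l + 1) * coeffs[0] if 0 in kept else 0.0
    # Walk the error tree: a detail coefficient at node j adds over the
    # left half of its interval and subtracts over the right half.
    stack = [(1, 0, n)]
    while stack:
        node, lo, hi = stack.pop()
        if node >= n or hi - lo < 2 or hi <= l or lo > h:
            continue  # leaf or disjoint from the query: contributes nothing
        mid = (lo + hi) // 2
        if node in kept:
            left = max(0, min(h, mid - 1) - max(l, lo) + 1)
            right = max(0, min(h, hi - 1) - max(l, mid) + 1)
            total += coeffs[node] * (left - right)
        stack += [(2 * node, lo, mid), (2 * node + 1, mid, hi)]
    return total

s_hat = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(approx_range_sum(s_hat, 2, 5, 8))  # all 8 coefficients: exact answer
# → 10.0
```

With k < C' the result is approximate; with all coefficients it reproduces the exact range sum (here 0 + 2 + 3 + 5 = 10 over the example signal).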
Experiments Description
The experiments were performed using real-world data from the U.S. Census Bureau.
• The data file contains 372 attributes. The measure attribute is income. Functional attributes include, among others: age, sex, education, race, origin.
• Although the dimension sizes are generally small, the high dimensionality results in a 10-dimensional array with more than 16,000,000 cells, density ~ 0.001, N_z = 15,985.
• Platform: Digital Alpha workstation running Digital Unix 4.0; 512 MB internal memory (only 1-10 MB are used for the program); logical block transfer size 2 × 4 KB.
Experiments Sets - Variable Density
Dimension groups were partitioned to satisfy the M/(2B) condition. For all data sets g = 2; the small differences in running time were mainly caused by the on-line cutoff effect.
[Parameter table (garbled in transcription): d, N, N_z, S (MB), M (MB) for the variable-density data sets.]
Experiments Sets - Fixed Density
Running time scales almost linearly with respect to the input data size.
[Parameter table (garbled in transcription): d, N_z, density = 0.001, S (MB), M (MB) for the fixed-density data sets.]
Accuracy of the Approximate Answers
Comparison with traditional histograms is not meaningful, because they are too inefficient to construct for high dimensional data.
Comparison with random sampling algorithms depends on the distribution of the nonzero entries (random sampling performs better for uniform distributions).
Summary
A new wavelet technique for approximate answers to OLAP range sum queries was presented.
Four important issues were discussed and resolved:
I/O efficiency of the data cube construction, especially when the underlying multidimensional array is very sparse.
Response time in answering an on-line query.
Accuracy in answering typical OLAP queries.
Progressive refinement.