Post on 20-Dec-2015
Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets
Based on the work of Jeffrey Scott Vitter and Min Wang
Guidelines
Overview
Preliminaries
The New Approach
The Construction of the Algorithm
Experiments and Results
Summary
The Problem
Computing multidimensional aggregates in high dimensions is a performance bottleneck for many On-Line Analytical Processing (OLAP) applications.
Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment.
Obviously, it is advantageous to have fast, approximate answers to OLAP aggregation queries.
Processing Methods
There are two classes of methods for processing OLAP queries:
Exact Methods
Focus on how to compute the exact data cube.
Approximate Methods
Becoming attractive in OLAP applications; they have been used in DBMSs for a long time. In choosing a proper approximation technique, there are two concerns:
Efficiency
Accuracy
Histograms and Sampling Methods
Advantages:
Simple and natural
Construction procedure is very efficient
Disadvantages:
Inefficient to construct in high dimensions
Cannot fit in internal memory
Histograms and sampling are used in a variety of important applications where quick approximations of an array of values are needed.
Use of wavelet-based techniques to construct analogs of histograms in databases has shown substantial improvements in accuracy over random sampling and other histogram-based approaches.
The Intended Solution
Traditional histograms: infeasible for massive high dimensional data sets
Previously developed wavelet techniques: efficient only for dense data
Previous approximation techniques: results not accurate enough for typical queries
The proposed method provides approximate answers to high dimensional OLAP aggregation queries over MASSIVE SPARSE DATA SETS in a time efficient and space efficient manner.
The Compact Data Cube
The performance of this method depends on the compact data cube, which is an approximate and space efficient representation of the underlying multidimensional array, based upon multiresolution wavelet decomposition.
In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy.
The Data Set
A particular characteristic of the data sets is that they are MASSIVE AND SPARSE.
D = {D_1, D_2, ..., D_d} denotes the set of dimensions.
S is a d-dimensional array that represents the underlying data; S(i_1, i_2, ..., i_d) contains the value of the measure attribute for the corresponding combination of the functional attributes.
N = ∏_{i=1}^{d} |D_i| denotes the total size of array S, where |D_i| is the size of dimension D_i.
N_z is defined to be the number of populated (nonzero) entries in S.
density(S) = N_z / N
Range Sum Queries
An important class of aggregation queries are the so-called range sum queries, which are defined by applying the sum operation over a selected contiguous range in the domain of some of the attributes.
A range sum query can generally be formulated as follows:
sum(l_1:h_1, ..., l_d:h_d) = Σ_{i_1=l_1}^{h_1} ... Σ_{i_d=l_d}^{h_d} S(i_1, ..., i_d)
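To make the definition concrete, here is a brute-force sketch in Python (our illustration; the dict-of-tuples sparse representation and the function name are ours, not the paper's). The whole point of the compact data cube is to avoid this exhaustive scan:

```python
# Brute-force evaluation of sum(l1:h1, ..., ld:hd) over a sparse
# d-dimensional array, stored as a dict mapping index tuples to values.
from itertools import product

def range_sum(S, ranges):
    """S: dict mapping (i1, ..., id) tuples to measure values.
    ranges: list of (l, h) inclusive bounds, one pair per dimension."""
    return sum(S.get(idx, 0)
               for idx in product(*[range(l, h + 1) for (l, h) in ranges]))

# Sparse 2-D example with three populated cells.
S = {(0, 0): 5, (1, 2): 3, (3, 3): 7}
print(range_sum(S, [(0, 1), (0, 2)]))  # cells (0,0) and (1,2) fall in range
# → 8
```

Note that the cost grows with the volume of the query range rather than with N_z, which is why precomputation is needed.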
The d'-Dimensional Range Sum Queries
An interesting subset of the general range sum queries are the d'-dimensional range sum queries, in which d' << d.
In this case ranges are specified for only d' dimensions, and the ranges for the other d - d' dimensions are implicitly set to be the entire domain all_i = {0, ..., |D_i| - 1}:
sum(l_1:h_1, ..., l_d':h_d') = sum(l_1:h_1, ..., l_d':h_d', all_{d'+1}, ..., all_d)
Traditional vs. New Approach
In traditional approaches to answering range sum queries using the data cube, all the subcubes of the data cube need to be precomputed and stored. When a query is given, a search is conducted in the data cube and the relevant information is fetched.
In the new approach, as usual, some preprocessing work is done on the original array, but instead of computing and storing all the subcubes, only one much smaller compact data cube is stored. The compact data cube usually fits in one or a small number of disk blocks.
Approximation Advantages
This approach is preferable to the traditional approaches in two important respects:
Storage space, both for the precomputation and for storing the precomputed data cube.
Time: even when a huge amount of storage space is available and all the data cubes can be stored comfortably, it may take too long to answer a range sum query, since all cells covered by the range need to be accessed.
I/O Model
The conventional parallel disk model, with the restriction I = 1.
The Method Outline
The method can be divided into three sequential phases:
1. Decomposition
2. Thresholding
3. Reconstruction
Decomposition
• Compute the wavelet decomposition of the multidimensional array S.
• Obtain a set of C' wavelet coefficients (C' ~ N_z), since, as in practice, it is assumed that the array is very sparse.
Thresholding and Ranking
• Keep only C (C << C') wavelet coefficients, corresponding to the desired storage usage and accuracy.
• Rank the C wavelet coefficients according to their importance in the context of accurately answering typical aggregation queries.
• The C ordered coefficients compose the compact data cube.
Reconstruction
• In the on-line phase, an aggregation query is processed by using the K most significant coefficients to reconstruct an approximate answer.
• The choice of K depends upon the time the user is willing to spend.
Notes:
More accurate answers can be provided upon request.
Efficiency is crucial, since it affects the query response time directly.
Wavelet Decomposition
Wavelets are a mathematical tool for the hierarchical decomposition of functions in a space efficient manner.
HAAR Wavelets:
• Conceptually very simple wavelet basis functions
• Fast to compute
• Easy to implement
HAAR Wavelet - Example
Suppose we have a one dimensional signal of N = 8 data items:
S = [2, 2, 0, 2, 3, 5, 4, 4]
Averaging neighboring pairs gives the lower-resolution signal [2, 1, 4, 4], with detail coefficients [0, -1, -1, 0] (half the difference of each pair).
By repeating this process recursively on the averages, we get the full decomposition, the wavelet transform.
Wavelet Transform
The wavelet transform is a single coefficient representing the overall average of the original signal, followed by the detail coefficients in order of increasing resolution:
Ŝ = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
The individual entries are called the wavelet coefficients. Coefficients at the lower resolutions are weighted more than those at the higher resolutions. The decomposition is very efficient:
O(N) CPU time
O(N/B) I/Os
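The averaging/differencing procedure above can be sketched in a few lines of Python (a minimal illustration, assuming the signal length is a power of two):

```python
# 1-D Haar wavelet decomposition: pairwise averaging with detail
# coefficients, applied recursively to the averages.
def haar_decompose(signal):
    """Return the Haar wavelet transform of a signal whose length is a power of 2."""
    coeffs = list(signal)
    details = []
    while len(coeffs) > 1:
        averages = [(a + b) / 2 for a, b in zip(coeffs[::2], coeffs[1::2])]
        # Detail coefficient: half the difference of each pair.
        level_details = [(a - b) / 2 for a, b in zip(coeffs[::2], coeffs[1::2])]
        details = level_details + details  # finer resolutions go last
        coeffs = averages
    return coeffs + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# → [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

This reproduces the transform Ŝ of the example signal S above.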
Building The Compact Data Cube
The goal of this step is to compute the wavelet decomposition of the multidimensional array S, obtaining a set of C' wavelet coefficients.
1. Partition the d dimensions into g groups G_j = {D_{i_{j-1}+1}, ..., D_{i_j}}, for some 1 ≤ g ≤ d, where i_0 = 0 and i_g = d. Each G_j must satisfy |D_{i_{j-1}+1}| × ... × |D_{i_j}| ≤ M/(2B).
2. The algorithm for constructing the compact data cube consists of g passes:
G_j is read into memory
a multidimensional decomposition is performed
the results are written out to be used for the next pass
Eliminating Intermediate Results
One problem is that the density of the intermediate results will increase from pass to pass, since performing wavelet decomposition on sparse data usually results in more nonzero data.
The natural solution is truncation, keeping roughly only N_z entries.
Learning process:
• During each pass, on-line statistics of the wavelet coefficients are kept to maintain a cutoff value.
• Any entry whose absolute value is below the cutoff value is thrown away on the fly.
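One way to sketch this on-the-fly truncation is with a bounded min-heap whose root acts as the running cutoff (the heap-based policy is our illustration, not necessarily the paper's exact statistic):

```python
# Stream intermediate wavelet coefficients, keeping only the n_keep
# largest in absolute value; the heap root is the current cutoff value.
import heapq

def truncate_stream(coefficients, n_keep):
    kept = []  # min-heap of (|value|, value) pairs
    for c in coefficients:
        if len(kept) < n_keep:
            heapq.heappush(kept, (abs(c), c))
        elif abs(c) > kept[0][0]:          # above the cutoff: keep it
            heapq.heapreplace(kept, (abs(c), c))
        # else: the entry is thrown away on the fly
    return [c for _, c in kept]

print(sorted(truncate_stream([0.1, -5.0, 2.5, 0.3, -0.2, 4.0], 3)))
# → [-5.0, 2.5, 4.0]
```

With n_keep ≈ N_z this keeps the intermediate results from growing denser from pass to pass.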
Thresholding and Ranking
Given the storage limitation for the compact data cube, it is possible to keep only a limited number of wavelet coefficients. Let:
C' - the number of wavelet coefficients computed.
C - the number of wavelet coefficients that can be stored.
Since C << C', the goal is to determine which are the best C coefficients to keep, so as to minimize the error of approximation.
P-norm
Once the error measure is decided for individual queries, it is meaningful to choose a norm by which to measure the error of a collection of queries.
Let e = (e_1, e_2, ..., e_Q) be the vector of errors over a sequence of Q queries; its p-norm is ||e||_p = (Σ_{i=1}^{Q} |e_i|^p)^{1/p}.
Choosing the Coefficients
Choosing the C largest (in absolute value) wavelet coefficients after normalization is provably optimal in minimizing the 2-norm.
But if coefficient c_i is more likely to contribute to typical queries than another, its weight w(c_i) should be greater, where w(c_i) counts the nonzero contributions of c_i over a set of k typical queries:
w(c_i) = Σ_{j=1}^{k} [contribution of c_i to query j ≠ 0]
Finally:
1. Pick the C'' (C < C'' < C') largest wavelet coefficients.
2. Among the C'' coefficients, choose the C with the largest weights.
3. Order the C coefficients in decreasing order to get the compact data cube.
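The three-step selection can be sketched as follows; the weight function here (counting contributions to a set of sample queries) is our hypothetical stand-in for the paper's weighting:

```python
# Two-stage coefficient selection: magnitude cutoff, then weight ranking.
def build_compact_cube(coeffs, C, C2, weight):
    """coeffs: list of (index, value) pairs; C2 plays the role of C''.
    1. keep the C'' largest by |value|;
    2. among them, keep the C with the largest weight;
    3. return them ordered by decreasing weight."""
    largest = sorted(coeffs, key=lambda t: abs(t[1]), reverse=True)[:C2]
    by_weight = sorted(largest, key=lambda t: weight(t[0]), reverse=True)
    return by_weight[:C]

coeffs = [(0, 9.0), (1, -7.0), (2, 0.5), (3, 4.0), (4, -3.0)]
weights = {0: 5, 1: 1, 3: 4, 4: 2}  # hypothetical per-coefficient weights
print(build_compact_cube(coeffs, C=2, C2=4, weight=lambda i: weights.get(i, 0)))
# → [(0, 9.0), (3, 4.0)]
```

Note that coefficient 1 survives the magnitude cutoff but loses to coefficient 3 on weight.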
Answering On-Line Queries
The error tree is built based upon the wavelet transform procedure; it mirrors the wavelet transform and is a bottom-up process.
S(l:h) denotes the range sum between S(l) and S(h):
S(l:h) = Σ_{i=l}^{h} S(i)
Constructing The Original Signal
The original signal S can be constructed from the tree nodes by the following formulas:
S(0) = Ŝ(0) + Ŝ(1) + Ŝ(2) + Ŝ(4)
S(1) = Ŝ(0) + Ŝ(1) + Ŝ(2) - Ŝ(4)
S(2) = Ŝ(0) + Ŝ(1) - Ŝ(2) + Ŝ(5)
S(3) = Ŝ(0) + Ŝ(1) - Ŝ(2) - Ŝ(5)
S(4) = Ŝ(0) - Ŝ(1) + Ŝ(3) + Ŝ(6)
S(5) = Ŝ(0) - Ŝ(1) + Ŝ(3) - Ŝ(6)
(and similarly S(6) and S(7), using Ŝ(7))
Not all terms are always evaluated; only the true contributors are quickly evaluated when answering a query.
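The signed-sum structure of these formulas can be checked with a short sketch (our code) that walks the error tree for the N = 8 Haar example, using the example transform Ŝ = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]:

```python
# Reconstruct one original value S(i) from the Haar transform by walking
# the error tree: each detail coefficient adds on its left half and
# subtracts on its right half.
def haar_reconstruct_one(coeffs, i):
    n = len(coeffs)
    value = coeffs[0]          # overall average, Ŝ(0)
    node, lo, hi = 1, 0, n     # walk down toward index i
    while hi - lo > 1:
        mid = (lo + hi) // 2
        value += coeffs[node] if i < mid else -coeffs[node]
        if i < mid:
            node, hi = 2 * node, mid
        else:
            node, lo = 2 * node + 1, mid
    return value

s_hat = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print([haar_reconstruct_one(s_hat, i) for i in range(8)])
# → [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
```

Each value touches only log2(N) + 1 coefficients, the "true contributors" mentioned above.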
Answering A Query
To answer a query of the form sum(l_1:h_1, ..., l_d':h_d') using k coefficients of the compact data cube R, the following algorithm is used:
AnswerQuery(R, k, l_1, h_1, ..., l_d', h_d')
  answer = 0
  for i = 1, 2, ..., k do
    if Contribute(R[i], l_1, h_1, ..., l_d', h_d') then
      answer = answer + Compute_Contribute(R[i], l_1, h_1, ..., l_d', h_d')
  for j = d'+1, ..., d do
    answer = answer × |D_j|
  return answer
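A one-dimensional sketch of this idea (our illustrative code, not the paper's implementation): each kept coefficient contributes its value times the signed overlap of the query range with its error-tree interval:

```python
# Approximate sum(l:h) using only the k largest-magnitude Haar coefficients.
def approx_range_sum(coeffs, l, h, k):
    n = len(coeffs)
    # Keep the k most significant coefficient indices by absolute value.
    kept = set(sorted(range(n), key=lambda j: abs(coeffs[j]), reverse=True)[:k])
    total = (h - l + 1) * coeffs[0] if 0 in kept else 0.0
    # Walk the error tree: a detail coefficient at node j adds over the
    # left half of its interval and subtracts over the right half.
    stack = [(1, 0, n)]
    while stack:
        node, lo, hi = stack.pop()
        if node >= n or hi - lo < 2 or hi <= l or lo > h:
            continue  # leaf or disjoint from the query: contributes nothing
        mid = (lo + hi) // 2
        if node in kept:
            left = max(0, min(h, mid - 1) - max(l, lo) + 1)
            right = max(0, min(h, hi - 1) - max(l, mid) + 1)
            total += coeffs[node] * (left - right)
        stack += [(2 * node, lo, mid), (2 * node + 1, mid, hi)]
    return total

s_hat = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(approx_range_sum(s_hat, 2, 5, 8))  # all 8 coefficients: exact answer
# → 10.0
```

With k < C' the result is approximate; with all coefficients it reproduces the exact range sum (here 0 + 2 + 3 + 5 = 10 over the example signal).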
Experiments Description
The experiments were performed using real-world data from the U.S. Census Bureau.
• The data file contains 372 attributes. The measure attribute is income. Functional attributes include, among others: age, sex, education, race, origin.
• Although the dimension sizes are generally small, the high dimensionality results in a 10-dimensional array with more than 16,000,000 cells, density ~ 0.001, N_z = 15,985.
• Platform: Digital Alpha workstation running Digital Unix 4.0; 512 MB internal memory (only 1-10 MB are used for the program); logical block transfer size 2 × 4 KB.
Experiments Sets - Variable Density
Dimension groups were partitioned to satisfy the M/(2B) condition. For all data sets g = 2; the small differences in running time were mainly caused by the on-line cutoff effect.
[Parameter table (garbled in transcription): d, N, N_z, S (MB), M (MB) for the variable-density data sets.]
Experiments Sets - Fixed Density
Running time scales almost linearly with respect to the input data size.
[Parameter table (garbled in transcription): d, N_z, density = 0.001, S (MB), M (MB) for the fixed-density data sets.]
Accuracy of the Approximate Answers
Comparison with traditional histograms is not meaningful, because they are too inefficient to construct for high dimensional data.
Comparison with random sampling algorithms depends on the distribution of the nonzero entries (random sampling performs better for uniform distributions).
Summary
A new wavelet technique for approximate answers to OLAP range sum queries was presented.
Four important issues were discussed and resolved:
I/O efficiency of the data cube construction, especially when the underlying multidimensional array is very sparse.
Response time in answering an on-line query.
Accuracy in answering typical OLAP queries.
Progressive refinement.