Post on 19-Dec-2015
Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences
Authors: Flip Korn, H. V. Jagadish, Christos Faloutsos
From ACM SIGMOD 1997
Hongtao Cheng
Outline
  Introduction
  Alternative Methods
  SVD and SVDD
  Experiments
  Conclusions and summary
Introduction - datasets
Datasets we are dealing with
  N x M matrix: N is the number of time sequences (rows) and M is the duration of each sequence (columns). x_i,j is the value of sequence i at time j.
  The matrix is huge (gigabytes).
  Number of rows N >> number of columns M: N is O(10^6), M is O(100).
  No updates to the matrix, or only rare updates.
Introduction – a sample database
Customer    We 7/10  Th 7/11  Fr 7/12  Sa 7/13  Su 7/14
ABC Inc.       1        1        1        0        0
DEF Ltd.       2        2        2        0        0
GHI Inc.       1        1        1        0        0
KLM Co.        5        5        5        0        0
Smith          0        0        0        2        2
Johnson        0        0        0        3        3
Thompson       0        0        0        1        1
Table 1: Example of a (customer-day) matrix
Query 1: What was the amount of sales to GHI Inc. on July 11, 1996?
Query 2: Find the total sales to business customers (ABC, DEF, GHI, and KLM) for the week ending July 12, 1996.
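To make the two query types concrete, here is a minimal sketch that runs them against the uncompressed Table 1 matrix. The helper names `cell` and `total` are illustrative, and the sketch assumes that "the week ending July 12" covers only the days of that week that appear in the table.

```python
# Customer-day matrix from Table 1: one row per customer, one column per day.
DAYS = ["We 7/10", "Th 7/11", "Fr 7/12", "Sa 7/13", "Su 7/14"]
SALES = {
    "ABC Inc.": [1, 1, 1, 0, 0],
    "DEF Ltd.": [2, 2, 2, 0, 0],
    "GHI Inc.": [1, 1, 1, 0, 0],
    "KLM Co.":  [5, 5, 5, 0, 0],
    "Smith":    [0, 0, 0, 2, 2],
    "Johnson":  [0, 0, 0, 3, 3],
    "Thompson": [0, 0, 0, 1, 1],
}

def cell(customer, day):
    """Query 1 style: a single cell x[i][j]."""
    return SALES[customer][DAYS.index(day)]

def total(customers, days):
    """Query 2 style: an aggregate over a set of rows and columns."""
    return sum(cell(c, d) for c in customers for d in days)

q1 = cell("GHI Inc.", "Th 7/11")
business = ["ABC Inc.", "DEF Ltd.", "GHI Inc.", "KLM Co."]
# Assumption: only the days of that week present in Table 1 are summed.
q2 = total(business, ["We 7/10", "Th 7/11", "Fr 7/12"])
print(q1, q2)  # 1 27
```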
Introduction …
The reality
  Data is compressed.
  Accessing specific data is very difficult.
  Decision support and data mining require the ability to perform ad hoc queries.
Solution
  "Processing run" (inefficient, limited, accurate)
  Quick reconstruction of compressed data (efficient, "random access", loss of accuracy)
SVD is the chosen technique for this paper.
Alternative methods
  String Compression
  Clustering
  Spectral Methods
  SVD & SVDD
String Compression (lossless)
Algorithms: Lempel-Ziv algorithm, Huffman coding, arithmetic coding.
The whole database must be uncompressed to get the value of a single cell in the matrix.
Works fine with a continuous stream of queries.
Enhancement
  Segment the data and compress each segment independently.
  Helps when most queries follow a particular form.
  Not effective for real ad hoc querying.
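The segmentation enhancement can be sketched as follows. This is only an illustration, not the setup used in the paper: it assumes one segment per row, stores values as 8-byte floats, and uses Python's `zlib` (DEFLATE, which combines LZ77 with Huffman coding) as a stand-in for the compressors listed above. The function names are hypothetical.

```python
import struct
import zlib

def compress_rows(matrix):
    """Compress each row (one segment per time sequence) independently."""
    return [zlib.compress(struct.pack(f"{len(row)}d", *row)) for row in matrix]

def get_cell(segments, i, j, m):
    """Decompress only segment i, then read entry j of its m values."""
    row = struct.unpack(f"{m}d", zlib.decompress(segments[i]))
    return row[j]

matrix = [[1.0, 1.0, 1.0, 0.0, 0.0],
          [2.0, 2.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 2.0, 2.0]]
segments = compress_rows(matrix)
value = get_cell(segments, 1, 2, m=5)   # touches one segment only
print(value)
```

With row-wise segments, a query about one customer decompresses a single segment, but a query over one day across all customers still has to decompress every segment. That is the sense in which segmentation favors one query form over truly ad hoc access.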
Clustering
Algorithm: find the cluster representative for the i-th customer and return its j-th entry as the value of cell x_i,j. In short, x_i,j = f(i, j).
Widely used in information retrieval for grouping and pattern matching, and in the social and natural sciences for statistical analysis.
Does not scale up in our case.
An off-the-shelf clustering method is used for the experiment.
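A minimal sketch of the lookup x_i,j = f(i, j) described above. It assumes the clustering step has already produced an assignment of customers to clusters (the `assignment` list below is made up for the Table 1 data), and it uses the cluster mean as the representative, which is one common choice rather than necessarily the one used in the experiments.

```python
def cluster_representatives(matrix, assignment, k):
    """Mean sequence of each cluster (assumes no cluster is empty)."""
    m = len(matrix[0])
    sums = [[0.0] * m for _ in range(k)]
    counts = [0] * k
    for row, c in zip(matrix, assignment):
        counts[c] += 1
        for j, v in enumerate(row):
            sums[c][j] += v
    return [[s / counts[c] for s in sums[c]] for c in range(k)]

def approx_cell(reps, assignment, i, j):
    """x[i][j] is approximated by entry j of customer i's representative."""
    return reps[assignment[i]][j]

matrix = [[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
          [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]]
assignment = [0, 0, 0, 0, 1, 1, 1]   # hypothetical: business vs. residential
reps = cluster_representatives(matrix, assignment, k=2)

print(approx_cell(reps, assignment, 3, 0))  # KLM Co., We 7/10: 2.25 (true value 5)
print(approx_cell(reps, assignment, 4, 3))  # Smith, Sa 7/13: 2.0 (true value 2)
```

Only the k representatives and one cluster id per customer need to be stored, but every customer in a cluster gets the same reconstructed sequence, so customers with the same shape and a different scale (KLM Co. here) are reconstructed poorly.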
Spectral Methods
Algorithm: DFT (discrete Fourier transform) and related methods (DCT, DWT).
Widely used in signal processing.
Comparison with SVD
  Spectral methods perform poorly on spikes or abrupt jumps in the input signal. SVD handles these well.
  SVD can be applied to heterogeneous, M-dimensional vectors. Spectral methods cannot.
The DCT is used for the experiment.
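The weakness on spikes can be illustrated with a small sketch: compress a sequence by keeping only its first k DCT coefficients, then reconstruct. The code below is a plain, unoptimized orthonormal DCT-II written for illustration; the two test sequences are made up.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c, n):
    """Inverse of dct(); coefficients beyond len(c) are treated as zero."""
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
                 for k in range(1, len(c)))
        out.append(s)
    return out

def max_error(x, k):
    """Largest cell error after keeping only the first k >= 1 coefficients."""
    approx = idct(dct(x)[:k], len(x))
    return max(abs(a - b) for a, b in zip(approx, x))

smooth = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
spike = [0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0]
print(max_error(smooth, 2), max_error(spike, 2))
```

With two coefficients the smooth ramp is reconstructed closely, while the spike is smeared across the whole sequence, because its energy is spread over all frequencies.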
SVD and SVDD
SVD – Singular value decomposition
Usage:
  Statistical analysis
  Text retrieval
  Pattern recognition
  Dimensionality reduction
  Face recognition
Particularly useful in linear regression and matrix approximation.
SVD – intuition behind SVD
In an N x M matrix X, the values x_i,j can be grouped together into a "pattern" or "principal component".
For M = 2 in Figure 1, x' gives the "best" axis onto which to project the values.
Algorithm of SVD - Theorem
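The formula on this slide did not survive extraction. For reference, the standard statement of the SVD theorem is given below. The symbol Λ for the diagonal matrix and r for the rank of X are notational assumptions, chosen to match the U and V used on the following slides.

```latex
X = U \,\Lambda\, V^{T}, \qquad
U \in \mathbb{R}^{N \times r},\quad
\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_r),\quad
V \in \mathbb{R}^{M \times r},
\qquad
U^{T}U = V^{T}V = I_r,\quad
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0 .
```

Keeping only the first k columns of U and V and the first k diagonal entries of Λ gives a rank-k approximation of X.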
Algorithm of SVD – an example
U: customer-to-pattern similarity matrix
V: day-to-pattern similarity matrix
Observation:
  The unit vectors v_j correspond to the directions for optimal projection of the given set of points.
  The i-th row vector of U x Λ (Λ being the diagonal matrix of singular values) gives the coordinates of the i-th data vector ("customer").
Algorithm of SVD
V and Λ (the diagonal matrix of singular values) are pinned in memory.
Reconstructing a cell requires O(k) compute time, independent of N and M.
Only one disk access is required to perform this reconstruction.
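A sketch of the cell reconstruction, using the standard truncated-SVD formula x̂_i,j = Σ_{m=1..k} λ_m · u_i,m · v_j,m. The factors below were derived by hand for the Table 1 matrix (which has rank 2) and are included only for illustration. In a real system `U[i]` would be the single row fetched from disk, while `LAMBDA` and `V` stay in memory.

```python
import math

# Hand-derived SVD factors of the Table 1 customer-day matrix (rank 2).
# Pattern 1 = weekday (business) activity, pattern 2 = weekend activity.
LAMBDA = [math.sqrt(93), math.sqrt(28)]                      # singular values
V = ([[1 / math.sqrt(3), 0.0] for _ in range(3)] +
     [[0.0, 1 / math.sqrt(2)] for _ in range(2)])            # M x 2
U = ([[c / math.sqrt(31), 0.0] for c in (1, 2, 1, 5)] +
     [[0.0, c / math.sqrt(14)] for c in (2, 3, 1)])          # N x 2

def reconstruct(u_row, lam, v, j, k):
    """Approximate x[i][j] from row i of U using the first k patterns: O(k) work."""
    return sum(lam[m] * u_row[m] * v[j][m] for m in range(k))

klm_wed = reconstruct(U[3], LAMBDA, V, 0, k=2)       # KLM Co., We 7/10 (true 5)
johnson_sat = reconstruct(U[5], LAMBDA, V, 3, k=2)   # Johnson, Sa 7/13 (true 3)
johnson_k1 = reconstruct(U[5], LAMBDA, V, 3, k=1)    # weekend pattern dropped
print(round(klm_wed, 6), round(johnson_sat, 6), johnson_k1)
```

With k = 2 the toy matrix is recovered up to floating-point rounding. With k = 1 the weekend pattern is lost and Johnson's cell reconstructs to 0, which is the kind of gross per-cell error that plain SVD can produce.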
Algorithm of SVDD
Singular Value Decomposition with Deltas
  Maintain a set of triples of the form (row, column, delta).
  Delta is the difference between the actual value and the value that SVD reconstructs.
  Cleans up gross errors.
Algorithm of SVDD …
Data structure of SVDD
  U
  k_opt eigenvalues
  V
  Additionally, store k_opt triples of the form (row, column, delta).
Reconstruction
  One disk access to fetch the i-th row of U
  One disk access to fetch the delta (using a hash table)
Tradeoff: store outlier data
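The delta mechanism can be sketched on the Table 1 matrix. The sketch assumes a rank-1 SVD (hand-derived factors, weekday pattern only) and a hypothetical budget of two deltas. A Python `dict` keyed by (row, column) stands in for the hash table, and the names `build_deltas` and `svdd_cell` are illustrative.

```python
import math

X = [[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
     [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]]

# Rank-1 SVD factors of X (hand-derived; only the weekday pattern is kept).
LAM = math.sqrt(93)
U1 = [c / math.sqrt(31) for c in (1, 2, 1, 5, 0, 0, 0)]
V1 = [c / math.sqrt(3) for c in (1, 1, 1, 0, 0)]

def svd_cell(i, j):
    """Plain SVD reconstruction of x[i][j] with k = 1."""
    return LAM * U1[i] * V1[j]

def build_deltas(x, budget):
    """Keep deltas for the `budget` cells with the largest reconstruction error."""
    errs = [(abs(x[i][j] - svd_cell(i, j)), i, j)
            for i in range(len(x)) for j in range(len(x[0]))]
    errs.sort(reverse=True)
    return {(i, j): x[i][j] - svd_cell(i, j) for _, i, j in errs[:budget]}

def svdd_cell(deltas, i, j):
    """SVD reconstruction plus the stored delta, if this cell is an outlier."""
    return svd_cell(i, j) + deltas.get((i, j), 0.0)

deltas = build_deltas(X, budget=2)
print(sorted(deltas))            # [(5, 3), (5, 4)]: Johnson's weekend cells
print(svdd_cell(deltas, 5, 3))   # 3.0, exact thanks to the delta
print(svdd_cell(deltas, 4, 3))   # 0.0, Smith's cell (true value 2) has no delta
```

Each stored delta costs extra space, so the budget trades principal components against outlier corrections. That is the tradeoff named on the slide.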
Experiment
Two types of queries
  Specific data element query
  Aggregation query
Two datasets
  Phone100k (0.2 Gigabytes)
  Stocks (341 Kbytes)
Error measurement method: RMSPE
Four compression methods
  Hierarchical clustering: (b*k*M + N*b) bytes for k clusters
  DCT: (N*k*b) bytes for k coefficients
  SVD: (N*k+k+k*M) bytes for k principal components
  SVDD: (N*k+k+k*M+D*O(b)) bytes for k principal components
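The storage formulas can be turned into a quick compression-ratio calculation. Several things here are assumptions rather than facts from the slides: b is taken to be the number of bytes per stored number, the SVD count N*k + k + k*M is multiplied by b (the slide omits this factor), the matrix dimensions are made-up values of the order given in the introduction, and RMSPE is taken to be the root-mean-square cell error normalized by the standard deviation of the cell values.

```python
import math

def clustering_bytes(n, m, k, b):
    return b * k * m + n * b      # k representatives plus one cluster id per row

def dct_bytes(n, k, b):
    return n * k * b              # k coefficients per sequence

def svd_bytes(n, m, k, b):
    # Assumption: each of the N*k + k + k*M stored numbers takes b bytes.
    return (n * k + k + k * m) * b

def compression_ratio(n, m, b, compressed_bytes):
    """Raw matrix size (N*M*b bytes) divided by the compressed size."""
    return n * m * b / compressed_bytes

def rmspe(x, x_hat):
    """Assumed definition: RMS cell error over the std. dev. of the cells."""
    cells = [v for row in x for v in row]
    approx = [v for row in x_hat for v in row]
    mean = sum(cells) / len(cells)
    num = sum((a - v) ** 2 for a, v in zip(approx, cells))
    den = sum((v - mean) ** 2 for v in cells)
    return math.sqrt(num / den)

# Hypothetical sizes: N = 10^6 sequences, M = 100 time points, 4-byte numbers.
n, m, b, k = 1_000_000, 100, 4, 10
size = svd_bytes(n, m, k, b)
print(size, round(compression_ratio(n, m, b, size), 2))  # 40004040 10.0
```

Because N >> M, the N*k term dominates, so under these assumptions the ratio for plain SVD is close to M/k.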
Output comparison of four methods
SVDD did best. Kopt = Kmax.
DCT did not do well. It did better on "stocks" than on "phone2000".
Plain SVD and clustering were close to each other.
SVDD gives a satisfactory result: 2% error at a 10:1 compression ratio (CR) and 10% error at 50:1 CR.
Errors of SVD and SVDD
The worst-case error in SVD is very large.
SVDD bounds the error well.
Observation of errors
Steep initial drop in error.
Most matrix cells have an error substantially less than the mean error (RMSPE).
SVDD gets rid of the worst-case cell errors and gives a close approximation.
Error for aggregate queries (SVDD)
Normalized query error Qerr.
50 queries, with approximately 10% of the data cells included.
The error was well under 0.5% even at 50:1 CR.
Estimates of answers to aggregate queries can also be obtained through sampling.
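A small sketch of why aggregate answers can be far more accurate than individual cells: per-cell errors of opposite sign partly cancel in a sum. The slide does not spell out the definition of Qerr, so it is assumed here to be the relative error of the aggregate answer. The matrices are made-up values, not data from the experiments.

```python
def q_err(exact, approx):
    """Assumed definition: relative error of an aggregate answer."""
    return abs(approx - exact) / abs(exact)

def aggregate(matrix, rows, cols):
    """Sum of the selected cells."""
    return sum(matrix[i][j] for i in rows for j in cols)

X = [[10.0, 12.0], [8.0, 10.0]]
X_HAT = [[11.0, 11.0], [7.5, 10.7]]   # hypothetical lossy reconstruction

rows, cols = [0, 1], [0, 1]
err = q_err(aggregate(X, rows, cols), aggregate(X_HAT, rows, cols))
worst_cell = max(abs(X_HAT[i][j] - X[i][j]) / X[i][j] for i in rows for j in cols)
print(round(err, 4), round(worst_cell, 4))  # 0.005 0.1
```

In this toy case the worst single cell is off by 10%, while the sum over all four cells is off by 0.5%.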
Scale-up (SVDD)
Error is around 2% at the 10:1 CR.
The graphs are homogeneous.
Scale-up (SVD vs. SVDD)
Error of SVD increases with dataset size.
Error of SVDD remains constant with dataset size.
Conclusion
Lossy compression problem and its solutions
  Signal processing
  Pattern recognition
  Information retrieval (clustering)
  Matrix algebra (SVD)
SVD algorithm
SVDD properties
  Excellent compression rate and satisfactory results
  Bounds the worst-case error of individual data values well
  Only three passes over the dataset
  Dimensionality reduction of the given dataset
  Arbitrary vectors can be handled without additional effort.