Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences
Authors: Flip Korn, H. V. Jagadish, Christos Faloutsos
From ACM SIGMOD 1997
Hongtao Cheng

Posted on 19-Dec-2015




Outline

  Introduction
  Alternative Methods
  SVD and SVDD
  Experiments
  Conclusions and summary


Introduction - datasets

Datasets we are dealing with
  N x M matrix: N is the number of time sequences (rows), M is the duration (number of time points, columns), and x_i,j is the value of sequence i at time j.
  The matrix is huge (gigabytes).
  Number of rows N >> number of columns M: N is O(10^6), M is O(100).
  No updates to the matrix, or only rare updates.


Introduction – a sample database

Customer    We 7/10  Th 7/11  Fr 7/12  Sa 7/13  Su 7/14
ABC Inc.    1        1        1        0        0
DEF Ltd.    2        2        2        0        0
GHI Inc.    1        1        1        0        0
KLM Co.     5        5        5        0        0
Smith       0        0        0        2        2
Johnson     0        0        0        3        3
Thompson    0        0        0        1        1

Table 1: Example of a (customer-day) matrix

Query 1: What was the amount of sales to GHI Inc. on July 11, 1996?
Query 2: Find the total sales to business customers (ABC, DEF, GHI, and KLM) for the week ending July 12, 1996.
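
The two sample queries can be sketched directly against the Table 1 matrix. This is only an illustration on the uncompressed data; the names `sales`, `cell`, and `total` are made up for the example, and "the week ending July 12" is assumed to mean the three days up to 7/12 that appear in the table.

```python
# Customer-day matrix from Table 1 (rows: customers, columns: days).
days = ["We 7/10", "Th 7/11", "Fr 7/12", "Sa 7/13", "Su 7/14"]
sales = {
    "ABC Inc.": [1, 1, 1, 0, 0],
    "DEF Ltd.": [2, 2, 2, 0, 0],
    "GHI Inc.": [1, 1, 1, 0, 0],
    "KLM Co.":  [5, 5, 5, 0, 0],
    "Smith":    [0, 0, 0, 2, 2],
    "Johnson":  [0, 0, 0, 3, 3],
    "Thompson": [0, 0, 0, 1, 1],
}

def cell(customer, day):
    """Query 1 style: look up a single cell of the matrix."""
    return sales[customer][days.index(day)]

def total(customers, day_list):
    """Query 2 style: aggregate over a set of rows and a set of columns."""
    cols = [days.index(d) for d in day_list]
    return sum(sales[c][j] for c in customers for j in cols)

q1 = cell("GHI Inc.", "Th 7/11")
q2 = total(["ABC Inc.", "DEF Ltd.", "GHI Inc.", "KLM Co."],
           ["We 7/10", "Th 7/11", "Fr 7/12"])  # assumed reading of "week ending 7/12"
print(q1, q2)  # 1 27
```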


Introduction …

The reality
  Data is compressed.
  Accessing specific data is very difficult.
  Decision support and data mining require the ability to perform ad hoc queries.
Solution
  “Processing run” (inefficient, limited, accurate)
  Quick reconstruction of compressed data (efficient, “random access”, loss of accuracy)
SVD is the chosen technique for this paper


Alternative methods

  String Compression
  Clustering
  Spectral Methods
  SVD & SVDD


String Compression (lossless)

Algorithms: Lempel-Ziv algorithm, Huffman coding, arithmetic coding.
The whole database must be uncompressed to get the value of one cell in the matrix.
Works fine with a continuous stream of queries.
Enhancement
  Segment the data and compress each segment independently.
  Helps when most queries follow a particular form.
Not effective for real ad hoc querying
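
A minimal sketch of the segmenting enhancement, assuming one segment per row (one time sequence) and using Python's standard `zlib` module, whose DEFLATE format combines LZ77 with Huffman coding. The segment layout, the 32-bit encoding, and the names `segments` and `get_cell` are assumptions for illustration, not something the slides specify.

```python
import struct
import zlib

# Hypothetical layout: each row (one time sequence) is one segment.
matrix = [
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
]
M = len(matrix[0])

def pack_row(row):
    # Encode a row as fixed-width little-endian 32-bit integers.
    return struct.pack("<%di" % len(row), *row)

# Compress each segment independently.
segments = [zlib.compress(pack_row(row)) for row in matrix]

def get_cell(i, j):
    """Decompress only segment i, then read entry j from it."""
    raw = zlib.decompress(segments[i])
    return struct.unpack("<%di" % M, raw)[j]

print(get_cell(2, 1))  # 5
```

A query for one row touches one segment, which is the favorable query form. A query for one column still has to decompress every segment, which is why this layout does not help true ad hoc access.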


Clustering

Algorithm: find the cluster representative for the i-th customer and return its j-th entry as the value of cell x_i,j. In short, x_i,j = f(i, j).
Widely used in information retrieval for grouping and pattern matching, and in the social and natural sciences for statistical analysis.
Does not scale up in our case.
An off-the-shelf clustering method is used for the experiment.
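
The lookup x_i,j = f(i, j) can be sketched as follows. The cluster assignment and the use of the cluster mean as representative are assumed for illustration on the Table 1 matrix; the slides do not say which clustering method or representative was used.

```python
# Hypothetical result of a clustering run on the Table 1 matrix:
# each customer is assigned to a cluster, and each cluster keeps one
# representative sequence (here, the mean of its members).
representatives = [
    [2.25, 2.25, 2.25, 0.0, 0.0],  # cluster 0: mean of ABC, DEF, GHI, KLM
    [0.0, 0.0, 0.0, 2.0, 2.0],     # cluster 1: mean of Smith, Johnson, Thompson
]
cluster_of = [0, 0, 0, 0, 1, 1, 1]  # one cluster id per customer (row)

def f(i, j):
    """Approximate x[i][j] by the j-th entry of row i's cluster representative."""
    return representatives[cluster_of[i]][j]

print(f(2, 1))  # 2.25, while the true value for GHI Inc. on Th 7/11 is 1
```

Only the k representatives and one cluster id per row are stored, so every customer in a cluster gets the same reconstructed sequence.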


Spectral Methods

Algorithm: DFT (discrete Fourier transform) and associated methods (DCT, DWT).
Widely used in signal processing.
Comparison with SVD
  Spectral methods (SMs) perform poorly on spikes or abrupt jumps in the input signal. SVD handles these well.
  SVD can be applied to heterogeneous, M-dimensional vectors. SMs can’t.
The DCT method is used for the experiment.
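
To illustrate the point about spikes, here is a small pure-Python sketch of DCT compression that keeps only the first k coefficients of a sequence. It uses the textbook orthonormal DCT-II and its inverse; the two test sequences and the helper names are invented for the example and are not the experiment's data.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a sequence."""
    m = len(x)
    out = []
    for k in range(m):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / m)
        out.append(scale * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / m)
                               for n in range(m)))
    return out

def idct(c, m):
    """Inverse of dct(); coefficients beyond len(c) are treated as zero."""
    x = []
    for n in range(m):
        x.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / m) * c[k]
                     * math.cos(math.pi * (n + 0.5) * k / m)
                     for k in range(len(c))))
    return x

def compress(x, k):
    """Keep only the first k DCT coefficients."""
    return dct(x)[:k]

def max_abs_error(x, k):
    approx = idct(compress(x, k), len(x))
    return max(abs(a - b) for a, b in zip(x, approx))

smooth = [float(n) for n in range(8)]              # slow ramp
spike = [0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0]   # single abrupt jump

print(max_abs_error(smooth, 3), max_abs_error(spike, 3))
```

With three coefficients the ramp is reproduced closely, while the spike's energy is spread over all coefficients and most of it is lost.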


SVD and SVDD

SVD – Singular value decomposition
Usage:
  Statistical analysis
  Text retrieval
  Pattern recognition
  Dimensionality reduction
  Face recognition
Particularly useful in linear regression and matrix approximation


SVD – intuition behind SVD

In the N x M matrix X, the values x_i,j can be grouped together into what is called a “pattern” or “principal component”.
For M = 2 in Figure 1, x’ gives the “best” axis onto which to project the values.


Algorithm of SVD - Theorem
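
For reference, the standard statement of the singular value decomposition for an N x M real matrix X is given below. The symbol r (the rank of X) and the notation Λ for the diagonal matrix of singular values are introduced here as assumptions about notation; the truncated sum on the second line is the usual rank-k approximation.

```latex
X = U \, \Lambda \, V^{t}, \qquad
U \in \mathbb{R}^{N \times r}, \;
\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_r), \;
V \in \mathbb{R}^{M \times r}, \qquad
U^{t} U = V^{t} V = I_r, \quad
\lambda_1 \ge \dots \ge \lambda_r > 0

X = \sum_{m=1}^{r} \lambda_m \, u_m \, v_m^{t}
\;\approx\; \sum_{m=1}^{k} \lambda_m \, u_m \, v_m^{t}, \qquad k \le r
```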


Algorithm of SVD – an example

U: customer-to-pattern similarity matrix
V: day-to-pattern similarity matrix
Observation:
  The v_j unit vectors correspond to the directions for optimal projection of the given set of points.
  The i-th row vector of U × Λ gives the coordinates of the i-th data vector (“customer”).


Algorithm of SVD

V and Λ are pinned in memory.
Reconstructing a cell requires O(k) compute time, independent of N and M.
Only one disk access is required to perform this reconstruction.
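
The O(k) reconstruction of a single cell can be sketched on the Table 1 matrix. Because that matrix has a simple block structure, its rank-2 SVD factors can be written in closed form; the in-memory lists below stand in for the disk-resident U and the memory-resident singular values and V, and the function name `reconstruct` is an assumption for illustration.

```python
import math

# Rank-2 SVD factors of the Table 1 customer-day matrix, written in closed
# form because the matrix is block-structured (business vs. residential).
s31, s14, s3, s2 = math.sqrt(31), math.sqrt(14), math.sqrt(3), math.sqrt(2)

# U: one row per customer (this is the part that would live on disk).
U = [
    [1 / s31, 0.0], [2 / s31, 0.0], [1 / s31, 0.0], [5 / s31, 0.0],
    [0.0, 2 / s14], [0.0, 3 / s14], [0.0, 1 / s14],
]
# Singular values and V: small enough to keep in memory.
LAM = [math.sqrt(93), math.sqrt(28)]
V = [
    [1 / s3, 0.0], [1 / s3, 0.0], [1 / s3, 0.0],
    [0.0, 1 / s2], [0.0, 1 / s2],
]

def reconstruct(i, j, k=2):
    """x_hat[i][j] = sum over the first k patterns of lam[m] * u[i][m] * v[j][m]."""
    u_row = U[i]  # stands in for the single disk access
    return sum(LAM[m] * u_row[m] * V[j][m] for m in range(k))

print(round(reconstruct(2, 1), 6))  # GHI Inc. on Th 7/11
```

The loop runs over k terms only, so the cost does not depend on the number of customers or days.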


Algorithm of SVDD

Singular Value Decomposition with Deltas
Maintain a set of triples of the form (row, column, delta).
Delta is the difference between the actual value and the value that SVD reconstructs.
The deltas clean up gross errors.


Algorithm of SVDD …

Data structure of SVDD
  U
  k_opt eigenvalues
  V
  Additionally, store k_opt triples of the form (row, column, delta)
Reconstruction
  One disk access to fetch the i-th row of U
  One disk access to fetch the delta (using a hash table)
Tradeoff: store outlier data
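
A minimal sketch of the delta idea on the Table 1 matrix, assuming a rank-1 approximation (k = 1) and a hypothetical budget of D = 4 stored deltas. A Python dict plays the role of the hash table; selecting the deltas by largest absolute error is an assumption about how outliers are chosen, and all names are illustrative.

```python
import math

X = [
    [1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1],
]
# Rank-1 (k = 1) SVD factors of X in closed form: only the "business" pattern.
u = [c / math.sqrt(31) for c in (1, 2, 1, 5, 0, 0, 0)]
lam = math.sqrt(93)
v = [c / math.sqrt(3) for c in (1, 1, 1, 0, 0)]

def svd_value(i, j):
    return lam * u[i] * v[j]

# Collect all (row, column, delta) triples and keep the D largest in magnitude.
D = 4  # hypothetical budget for stored deltas
triples = [(i, j, X[i][j] - svd_value(i, j))
           for i in range(len(X)) for j in range(len(X[0]))]
triples.sort(key=lambda t: abs(t[2]), reverse=True)
deltas = {(i, j): d for i, j, d in triples[:D]}  # hash table keyed by cell

def svdd_value(i, j):
    """SVD reconstruction plus the stored delta, if the cell has one."""
    return svd_value(i, j) + deltas.get((i, j), 0.0)

cells = [(i, j) for i in range(len(X)) for j in range(len(X[0]))]
worst_svd = max(abs(X[i][j] - svd_value(i, j)) for i, j in cells)
worst_svdd = max(abs(X[i][j] - svdd_value(i, j)) for i, j in cells)
print(worst_svd, worst_svdd)
```

With one pattern, plain SVD misses the residential customers entirely; the four stored deltas repair the largest of those cells and the worst-case error drops accordingly.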


Experiment

Two types of queries
  Specific data element query
  Aggregation query
Two datasets
  Phone100k (0.2 Gigabytes)
  Stocks (341 Kbytes)
Error measurement method: RMSPE
Four compression methods
  Hierarchical clustering: (b*k*M + N*b) bytes for k clusters
  DCT: (N*k*b) bytes for k coefficients
  SVD: (N*k + k + k*M) bytes for k principal components
  SVDD: (N*k + k + k*M + D*O(b)) bytes for k principal components
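
The storage formulas can be turned into a small calculator for comparing methods at a given k. Two things here are assumptions rather than statements from the slides: every stored SVD number is taken to occupy b bytes, and the O(b) cost of one delta triple is represented by a hypothetical `delta_bytes` parameter. The sizes used in the example are invented.

```python
def clustering_bytes(N, M, k, b):
    # k representatives of M values each, plus one cluster id per row.
    return b * k * M + N * b

def dct_bytes(N, k, b):
    # k coefficients kept per row.
    return N * k * b

def svd_bytes(N, M, k, b):
    # N x k matrix U, k singular values, k x M matrix V^t.
    # Assumption: every stored number takes b bytes.
    return (N * k + k + k * M) * b

def svdd_bytes(N, M, k, b, D, delta_bytes):
    # SVD storage plus D delta triples; delta_bytes is a hypothetical
    # per-triple size standing in for the O(b) term.
    return svd_bytes(N, M, k, b) + D * delta_bytes

def compression_ratio(N, M, b, compressed):
    # Size of the raw N x M matrix divided by the compressed size.
    return (N * M * b) / compressed

# Hypothetical sizes, not the datasets from the slides.
N, M, b, k = 100_000, 100, 4, 10
print(round(compression_ratio(N, M, b, svd_bytes(N, M, k, b)), 2))
```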


Output comparison of four methods

SVDD did best.
  k_opt = k_max
DCT didn’t do well. It did better on “stocks” than on “phone2000”.
Plain SVD and clustering were close to each other.
SVDD gives a satisfactory result: 2% error (ER) at a 10:1 compression ratio (CR), and 10% ER at 50:1 CR.


Errors of SVD and SVDD

Worst case error in SVD is very large.
SVDD bounds the error pretty well.


Observation of errors

Steep initial drop in error.
Most matrix cells have an error substantially less than the mean error (RMSPE).
SVDD gets rid of the worst-case cell errors and gives a close approximation.


Error for aggregate queries (SVDD)

Normalized query error Qerr.
50 queries and approximately 10% of the data cells included.
The error was well under 0.5% even with 50:1 CR.
Estimates of answers to aggregate queries can be obtained through sampling.


Scale-up (SVDD)

Error is around 2% at the 10:1 CR.
The graphs are homogeneous.


Scale-up (SVD vs. SVDD)

Error of SVD increases with dataset size.
Error of SVDD remains constant with dataset size.


Conclusion

Lossy compression problem and its solutions
  Signal processing
  Pattern recognition
  Information retrieval (clustering)
  Matrix algebra (SVD)
SVD algorithm
SVDD properties
  Excellent compression rate and satisfactory results
  Bounds the worst-case error of individual data values pretty well
  Only three passes over the dataset
  Dimensionality reduction of the given dataset
  Arbitrary vectors can be handled without additional effort