High-dimensional Indexing based on Dimensionality Reduction
High-dimensional Indexing based on Dimensionality Reduction
Students: Qing Chen, Heng Tao Shen, Sun Ji Chun
Advisor: Professor Beng Chin Ooi
Outline
Introduction
Global Dimensionality Reduction
Local Dimensionality Reduction
Indexing Reduced-Dim Space
Effects of Dimensionality Reduction
Behaviors of Distance Metrics
Conclusion and Future Work
Introduction
High-dim applications: multimedia, time-series, scientific, market basket, etc.
Various trees proposed: R-tree, R*, R+, X, Skd, SS, M, KDB, TV, Buddy, Grid File, Hybrid, iDistance, etc.
Dimensionality curse: efficiency drops quickly as dimensionality increases.
Introduction
Dimensionality reduction techniques: GDR and LDR
High-dim indexing on RDS: existing indexing on a single RDS; global indexing on multiple RDS
Side effects of DR
Different behaviors of distance metrics
Conclusion and future work
GDR
Perform reduction on the whole dataset.
GDR
Improve query accuracy by performing principal components analysis (PCA).
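As an illustrative sketch of this step (not the authors' code), a global PCA reduction can be written as:

```python
import numpy as np

def gdr_pca(data, target_dim):
    """Global dimensionality reduction: project the whole dataset
    onto its top `target_dim` principal components."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :target_dim]  # largest-variance directions first
    return centered @ components
```

In a real system the `components` matrix and the mean would also be kept, so that queries can be mapped into the same reduced space.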
GDR
Using aggregate data for reduction in dynamic spaces [8].
GDR
Works for globally correlated data, but may cause significant information loss on real data.
LDR [5]
Find locally correlated data clusters and perform dimensionality reduction on each cluster individually.
LDR - Definitions
Cluster and subspace
Reconstruction distance
LDR - Constraints on a cluster
Reconstruction distance bound, i.e. MaxReconDist
Dimensionality bound, i.e. MaxDim
Size bound, i.e. MinSize
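The reconstruction distance of a point with respect to a cluster's subspace can be sketched as follows (an illustrative reading of the definition; `basis` is assumed to hold orthonormal principal directions as columns):

```python
import numpy as np

def recon_dist(point, centroid, basis):
    """Distance between `point` and its projection onto the cluster
    subspace spanned by the orthonormal columns of `basis`."""
    centered = point - centroid
    projected = basis @ (basis.T @ centered)  # projection onto the subspace
    return float(np.linalg.norm(centered - projected))
```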
LDR - Clustering Algo
Construct spatial clusters:
Determine the maximum number of clusters, M
Determine the cluster range, e
Choose a set of well-scattered points as the centroids C of the spatial clusters
Assign each point P to its closest centroid whenever Distance(P, Cclosest) <= e
Update the centroids of the clusters
LDR - Clustering Algo (cont)
Compute principal components (PC):
Perform PCA on each cluster individually
Compute the mean Ei of each cluster's points
Determine subspace dimensionality:
Progressively check each point against MaxReconDist and MaxDim
Decide the optimal dimensionality for each cluster
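A condensed sketch of the dimensionality-selection step is below; the threshold fraction `frac` is an assumption for illustration, and [5] gives the exact rule:

```python
import numpy as np

def choose_subspace_dim(cluster, max_recon_dist, max_dim, frac=0.9):
    """Return the smallest subspace dimensionality for which at least
    a fraction `frac` of the cluster's points reconstruct within
    max_recon_dist, subject to the MaxDim bound; None otherwise."""
    centered = cluster - cluster.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows = principal directions
    for dim in range(1, max_dim + 1):
        basis = vt[:dim]
        residual = centered - centered @ basis.T @ basis
        errs = np.linalg.norm(residual, axis=1)  # per-point reconstruction distance
        if np.mean(errs <= max_recon_dist) >= frac:
            return dim
    return None
```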
LDR - Clustering Algo (cont)
Recluster points: insert each point into a suitable cluster, i.e. one with ReconDist(P, S) <= MaxReconDist, or into the outlier set O.
LDR - Clustering Algo (cont)
Finally, apply the size bound: eliminate clusters whose population is too small, and redistribute their points to other clusters or to the outlier set O.
LDR - Compared to GDR
LDR improves retrieval efficiency and effectiveness by capturing more detail of the local data, but it incurs a higher computational cost during the reduction step.
LDR
LDR cannot discover all the possible correlated clusters.
Indexing RDS
GDR: one RDS only; apply an existing multi-dim indexing structure, e.g. R-tree, M-tree, etc.
LDR: several RDS in different axis systems; requires a global indexing structure.
Global Indexing
Each RDS corresponds to one tree.
Side Effects of DR
Information loss -> lower precision. Possible improvement?
In the text domain, DR -> qualitative improvement.
Least information loss -> highest precision -> highest qualitative improvement.
Side Effects of DR
Latent Semantic Indexing (LSI) [9,10,11]: decompose the term-document matrix X into U and V.
X^T X: similarity for documents.
X X^T: similarity and correlation for terms.
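A toy illustration of LSI via SVD (the term-document matrix below is made up), keeping k latent concepts:

```python
import numpy as np

# Toy term-document matrix X: rows = terms, columns = documents (arbitrary counts).
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 1.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
k = 2                                             # number of latent concepts kept
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]           # rank-k approximation of X

doc_sim = Xk.T @ Xk    # X^T X: document-document similarities
term_cor = Xk @ Xk.T   # X X^T: term-term similarities and correlations
```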
Side Effects of DR
DR effectively improves the data representation by capturing the data in terms of concepts rather than words.
Keeping the directions with the greatest variance exploits the semantic aspects of the data.
Side Effects of DR
Dependency among attributes results in poor measurements if L-norm metrics are used.
Dimensions with the largest eigenvalues = highest quality [2].
So what else do we have to consider? Inter-correlations.
Mahalanobis Distance: d(x, y) = sqrt((x - y)^T Sigma^-1 (x - y)), where Sigma is the data covariance matrix.
Normalized Mahalanobis Distance
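The two distances can be sketched as follows; the sqrt(d) normalization is an assumed convention, since the slide's exact formula is not in the transcript:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """sqrt((x - y)^T Sigma^{-1} (x - y)): L2 distance rescaled by the
    covariance, so strongly correlated directions count for less."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def normalized_mahalanobis(x, y, cov):
    """One common normalization (assumed here): divide by sqrt(d) so the
    value is comparable across dimensionalities."""
    return mahalanobis(x, y, cov) / np.sqrt(len(x))
```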
Mahalanobis vs. L-norm
Takes local shape into consideration by computing variances and covariances.
Tends to group points into elliptical clusters, which define a multi-dim space whose boundaries determine the range of degrees of correlation suitable for dim reduction.
Defines the standard-deviation boundary of the cluster.
Incremental Ellipse
Aims to discover all the possible correlated clusters, of different sizes, densities and elongations.
Behaviors of Distance Metrics in High-Dim Space
Is KNN meaningful in high-dim space? [1]
The ratio of farthest-neighbor to nearest-neighbor distance approaches 1 -> poor discrimination [4].
One criterion is the relative contrast: (Dmax_d^k - Dmin_d^k) / Dmin_d^k, where Dmax_d^k and Dmin_d^k are the farthest and nearest distances from the query under the L_k norm in d dimensions.
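The relative contrast can be computed directly (illustrative sketch):

```python
import numpy as np

def relative_contrast(points, query, k):
    """(Dmax - Dmin) / Dmin over the dataset under the L_k norm
    (the query is assumed not to coincide with any point, so Dmin > 0)."""
    dists = np.sum(np.abs(points - query) ** k, axis=1) ** (1.0 / k)
    return float((dists.max() - dists.min()) / dists.min())
```

On uniformly distributed data this value shrinks as the dimensionality grows, and it shrinks faster for larger k, which is what the following slides plot.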
Behaviors of Distance Metrics in High-Dim Space
(Plot: Dmax_d^k - Dmin_d^k on different dimensionalities for different metrics.)
Behaviors of Distance Metrics in High-Dim Space
Relative contrast on L-norm metrics.
Behaviors of Distance Metrics in High-Dim Space
For higher dimensionalities, the relative contrast provided by a norm with a smaller parameter is more likely to dominate that of a norm with a larger parameter.
So an L-norm metric with a smaller parameter k is a better choice for KNN search in high-dim space.
Conclusion
Two dimensionality reduction methods: GDR and LDR
Indexing methods: existing structures; a global indexing structure
Side effects of DR: qualitative improvement; both intra-variance and inter-variance matter
Different behaviors for different metrics: a smaller k achieves higher quality
Future Work
Propose a new tree for real high-dimensional indexing without reduction, for datasets without correlations? (Beneath iDistance, further prune the search sphere using the LB-tree?)
Reduce the dimensionality of data points that are combinations of multiple features, such as images (shape, color, text, etc.).
References
[1] Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim. On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. ICDT 2001: 420-434.
[2] Charu C. Aggarwal. On the Effects of Dimensionality Reduction on High Dimensional Similarity Search. PODS 2001.
[3] Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim. What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000: 506-515.
[4] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft. When Is "Nearest Neighbor" Meaningful? ICDT 1999.
[5] K. Chakrabarti, S. Mehrotra. Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB 2000: 89-100.
[6] R. Weber, H.-J. Schek, S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205.
[7] C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the Distance: An Efficient Method to KNN Processing. VLDB 2001.
[8] K. V. R. Kanth, D. Agrawal, A. K. Singh. Dimensionality Reduction for Similarity Searching in Dynamic Databases. SIGMOD 1998.
[9] Jon M. Kleinberg, Andrew Tomkins. Applications of Linear Algebra in Information Retrieval and Hypertext Analysis. PODS 1999: 185-193.
[10] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala. Latent Semantic Indexing: A Probabilistic Analysis. PODS 1998: 159-168.
[11] Chris H. Q. Ding. A Similarity-Based Probability Model for Latent Semantic Indexing. SIGIR 1999: 59-65.