High-dimensional Indexing based on Dimensionality Reduction
High-dimensional Indexing based on Dimensionality Reduction
Students: Qing Chen, Heng Tao Shen, Sun Ji Chun
Advisor: Professor Beng Chin Ooi
Outline
Introduction
Global Dimensionality Reduction
Local Dimensionality Reduction
Indexing Reduced-Dim Space
Effects of Dimensionality Reduction
Behaviors of Distance Metrics
Conclusion and Future Work
Introduction
High-dim applications: multimedia, time-series, scientific, market basket, etc.
Various trees proposed: R-tree, R*, R+, X, Skd, SS, M, KDB, TV, Buddy, Grid File, Hybrid, iDistance, etc.
Dimensionality curse: efficiency drops quickly as dimensionality increases.
Introduction
Dimensionality reduction techniques: GDR and LDR
High-dim indexing on RDS: existing indexing on a single RDS; global indexing on multiple RDS
Side effects of DR
Different behaviors of distance metrics
Conclusion and future work
GDR
Perform reduction on the whole dataset.
GDR
Improve query accuracy by performing principal components analysis (PCA).
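As an illustrative sketch of this step (not the authors' code), a global PCA reduction can be written as:

```python
import numpy as np

def gdr_pca(data, target_dim):
    """Global dimensionality reduction: project the whole dataset
    onto its top `target_dim` principal components."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :target_dim]  # largest-variance directions first
    return centered @ components
```

In a real system the `components` matrix and the mean would also be kept, so that queries can be mapped into the same reduced space.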
GDR
Using aggregate data for reduction in dynamic spaces [8].
GDR
Works for globally correlated data, but may cause significant information loss on real data.
LDR [5]
Find locally correlated data clusters and perform dimensionality reduction on each cluster individually.
LDR - Definitions
Cluster and subspace
Reconstruction distance
LDR - Constraints on a cluster
Reconstruction distance bound, i.e. MaxReconDist
Dimensionality bound, i.e. MaxDim
Size bound, i.e. MinSize
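The reconstruction distance of a point with respect to a cluster's subspace can be sketched as follows (an illustrative reading of the definition; `basis` is assumed to hold orthonormal principal directions as columns):

```python
import numpy as np

def recon_dist(point, centroid, basis):
    """Distance between `point` and its projection onto the cluster
    subspace spanned by the orthonormal columns of `basis`."""
    centered = point - centroid
    projected = basis @ (basis.T @ centered)  # projection onto the subspace
    return float(np.linalg.norm(centered - projected))
```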
LDR - Clustering Algo
Construct spatial clusters:
Determine the maximum number of clusters, M
Determine the cluster range, e
Choose a set of well-scattered points as the centroids C of the spatial clusters
Assign each point P to its closest centroid whenever Distance(P, Cclosest) <= e
Update the centroids of the clusters
LDR - Clustering Algo (cont)
Compute principal components (PC):
Perform PCA on each cluster individually
Compute the mean Ei of each cluster's points
Determine subspace dimensionality:
Progressively check each point against MaxReconDist and MaxDim
Decide the optimal dimensionality for each cluster
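A condensed sketch of the dimensionality-selection step is below; the threshold fraction `frac` is an assumption for illustration, and [5] gives the exact rule:

```python
import numpy as np

def choose_subspace_dim(cluster, max_recon_dist, max_dim, frac=0.9):
    """Return the smallest subspace dimensionality for which at least
    a fraction `frac` of the cluster's points reconstruct within
    max_recon_dist, subject to the MaxDim bound; None otherwise."""
    centered = cluster - cluster.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows = principal directions
    for dim in range(1, max_dim + 1):
        basis = vt[:dim]
        residual = centered - centered @ basis.T @ basis
        errs = np.linalg.norm(residual, axis=1)  # per-point reconstruction distance
        if np.mean(errs <= max_recon_dist) >= frac:
            return dim
    return None
```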
LDR - Clustering Algo (cont)
Recluster points: insert each point into a suitable cluster, i.e. one with ReconDist(P, S) <= MaxReconDist, or into the outlier set O.
LDR - Clustering Algo (cont)
Finally, apply the size bound: eliminate clusters whose population is too small, and redistribute their points to other clusters or to the outlier set O.
LDR - Compared to GDR
LDR improves retrieval efficiency and effectiveness by capturing more detail of the local data, but it incurs a higher computational cost during the reduction step.
LDR
LDR cannot discover all the possible correlated clusters.
Indexing RDS
GDR: one RDS only; apply an existing multi-dim indexing structure, e.g. R-tree, M-tree, etc.
LDR: several RDS in different axis systems; requires a global indexing structure.
Global Indexing
Each RDS corresponds to one tree.
Side Effects of DR
Information loss -> lower precision. Possible improvement?
In the text domain, DR -> qualitative improvement.
Least information loss -> highest precision -> highest qualitative improvement.
Side Effects of DR
Latent Semantic Indexing (LSI) [9,10,11]: decompose the term-document matrix X into U and V.
X^T X: similarity for documents.
X X^T: similarity and correlation for terms.
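A toy illustration of LSI via SVD (the term-document matrix below is made up), keeping k latent concepts:

```python
import numpy as np

# Toy term-document matrix X: rows = terms, columns = documents (arbitrary counts).
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 1.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
k = 2                                             # number of latent concepts kept
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]           # rank-k approximation of X

doc_sim = Xk.T @ Xk    # X^T X: document-document similarities
term_cor = Xk @ Xk.T   # X X^T: term-term similarities and correlations
```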
Side Effects of DR
DR effectively improves the data representation by capturing the data in terms of concepts rather than words.
Keeping the directions with the greatest variance exploits the semantic aspects of the data.
Side Effects of DR
Dependency among attributes results in poor measurements if L-norm metrics are used.
Dimensions with the largest eigenvalues = highest quality [2].
So what else do we have to consider? Inter-correlations.
Mahalanobis Distance: d(x, y) = sqrt((x - y)^T Sigma^-1 (x - y)), where Sigma is the data covariance matrix.
Normalized Mahalanobis Distance
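The two distances can be sketched as follows; the sqrt(d) normalization is an assumed convention, since the slide's exact formula is not in the transcript:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """sqrt((x - y)^T Sigma^{-1} (x - y)): L2 distance rescaled by the
    covariance, so strongly correlated directions count for less."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def normalized_mahalanobis(x, y, cov):
    """One common normalization (assumed here): divide by sqrt(d) so the
    value is comparable across dimensionalities."""
    return mahalanobis(x, y, cov) / np.sqrt(len(x))
```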
Mahalanobis vs. L-norm
Takes local shape into consideration by computing variances and covariances.
Tends to group points into elliptical clusters, which define a multi-dim space whose boundaries determine the range of degrees of correlation suitable for dim reduction.
Defines the standard-deviation boundary of the cluster.
Incremental Ellipse
Aims to discover all the possible correlated clusters, of different sizes, densities and elongations.
Behaviors of Distance Metrics in High-Dim Space
Is KNN meaningful in high-dim space? [1]
The ratio of farthest-neighbor to nearest-neighbor distance approaches 1 -> poor discrimination [4].
One criterion is the relative contrast: (Dmax_d^k - Dmin_d^k) / Dmin_d^k, where Dmax_d^k and Dmin_d^k are the farthest and nearest distances from the query under the L_k norm in d dimensions.
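The relative contrast can be computed directly (illustrative sketch):

```python
import numpy as np

def relative_contrast(points, query, k):
    """(Dmax - Dmin) / Dmin over the dataset under the L_k norm
    (the query is assumed not to coincide with any point, so Dmin > 0)."""
    dists = np.sum(np.abs(points - query) ** k, axis=1) ** (1.0 / k)
    return float((dists.max() - dists.min()) / dists.min())
```

On uniformly distributed data this value shrinks as the dimensionality grows, and it shrinks faster for larger k, which is what the following slides plot.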
Behaviors of Distance Metrics in High-Dim Space
(Plot: Dmax_d^k - Dmin_d^k on different dimensionalities for different metrics.)
Behaviors of Distance Metrics in High-Dim Space
Relative contrast on L-norm metrics.
Behaviors of Distance Metrics in High-Dim Space
For higher dimensionalities, the relative contrast provided by a norm with a smaller parameter is more likely to dominate that of a norm with a larger parameter.
So an L-norm metric with a smaller parameter k is a better choice for KNN search in high-dim space.
Conclusion
Two dimensionality reduction methods: GDR and LDR
Indexing methods: existing structures; a global indexing structure
Side effects of DR: qualitative improvement; both intra-variance and inter-variance matter
Different behaviors for different metrics: a smaller k achieves higher quality
Future Work
Propose a new tree for real high-dimensional indexing without reduction, for datasets without correlations? (Beneath iDistance, further prune the search sphere using the LB-tree?)
Reduce the dimensionality of data points that are combinations of multiple features, such as images (shape, color, text, etc.).
References
[1] Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim. On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. ICDT 2001: 420-434.
[2] Charu C. Aggarwal. On the Effects of Dimensionality Reduction on High Dimensional Similarity Search. PODS 2001.
[3] Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim. What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000: 506-515.
[4] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft. When Is "Nearest Neighbor" Meaningful? ICDT 1999.
[5] K. Chakrabarti, S. Mehrotra. Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB 2000: 89-100.
[6] R. Weber, H.-J. Schek, S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205.
[7] C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the Distance: An Efficient Method to KNN Processing. VLDB 2001.
[8] K. V. R. Kanth, D. Agrawal, A. K. Singh. Dimensionality Reduction for Similarity Searching in Dynamic Databases. SIGMOD 1998.
[9] Jon M. Kleinberg, Andrew Tomkins. Applications of Linear Algebra in Information Retrieval and Hypertext Analysis. PODS 1999: 185-193.
[10] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala. Latent Semantic Indexing: A Probabilistic Analysis. PODS 1998: 159-168.
[11] Chris H. Q. Ding. A Similarity-Based Probability Model for Latent Semantic Indexing. SIGIR 1999: 59-65.