When Is Nearest Neighbors Indexable?
Uri Shaft (Oracle Corp.)
Raghu Ramakrishnan (UW-Madison)
Motivation – Scalability Experiments
• Dozens of papers describe experiments about index scalability with increased dimensions.
  – Constants are:
    • Number of data points
    • Data and query distribution
    • Index structure / search algorithm
  – Variable:
    • Number of dimensions
  – Measurement:
    • Performance of the index
Example From PODS 1997
Motivation
• In many cases, the conclusion is that the empirical evidence suggests the index structures do scale with dimensionality.
• We would like to investigate these claims mathematically – supply a proof of scalability or non-scalability.
Historical Context
• Continues the work done in “When Is ‘Nearest Neighbor’ Meaningful?” (ICDT 1999)
• The previous work was about the behavior of distance distributions.
• This work is about the behavior of indexing structures under similar conditions.
Contents
• Vanishing Variance property
• Convex Description index structures
• Indexing Theorem
  – The performance of CD indexes does not scale for VV workloads using Euclidean distances.
• Conclusion
• Future Work
Vanishing Variance
• Same definition as in the ICDT 1999 work (although it was not named there)
• In 1999 we showed that such workloads become meaningless – the ratios of the distances between the query and the various data points all become arbitrarily close to 1.
• We use the same result here.
Vanishing Variance
• A scalability experiment contains a series of workloads W1, W2, …, Wm, …
  – m is the number of dimensions
  – each workload Wm has n data points and a query point (same distribution)
  – the distance distribution is denoted Dm
• Vanishing Variance (a numerical sketch follows below):
  lim_{m→∞} var( Dm / E(Dm) ) = 0
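As a rough numerical illustration (not from the original talk), the sketch below estimates var(Dm / E(Dm)) for an assumed workload of i.i.d. uniform points under the Euclidean distance; the estimate shrinks toward 0 as m grows, which is the Vanishing Variance behavior. All names and parameters are illustrative.

    # Illustration only: estimate var(D_m / E(D_m)) for i.i.d. uniform data
    # under the Euclidean distance and watch it shrink as dimensionality grows.
    import numpy as np

    rng = np.random.default_rng(0)

    def normalized_distance_variance(m, n_samples=20_000):
        """Estimate var(D_m / E(D_m)), where D_m is the Euclidean distance
        between two independent points drawn uniformly from [0, 1]^m."""
        data = rng.random((n_samples, m))
        queries = rng.random((n_samples, m))
        d = np.linalg.norm(data - queries, axis=1)  # samples of D_m
        return np.var(d / d.mean())

    for m in (1, 2, 4, 16, 64, 256):
        print(f"m = {m:4d}   var(Dm / E(Dm)) ~ {normalized_distance_variance(m):.5f}")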
Convex Description Index
• Data points are distributed into buckets (e.g., disk pages). Access to a bucket is “all or nothing”. We allow redundancy. A bucket contains at least two data points.
• Each bucket is associated with a description – a convex region containing all data points in the bucket.
• The search algorithm accesses at least all buckets whose convex region is closer to the query than the nearest neighbor.
• The cost of a search is the number of data points retrieved.
Example: R-Tree
• Buckets are disk pages. Under normal construction, buckets contain more than two data points each.
• Bucket descriptions are convex and contain all data points (bounding rectangles).
• The search algorithm accesses all buckets whose convex region is closer to the query than the nearest neighbor (and probably a few more); a cost-model sketch follows below.
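To make the bucket-access rule concrete, here is a minimal sketch (not the authors' implementation) of the Convex Description cost model, using bounding rectangles (as at the leaf level of an R-Tree) as the convex descriptions and the usual MINDIST as the query-to-region distance. The bucket construction, packing points after sorting on the first coordinate, is an arbitrary illustrative choice.

    # Sketch of the Convex Description cost model with bounding-rectangle
    # buckets (R-Tree style leaves). Not the authors' code; the bucket
    # construction (sort on the first coordinate, then pack) is illustrative.
    import numpy as np

    def make_buckets(data, bucket_size):
        """Pack points into buckets of `bucket_size` after sorting on coordinate 0."""
        order = np.argsort(data[:, 0])
        return [data[order[i:i + bucket_size]]
                for i in range(0, len(data), bucket_size)]

    def mindist(query, bucket):
        """Euclidean distance from `query` to the bounding rectangle of `bucket`
        (zero if the query lies inside the rectangle)."""
        lo, hi = bucket.min(axis=0), bucket.max(axis=0)
        gap = np.maximum(np.maximum(lo - query, query - hi), 0.0)
        return np.linalg.norm(gap)

    def cd_search_cost(query, buckets):
        """Data points a CD index must retrieve: every bucket whose convex
        region is at least as close to the query as the true nearest neighbor."""
        nn_dist = min(np.linalg.norm(b - query, axis=1).min() for b in buckets)
        return sum(len(b) for b in buckets if mindist(query, b) <= nn_dist)

    # Tiny usage example on uniform data in 20 dimensions.
    rng = np.random.default_rng(1)
    data = rng.random((5_000, 20))
    query = rng.random(20)
    print("points retrieved:", cd_search_cost(query, make_buckets(data, 50)), "of", len(data))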
Convex Description Indexes
• All R-Tree variants
• X-Tree
• M-Tree
• kdb-Tree
• SS-Tree and SR-Tree
• Many more
Other indexes (non-CD)
• Probability structures (P-Tree, VLDB 2000)
  – Access based on clusters. A near enough bucket may not be accessed.
• Projection index (like the VA-file)
  – Compression structures.
  – All data points accessed in pieces, not in buckets.
Indexing Theorem
• If:
  – the scalability experiment uses a series of workloads with Vanishing Variance
  – the distance metric is Euclidean
  – the indexing structure is a Convex Description index
• Then:
  – the expected cost of a query converges to the number of data points
  – i.e., a linear scan of the data (a small simulation follows below)
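The theorem's prediction can be sanity-checked with a small simulation (an illustration only, assuming i.i.d. uniform data and the rectangle-bucket construction sketched earlier): the fraction of data points a CD index must retrieve climbs toward 1, i.e., toward a linear scan, as the dimensionality grows.

    # Illustration only (i.i.d. uniform data, Euclidean metric, bounding-
    # rectangle buckets): the fraction of data a CD index must retrieve
    # approaches 1 (a full linear scan) as the dimensionality m grows.
    import numpy as np

    rng = np.random.default_rng(2)

    def retrieved_fraction(m, n=5_000, bucket_size=50):
        data = rng.random((n, m))
        query = rng.random(m)
        order = np.argsort(data[:, 0])
        buckets = [data[order[i:i + bucket_size]] for i in range(0, n, bucket_size)]
        nn_dist = min(np.linalg.norm(b - query, axis=1).min() for b in buckets)

        def mindist(b):  # query-to-bounding-rectangle distance
            lo, hi = b.min(axis=0), b.max(axis=0)
            return np.linalg.norm(np.maximum(np.maximum(lo - query, query - hi), 0.0))

        return sum(len(b) for b in buckets if mindist(b) <= nn_dist) / n

    for m in (2, 4, 8, 16, 32, 64):
        print(f"m = {m:3d}   fraction of data retrieved ~ {retrieved_fraction(m):.2f}")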
Sketch of Proof
• Because of Vanishing Variance, the ratios of the distances between the query and the various data points become arbitrarily close to 1.
• When using the Euclidean distance, look at an arbitrary data bucket and a query point, choose two data points from the bucket, and form a triangle:
[Figure: query point Q, a bucket containing data points D1 and D2, and a point Y between D1 and D2]
• The distances from Q to D1, D2, …, Dn are about the same.
• The distance from Q to Y is much smaller.
• Therefore, the distance from Q to the data bucket is less than the distance to the nearest neighbor (a numeric check follows below).
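The geometric step can also be checked numerically (an illustration under an assumed i.i.d. uniform workload, not the paper's proof): when D1 and D2 are each at roughly the same Euclidean distance from Q, their midpoint Y, which lies inside every convex region containing both points, is measurably closer to Q; combined with vanishing variance this pushes the bucket's region below the nearest-neighbor distance.

    # Numeric check of the proof step (illustration only): in high dimensions
    # the midpoint of two data points is noticeably closer to the query than
    # either point, so a convex bucket region dips below the NN distance.
    import numpy as np

    rng = np.random.default_rng(3)
    m = 1_000                          # dimensionality
    q, d1, d2 = rng.random((3, m))     # query point and two bucket points

    y = (d1 + d2) / 2.0                # midpoint: lies in any convex region
                                       # that contains both d1 and d2
    print("dist(Q, D1):", np.linalg.norm(q - d1))
    print("dist(Q, D2):", np.linalg.norm(q - d2))
    print("dist(Q, Y): ", np.linalg.norm(q - y))   # noticeably smaller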
Conclusion
• Dozens of papers describe experiments about index scalability with increased dimensions.
• We wanted to investigate these claims mathematically – supply a proof of scalability or non-scalability.
• We proved that the index structures in many of these experiments do not scale with dimensionality.
Conclusion
• Use this theorem to channel indexing research into more useful and practical avenues.
• Review previous results accordingly.
Future Work
• Remove the restriction of at least two data points per bucket.
  – An easy exercise; one needs to take into account the cost of traversing a hierarchical data structure.
• Investigate other Lp metrics.
• Investigate projection indexes using the Euclidean metric (it looks like they do not scale either).
• Find a scalable indexing structure for uniform data and the L∞ metric.
  – Hint: use compression.
• Find the number of data points needed for an R-Tree to be practical on uniform data with the L2 metric.
  – Approx:
    n ≈ F^(3m)
Questions