
Anytime K-Nearest Neighbor Search for Database Applications

Weijia Xu1, Daniel P. Miranker2, Rui Mao2, Smriti Ramakrishnan2

1Texas Advanced Computing Center, University of Texas at Austin
[email protected]

2Department of Computer Science, University of Texas at Austin
{miranker, rmao, smriti}@cs.utexas.edu

Abstract

Many contemporary database applications require similarity-based retrieval of complex objects where the only usable knowledge of the domain is a metric distance function. In support of these applications, we explore a search strategy for k-nearest neighbor searches with MVP-trees that greedily identifies k answers and then improves the answer set monotonically. The algorithm returns an approximate solution when terminated early, as determined by a limiting radius or an internal measure of progress; given unbounded time, it terminates with an exact solution. Approximate solutions to k-nearest neighbor search provide much needed speed improvement for hard nearest-neighbor problems. Our anytime approximate formulation is well suited to interactive search applications as well as applications where the distance function itself is an approximation. We evaluate the algorithm over a suite of workloads, including image retrieval, biological data, and high-dimensional vector data. Experimental results demonstrate the practical applicability of our approach.

1. Introduction

The use of Database Management Systems (DBMSs) as general-purpose analysis tools has grown tremendously in recent years. Concurrent with this growth comes the desire to explore similarity between items in a dataset, and new algorithms have arisen, including join, preference, and top-k selection queries [1-5]. The nearest neighbor query was first studied in detail as an efficient branch-and-bound traversal algorithm of the R-tree for geographical data [6]. As data volume and complexity increase, the efficiency of similarity-based data retrieval becomes an important issue. Early efforts to support such efficient retrieval were based on vector data [7-9]. However, many interesting data domains lack an obvious interpretation in Euclidean space and require a class of more general algorithms. These "black-box" algorithms assume that the similarity function is a metric distance function and that nothing else is known about the data [10-13]. For this type of data, the dimension can be quite large, and as it increases the effectiveness of k nearest neighbor queries decreases [14, 15]. As a result, successful applications have been limited to highly structured data, sometimes with tight query parameters [11, 16, 17]. If correctness can only be ensured by exhaustive search, the curse of dimensionality implies a large search cost [15, 18].

This challenge has stimulated a number of approaches for identifying approximate proximity solutions [10, 17, 19-21]. Many previous works are set in vector space, using synthetic vector data and feature vectors extracted from image data sets. Our work is motivated by the need to fulfill practical requirements identified in a suite of biological applications. For those applications, approximate search methods are applicable but raise questions not discussed in prior research. More specifically, the search behavior, e.g. the number of distance computations needed and the quality of the search result, varies greatly from query to query. It is also often difficult to evaluate the overall distance distribution of the search space, or to determine an optimal epsilon value or a fixed time budget in advance.

A common approximation is to retrieve k points that are within (1+ε) times the distance of the true kth result [19, 22, 23]. One approach is locality sensitive hashing, which utilizes a hashing function that is likely to map similar data points to the same hash bucket [21]. The search complexity is of the form O(n^(1/(1+ε))). When the goal is to identify the exact k nearest neighbors, performance converges to a linear scan. The original algorithm of this approach is based on vector space; generating effective hashing functions for data types that cannot be mapped to a coordinate system has only been achieved recently [24].

Many existing metric space index and search algorithms can be adapted to return approximate results [10, 12, 13, 18, 22, 23, 25-29]. A common strategy is to stretch the triangle inequality, which effectively shrinks the search radius to 1/(1+ε) times the distance of the kth result, within a depth-first traversal algorithm [18, 22, 28]. However, for small values of ε, the search process is essentially the same as an exact k nearest neighbor search. Another approach to yield approximations is to terminate the search process early [22, 23, 25]. Zezula et al. study the effect of early termination based on the representative distance distribution and on distance slowdown with the M-tree [22]. Bustos and Navarro present a probabilistic algorithm using the List of Clusters and the SAT [25, 27, 29]. In [25], the similarity search is limited by a fixed maximum number of distance computations, and a variety of criteria for prioritizing the search queues are discussed. Ciaccia and Patella propose a PAC nearest neighbor query that extends the (1+ε) k nearest neighbor search with a probability estimation based on the distance distribution, using the M-tree [23]. Recently, a new index structure and probabilistic search algorithm has been proposed based on sampled permutations [26].

Our approach can be viewed as an adaptation to MVP-trees of the incremental nearest neighbor search algorithm first applied to R-trees [30, 31]. The key algorithmic differences are detailed in section 2. First, many prior works on approximation methods are studied in conjunction with index structures based on compact partitions. We are interested in studying approximation methods using the MVP-tree, a preferred structure based on our previous results [32]. Since the hierarchy of data partitions in MVP-trees does not display the containment property of the covering radius, the best-first search heuristic also considers all pivots along the search path. Second, we develop and compare two different early termination conditions: limiting the search radius and limiting the number of nearest neighborhoods searched. Yianilos first proposed radius-limited nearest neighbor search; we extend his idea to return additional results found during the search process as approximate results [17]. Limiting the number of nearest neighborhoods searched has the benefit that termination is not subject to internal or non-deterministic properties of the search. In other words, external criteria such as a fixed time budget or the assessment of the convergence of an error term may incrementally schedule the expansion of the search. These latter properties, though not as readily analyzed as traditional approximation algorithms, are well suited to interactive searches (e.g. Internet page search) and to nearest neighbor search in coarse filtering processes.

Early termination is well-suited for interactive search applications. Often, a search finds k near neighbors and places them in a putative solution set; the search then continues only to confirm the putative solution set as the nearest neighbors [23]. Even though a putative solution set may be the correct solution, the time lag between finding results and confirming them can involve a large number of additional computations. In a black-box model, if the data is not partitionable, an exact k-nearest neighbor search degenerates to a linear scan of the entire database [11]. Shaft and Ramakrishnan formally proved that adversarial worst cases can be constructed where a k nearest neighbor search performs as poorly as a linear scan [15].

Approximate k-NN search is also valuable when the metric itself is approximate. For search problems involving complex data such as images or protein homology, data is often characterized with simplified heuristic models. Using these models, a small subset of the data can be quickly retrieved and piped into a subsequent algorithm for further inspection. In these cases, an early termination of the search yielding a close approximation of the exact results may be sufficient. In other cases, when k nearest neighbors are required, it may be sufficient to retrieve a few additional neighbors, k' = k + c, and use other criteria to choose the k best [34, 35]. The anytime k nearest neighbor search can also be useful in applications that require batch queries. An example is q-gram (substring of length q) based homologous sequence retrieval in bioinformatics [36, 37].

To verify the applicability and usefulness of our approach, we analyze search performance using generated vector data together with a mechanism to terminate the search early. We also demonstrate the advantages of this method in practical applications involving biological sequence data, mass spectra data, and image data.

2. Definitions and algorithm

2.1. MVP-tree construction

The multiple-vantage point tree (MVP-tree) is a preferred data structure in the black-box model [38]. Offline, a small set of data objects is selected as vantage points. The data is partitioned into equivalence classes according to their distances to the pivots, and the geometry of the set of partitions is recorded. Recursive application of the off-line construction yields a tree-based data structure. During retrieval, the distances among the query, the pivots, and the partitions are computed. Predicates comprising the triangle inequality, when satisfied, may safely eliminate partitions from further consideration, pruning branches of the MVP-tree. As a result, on-line retrieval can become effective. The MVP-tree is usually discussed within the context of range search. To assist k nearest neighbor search, we modify the internal node structure. In the original MVP-tree, the median values used to create each partition are stored in each internal node. In our implementation, we store the distance intervals between each partition and each pivot in each internal node (Figure 1). Thus, max(i, j) and min(i, j) are the maximum and minimum distances between pivot pi and the objects stored in partition j.

Figure 1 Internal node structure (top) and implemented internal node structure (bottom).
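For concreteness, the following is a minimal sketch of how the modified internal node described above might be represented. The field names (pivots, children, dist_min, dist_max, objects) are our own illustration, not the paper's implementation; dist_min[i][j] and dist_max[i][j] stand for min(i, j) and max(i, j).

from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class MVPNode:
    """One node of the modified MVP-tree: instead of the original median
    split values, each internal node stores, for every (pivot i,
    partition j) pair, the minimum and maximum distance from pivot i to
    any object stored in partition j."""
    pivots: List[Any]                    # vantage points selected offline
    children: List["MVPNode"]            # one child subtree per partition j
    dist_min: List[List[float]]          # dist_min[i][j] = min(i, j)
    dist_max: List[List[float]]          # dist_max[i][j] = max(i, j)
    objects: Optional[List[Any]] = None  # data objects, set only at leaves

    def is_leaf(self) -> bool:
        return self.objects is not None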

2.2. Proximity search strategy

For a range search query q with radius r, the recursive search of each partition j can be evaluated using the following predicates:

d(pi, q) - r > r.max(i, j)   (1)
d(pi, q) + r < r.min(i, j)   (2)

Partition j is pruned if either inequality is true for any pivot pi.
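As a sketch, predicates (1) and (2) transcribe directly into a pruning test, assuming the MVPNode layout above and a precomputed vector of query-to-pivot distances (the helper name can_prune_partition is ours):

def can_prune_partition(node, j, d_q_pivots, r):
    """Return True if partition j of node cannot contain any object within
    distance r of the query. d_q_pivots[i] is the precomputed d(pi, q)."""
    for i, d_pq in enumerate(d_q_pivots):
        if d_pq - r > node.dist_max[i][j]:   # predicate (1): query ball lies beyond the partition's outer shell
            return True
        if d_pq + r < node.dist_min[i][j]:   # predicate (2): query ball lies inside the partition's inner shell
            return True
    return False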

The k nearest neighbor search can be supported by always first visiting the node with the lowest lower bound estimation with respect to the query object q (Definition 1).

Definition 1 The lower distance bound between a given query object q and any node r in an MVP-tree, LBq,r, is the smallest possible distance from the query to any data object stored in the subtree rooted at r.

Lemma 1 The lower bound between query object q and an internal node j, which is the jth child of node r with v pivots [p1…pv], can be estimated as:

LBq,j = max i=1…v ( d(pi,q) - r.max(i,j), r.min(i,j) - d(pi,q), 0 )   (3)

Proof: The correctness of the lower bound estimation can be proved by contradiction. Assume there is an object m stored in the subtree rooted at r with d(m, q) < LBq,r. Since d(m, q) ≥ 0, this implies LBq,r > 0. Without loss of generality, assume the pivot pi and partition j satisfy the maximum condition defined by equation (3). Hence,

LBq,r = max( d(pi,q) - r.max(i,j), r.min(i,j) - d(pi,q) ).

Based on the definition of a metric space, the following two inequalities hold:

d(m,q) + d(m,pi) ≥ d(pi,q)
d(m,q) + d(pi,q) ≥ d(m,pi).

By the construction of the MVP-tree:

min(i,j) ≤ d(m,pi) ≤ max(i,j).

Hence

d(m,q) ≥ d(pi,q) - d(m,pi) ≥ d(pi,q) - r.max(i,j)
d(m,q) ≥ d(m,pi) - d(pi,q) ≥ r.min(i,j) - d(pi,q),

so d(m,q) ≥ max( d(pi,q) - r.max(i,j), r.min(i,j) - d(pi,q) ) = LBq,r, which contradicts the assumption d(m, q) < LBq,r. Therefore object m does not exist. ∎

Given an internal node, the lower bound estimation between the node and the query depends on the selection of vantage points at that node. It is possible for a node to report a smaller lower bound estimation than its parent does. Therefore, only the maximum lower bound estimation along the path is used: for an internal node r with parent node r.parent, the effective lower bound estimation is

LBq,r = max(LBq,r, LBq,r.parent)   (4)

The lower bound estimation of the root for any query is always zero.
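Equations (3) and (4) likewise admit a direct transcription; the sketch below assumes the MVPNode layout above, with parent_lb carrying the parent's already-maximized lower bound down the search path:

def lower_bound(node, j, d_q_pivots, parent_lb=0.0):
    """Equation (3) combined with equation (4): a lower bound on d(q, x)
    for any object x stored in partition j of node."""
    lb = 0.0
    for i, d_pq in enumerate(d_q_pivots):
        lb = max(lb,
                 d_pq - node.dist_max[i][j],   # query outside the shell of partition j around pivot i
                 node.dist_min[i][j] - d_pq)   # query inside the shell
    return max(lb, parent_lb)                  # equation (4): never report less than the parent's bound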

Figure 2 shows pseudo code for the anytime k nearest neighbor search. During the search process, the internal node with the lowest lower bound estimation is searched first. The algorithm is similar to the incremental nearest neighbor search algorithm [30, 31]. However, we implement a termination check before the search of each internal node (Lines 3-4) and use two priority queues. One priority queue manages all internal nodes to be searched, ordered by equation (4) (Lines 13-16); the other maintains a putative set of k answers (Lines 17-20). The k nearest results have been found once the distance between the query object and the kth result in the priority queue is no greater than the lower bound estimation of the node to be searched next. Additionally, the search terminates when either the termination condition c is met (Lines 3-4) or k results have been confirmed (Line 11). Lines 6-10 output the results that can be confirmed during the search process.

Figure 2 Pseudo code for the anytime k-NN search.

Anytime-k-nearest-neighbor-search (q, k, c)
1.  init priority queue node_list := root_node; result_list := empty; result_counter := 0; target_radius := max; kth_radius := max
2.  while not IsEmpty(node_list)
3.    if termination condition c is met
4.      return result_list
5.    node n ← head(node_list)
6.    if LBq,n > target_radius
7.      for each object p in result_list with d(p, q) ≤ target_radius
8.        output result_list.remove(p)
9.        result_counter++
10.     result_list.maxsize ← k - result_counter
11.     if result_counter ≥ k, stop search
12.     update target_radius ← LBq,n
13.   if n is an internal node with children c1…ck
14.     for each child node ci
15.       compute the lower bound between ci and q, LBq,ci
16.       if LBq,ci < kth_radius, queue(node_list, ci, LBq,ci)
17.   if n is a leaf node with data objects r1…rm
18.     for each data object ri with d(ri, q) < kth_radius
19.       queue(result_list, ri, d(ri, q))
20.     kth_radius ← result_list.last().distance
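The following is a minimal runnable sketch of the loop in Figure 2, assuming the MVPNode layout and lower_bound helper sketched earlier. The should_stop callback stands in for termination condition c and receives simple search statistics; the sketch omits the incremental output of Lines 6-10 and returns the putative answer set at the end, so it illustrates the control flow rather than reproducing the paper's implementation exactly.

import heapq

def anytime_knn(root, q, k, distance, should_stop=lambda stats: False):
    node_list = [(0.0, 0, root)]   # priority queue of (lower bound, tiebreak, node)
    results = []                   # max-heap via negated distance: (-d, tiebreak, object)
    tiebreak = 1
    stats = {"distance_computations": 0, "lb_increments": 0, "last_lb": 0.0}

    def kth_radius():              # distance of the current kth result (cf. Lines 18-20)
        return -results[0][0] if len(results) >= k else float("inf")

    while node_list:
        if should_stop(stats):     # Lines 3-4: termination condition c
            break
        lb, _, node = heapq.heappop(node_list)
        if lb >= kth_radius():     # all k answers are confirmed exact
            break
        if lb > stats["last_lb"]:  # entered the next-closest neighborhood
            stats["lb_increments"] += 1
            stats["last_lb"] = lb
        if node.is_leaf():         # Lines 17-20
            for obj in node.objects:
                d = distance(q, obj)
                stats["distance_computations"] += 1
                if d < kth_radius():
                    heapq.heappush(results, (-d, tiebreak, obj)); tiebreak += 1
                    if len(results) > k:
                        heapq.heappop(results)   # keep only the k best so far
        else:                      # Lines 13-16
            d_q_pivots = [distance(q, p) for p in node.pivots]
            stats["distance_computations"] += len(d_q_pivots)
            for j, child in enumerate(node.children):
                child_lb = lower_bound(node, j, d_q_pivots, parent_lb=lb)
                if child_lb < kth_radius():      # Line 16: prune hopeless subtrees
                    heapq.heappush(node_list, (child_lb, tiebreak, child)); tiebreak += 1
    return sorted(((-nd, obj) for nd, _, obj in results), key=lambda t: t[0])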


When the termination condition is not defined, the search proceeds to return the k nearest neighbors. In the next subsection, we detail our definitions of early termination conditions. It is trivial to show that the search process, run to completion, always returns the k closest results. Additionally, the search process has the following property.

Lemma 2 Using the same heuristic, no other search path is guaranteed to search fewer nodes than the search process that always searches the node with the smallest LBq,n first.

Proof: By contradiction. Let the search path used in the k nearest neighbor algorithm be p1. Assume there is another search path p2 using the same heuristic such that some node n in p1 is guaranteed to be excluded from p2. Then LBq,n must be greater than the distance between the kth closest object and the query, dk, i.e., LBq,n > dk. Equation (4) ensures that the smallest lower bound estimation monotonically increases throughout the search process starting at the root, so p1 stops at the first node m with LBq,m ≥ dk. If n is in p1, then LBq,n ≤ dk, which contradicts LBq,n > dk. Hence either node n does not exist in p1, or n cannot be excluded from p2 with certainty, and no other search path is guaranteed to expand fewer nodes. ∎

2.3. Early termination conditions

Early termination of the proximity search has been used to find approximate answers [25-27]. The termination condition can be defined arbitrarily: the time elapsed, the number of computations done, or other terms that assure search quality to varying degrees. Here, we evaluate two early termination strategies. One is extended from radius-limited search [17] and monitors the maximum distance of the neighborhood explored. The other monitors the number of closest neighborhoods explored. The two policies can be used alone or in combination.

Definition 2 Radius-limited k-nearest neighbor (RKNN) search Given a set of objects D, a query q, a number k, and a distance function d(a,b), the search Q(q, k, r) returns the set S of up to k closest objects whose distance to the query object q is less than or equal to r: S = {s | s ∈ D; d(s,q) ≤ r; d(s,q) ≤ d(t,q) for all t ∈ D, t ∉ S}.

The radius-limited search was first defined by Yianilos for nearest neighbor search, introducing a limiting radius to bound the search space [17]. A radius-limited k nearest neighbor search returns k or fewer qualified objects. If k objects are returned, the result is identical to that of a k-nearest-neighbor query. If fewer than k objects are returned, the result is identical to that of a range search whose radius is the limiting radius. Hence, the search cost is always the minimum of that of a normal range search of radius r and that of a k-nearest neighbor search.

By Definition 2, the search results are always exact answers of either a range search or a k nearest neighbor search. In our implementation, additional results can also be returned when the limiting radius is reached before the k nearest results are found. For example, if there are only i results (i < k) within radius r of query q, up to k-i additional results may be returned. In this case, the first i results are the i closest results to the query, and the remaining k-i results are an approximate set for the (i+1)th to kth closest results.
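Under the sketch in section 2.2, a radius-limited termination can be expressed as a should_stop predicate over the smallest lower bound seen so far: once it exceeds r, no unexamined object can lie within the limiting radius. The helper name radius_limit is ours:

def radius_limit(r):
    """Terminate once the search frontier lies beyond radius r."""
    return lambda stats: stats["last_lb"] > r

# e.g.: results = anytime_knn(root, q, k=100, distance=d, should_stop=radius_limit(5.0))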

The second early termination strategy (Definition 3) is a more aggressive approximation approach. During the search process, the lower bound is also the minimum distance expected between the query and the next answer. Incrementing the lower bound means the search space for answers within the previous lower bound has been exhausted, so the number of increments of the lower bound counts the number of nearest neighborhoods of the query explored. For example, if some results are found between the start of the search and the first lower bound increment, then all objects that are closest (at the 1st smallest distance) to the query have been found.

Definition 3 ith smallest distance k-NN During a k nearest neighbor search, the search can be terminated early when the ith closest neighborhood of the query has been explored.

Note that the ith smallest distance does not correspond to the number of results found, since multiple objects may have the same distance to the query object, and no results may be found between two increments of the lower bound estimation. At the time of termination, there may be j results found within the ith nearest neighborhood of the query. If j = k, the exact answer to the k nearest neighbor search is returned. Otherwise, an approximate answer is returned that includes all neighbors within the ith smallest distance to the query.

This early termination condition can also be viewed as a variation of using a limiting distance; however, the limiting distance is determined by search progress rather than being a predefined parameter. As shown later, the ith smallest distance k-NN search demonstrates interesting search behaviors for data where the distance values are discrete.
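In the same sketch, Definition 3 amounts to counting lower bound increments, which the stats dictionary of anytime_knn already tracks; again the helper name is ours:

def ith_smallest_distance(i):
    """Terminate once the ith closest neighborhood of the query has been
    explored, i.e. after the lower bound has increased i times (Definition 3)."""
    return lambda stats: stats["lb_increments"] >= i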

In contrast to the common notion of ε in approximate k nearest neighbor search [22], we quantitatively evaluate the quality of a putative answer set against the exact answer as follows.

Definition 4 Relative error Let S = {S1,…,Sk} be the ordered set of k objects returned by the search for query q, with d(q,Si) ≤ d(q,Si+1); let T = {T1,…,Tk} be


the ordered set of the true k closest objects to q among all the objects searched, with d(q,Ti) ≤ d(q,Ti+1); then the relative error of S, E(S,T), is:

E(S,T) = Σi=1…k ( d(Si,q) - d(Ti,q) ) / Σi=1…k d(Ti,q)

The relative error measures the overall difference between the returned k results and the true k nearest neighbors. We believe this relative measure is more meaningful than the ε measure, especially when k is large. In the subsequent experiment section, the practical applications call for values of k in the hundreds. A large k implies a large distance between the query and the kth neighbor, and the number of data points falling within 1 to 1+ε times the distance of the kth neighbor grows greatly as that distance increases.

Figure 3 The relative error of k nearest neighbor search monotonically decreases with increasing cost.
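Definition 4 transcribes directly; the sketch below assumes both sets are already ordered by distance to the query, as the definition requires:

def relative_error(S, T, d, q):
    """Relative error E(S, T): S is the ordered answer set returned by the
    search, T the ordered true k nearest neighbors. Zero iff the returned
    distances match the true ones; positive otherwise."""
    num = sum(d(q, s) - d(q, t) for s, t in zip(S, T))
    den = sum(d(q, t) for t in T)
    return num / den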

3. Experimental results

The synthetic data set contains randomly generated multi-dimensional vectors. The value of each dimension ranges from 0 to 20. Euclidean distance is computed as the distance between two vectors. The query set for vector data consists of 100 randomly generated vectors of the same dimension as the data set. The practical data sets are adapted from several published papers on metric distance functions.

The image dataset uses a metric distance function between two image objects that is a linear combination of three features of an image object, as described in [39]. Each image is manually labeled. The numbers of images per label are: 603 "Building", 362 "Bridge" (brg), 228 "Helicopters" (air), 351 "Automobile" (lnd), 134 "Ship" (shp), 811 "Bird" (brd), 1134 "Bug" (bug), 1957 "Mammals" (mam), 501 "Sea creatures" (sea), 1161 "Flower" (flw), 2133 "Landscape" (lsc), 320 "Cloud" (cld), 17 "Water" (wat), and 509 "miscellanea" (msc). The image data set consists of a total of 10,221 images and is available at [40].

The protein sequence data set contains overlapping 6-grams, or substrings of length 6, generated from an established benchmark for remote homologous protein sequence detection furnished by the National Center for Biotechnology Information [41]. The benchmark consists of 2,892,155 q-grams from 6,433 sequences and 103 query sequences, with an average of 174 q-grams per query sequence. For each query sequence, domain experts identified a set of true positive hits. The distance between two q-grams is measured by a weighted Hamming distance [42]. Note that distances between q-grams are discrete integers.
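To make the q-gram metric concrete, a weighted Hamming distance sums a per-position substitution cost. The sketch below is purely illustrative: the toy_cost table is hypothetical, whereas the actual weights come from the metric amino acid substitution model of [42].

def weighted_hamming(a, b, cost):
    """Weighted Hamming distance between two equal-length q-grams: the sum
    of per-position substitution costs, 0 where the symbols agree. For this
    to be a metric, cost must be symmetric and satisfy the triangle
    inequality, as the substitution weights of [42] do."""
    assert len(a) == len(b)
    return sum(cost.get((x, y), cost.get((y, x), 0.0))
               for x, y in zip(a, b) if x != y)

toy_cost = {("A", "G"): 1.0, ("A", "V"): 2.0, ("G", "V"): 2.0}  # hypothetical weights
print(weighted_hamming("AAGAGA", "AAVAGA", toy_cost))           # -> 2.0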

The mass spectra data set and its metric distance function were detailed in [33]. The data set contains theoretical spectra from the Escherichia coli K12 (E. coli) genome, the human genome, and a seven-protein mixture from the Sashimi proteomics repository (sashimi.sourceforge.net). The ground truth set (query set) of 49 spectra was constructed by first identifying the 4000+ scans of the Sashimi seven-protein mixture using BioWorks 3.1 and choosing all +2 charged spectra that were identified with an XCorr score > 2.4.

3.1. Properties of approximate k-NN search

Figure 3 illustrates the monotonic property of the search using a synthetic data set. It plots the quality of results with respect to the cost of the search, measured as the average number of distance computations. Relative error measures the degree of precision the search has achieved: when the set of objects returned by the search differs from the true k best results among all the objects, the relative error is positive. The relative error monotonically decreases and converges to zero when the exact answers are found (with additional computational cost). The results shown in Figure 3 are averages over 100 randomly selected queries.

Note that the k-NN search does not stop immediately even when all k correct results have been found: the algorithm must continue checking the remaining data in the search space until it confirms that all k closest results have indeed been found. The "time lag" between finding all correct results and verifying them accounts for a large part of the search cost. Figure 4 shows the average number of distance computations over 100 queries when finding (lower line) and returning (upper line) the ith result in a 120-nearest neighbor search over a data set of 100,000 5-dimensional vectors. Two thirds of the total search cost goes to verifying the search results.

Figure 4 The average number of distance computations over 100 queries when finding (lower line) and returning (upper line) ith results from 100,000 data objects.

As Figure 5 shows, the lag between finding and returning the ith result increases with the dimension of the data. For 10-dimensional vector data, the cost of returning results is about 10 times the cost of finding them. For 20-dimensional data, the exact k-NN search requires a linear scan of the entire database. These results confirm empirically that nearest neighbor search becomes non-indexable as the data dimension increases. However, due to the use of an index structure, the kth result can always be found without scanning the entire database. Our results indicate that the exact k nearest neighbor search spends, on average, more than 80% of its computations on confirming the results.

The anytime k-NN search algorithm can terminate a search at any time and return an approximate result set, optimizing the trade-off between accuracy and computational cost. This trade-off is illustrated for protein q-grams (Figure 6) and image data (Figure 7). In Figure 6, the ith point corresponds to the cost and accuracy of the search to identify the top 200 results using the ith smallest distance rule. Recall that the distance distribution among protein 6-grams is discrete. The first point shows the initial 200 results (with the nearest neighbors) found at a cost of 7,800 distance computations on average, with an average relative error of 0.37. The second point shows the results returned by early termination using the "2nd smallest distance" rule: the average cost increases 8 times, while the relative error is reduced to 0.08. The relative error reduces to 0 when using the "5th smallest distance" rule, at a cost of over 164k distance computations. The accuracy gain diminishes as i gets larger.

Figure 7 shows the results of similar experiments with image data. A significant drop in relative error is observed as early as the 2nd smallest distance rule, but the answer set does not converge to the optimal solution until the 74th smallest distance is confirmed. The results using vector data and spectra data, not presented here, show similar behavior. These experimental results indicate opportunities to optimize the trade-off between accuracy and cost through early termination of the anytime k nearest neighbor search.

3.2. Early Terminated Anytime k-NN Search

The effect of the approximation due to early termination of a k nearest neighbor search is studied in the context of practical applications. Although the relative error used in the previous section measures the degree of approximation between the results of early terminated and completed k nearest neighbor searches, it does not sufficiently reflect the effect of approximation on practical applications. For applications with layered processing on top of DBMSs, the goal of similarity retrieval is to select a candidate set for further processing. The exact retrieval results often contain false positives and false negatives due to the use of heuristic distance functions or relaxed search criteria. This layered processing leaves room for approximate results. We identified three types of applications that can make use of an approximation approach: content-based image retrieval, sequence homology search, and mass spectra identification. For each workload, we define an application-specific accuracy measure and compare early terminated search with completed k nearest neighbor search or range search. The experimental results indicate that the effect of the approximation caused by early termination is negligible in practice.

3.2.1. Content based image retrieval. The quality of content-based image retrieval is subjective and hard to define objectively; it is nearly impossible to precisely model users' preferences and intentions computationally. We therefore use the label associated with each image as a reference: images from a set of retrieval results are considered true positive hits if they have the same label as the query image.

Figure 5 From top to bottom: average cost of returning the ith result for 20-d and 10-d vector data, and average cost of finding the ith result for 20-d and 10-d vector data.

Figure 6 Relative error and computational costs with biological sequence data.

Figure 7 Relative error and computational costs for image data.

One hundred images are randomly selected as the query set from the entire data set, excluding images labeled miscellanea. To study the effect caused solely by early termination, we compare the results of early terminated k-nearest neighbor searches with the top-k results. Figure 8 shows averaged results of 100-NN search over the 100 image queries. The relative accuracy is defined as the ratio between the number of true positive hits in the search result and the number of true positive hits in the top-k results. The relative cost is defined as the ratio between the average number of distance computations of a search and that of a completed k nearest neighbor search. The points in Figure 8 correspond to various early terminated anytime k-nearest neighbor searches; e.g., the leftmost point corresponds to early termination at the 1st smallest distance. The results indicate that early termination with the 1st smallest distance rule is about two orders of magnitude faster than a completed 100-NN search while achieving 80% of its accuracy. If terminated after the 2nd smallest distance, the approximate search achieves 98% of the accuracy of a completed 100-NN search and is about two times faster.

3.2.2. Homologous sequence retrieval. Homologous sequence retrieval, a basic problem in biological sequence analysis, seeks to retrieve from a biological sequence database a set of sequences that are similar to the query sequence. Computational solutions often treat sequences as sets of q-grams, or fragments of length q. A q-gram based homologous sequence retrieval starts with proximity retrieval of q-grams, followed by a chaining algorithm that stitches similar q-grams together [43]. For each query sequence, similar q-grams from the database are retrieved for every q-gram in the query sequence and combined to determine a set of similar sequences. Because the final similarity is determined by combining the results of multiple similar q-gram retrievals, overall performance can be improved by slightly sacrificing the accuracy of individual q-gram retrievals. This hypothesis is confirmed by our experimental results applying early terminated k nearest neighbor search.

The protein sequence data set contains a set of 103 queries and corresponding true positive hits [41]. On average, 174 similar q-gram retrievals are conducted for each query sequence. The homologous sequences are determined from the retrieved 6-grams using an implementation of a chaining algorithm described in [43]. For each sequence query, accuracy is measured as the accumulated number of true positive hits found before the 50th false positive hit. Since range searches are often used for q-gram searches, we use the accuracy and cost of a range search as the baseline. The relative accuracy of a search is the ratio between the accuracy of the search result and the accuracy of a radius-5 range search. The relative cost of a search is the ratio between the average number of distance computations of that search and that of a radius-5 range search.

Figure 9 plots the relative accuracy and relative cost of several k nearest neighbor searches. The top curve shows the results of a completed k nearest neighbor search, and the bottom curve shows the results of a k nearest neighbor search terminated using the 1st smallest distance rule. The points of each curve correspond to searches with different values of k; from left to right, k is set to 30, 50, 100, 150, and 200. The results show that the early terminated k nearest neighbor search is about an order of magnitude faster than the range search while maintaining comparable accuracy. The early terminated k nearest neighbor search is also two to four times faster than a completed k nearest neighbor search, with slightly worse accuracy.

Figure 8 Relative accuracy vs. relative cost in searching for top 100 results.

Figure 9 Relative cost and accuracy of homology search.

Figure 10 The average accuracy measure per query sequence for various k values for mass spectra data. Solid diamond and empty square lines are for the 0 and 1st smallest distance guarantees, respectively.

Figure 13 Results of search costs for different k values using sequence data.

3.2.3. Mass spectra data. Figure 10 shows an application of approximate k-NN search to mass spectra data. In this problem, an index tree is constructed over a semi-metric distance among the mass spectra to provide coarse filtering. Accuracy is measured as the percentage of true positive hits contained in the search results. With the same limiting radius, all true positive results are found for k larger than 50 with the 1st smallest distance rule, and for k larger than 225 with the 0 smallest distance rule.

3.3. Scalability of anytime k-NN Search

Figure 11 shows the scalability of anytime k-NN search with mass spectra data for coarse filtering. We create multiple smaller databases from about 650,000 theoretical fragmentation spectra. For each database, we run a radius-bounded k-NN search that returns all true positives. In contrast to range search, where the cost increases in proportion to the size of the database, the computational cost of the approximate k-NN search hardly increases, and even decreases, as the database grows larger. Similar scalability results (not presented here) are observed for sequence data and image data.

Similar results are observed with synthetic data and are plotted in Figure 12. When the search is terminated after the 2nd smallest distance rule, the average errors range from 0.007 to 0.02; after the 3rd smallest distance rule, they range from 0.002 to 0.008. Figure 13 shows how search cost increases with k for sequence data. The top curve corresponds to a completed k nearest neighbor search, and the bottom one to a k nearest neighbor search terminated after the 1st smallest distance. For both searches, the search cost, measured as the average number of distance computations, grows linearly with k. However, the early terminated k nearest neighbor search has a notably smaller coefficient than the completed one: linear regression gives a coefficient of 75.01 for the search terminated after the 1st smallest distance, about one tenth of the 723.34 for the completed k nearest neighbor search. Results for vector data, image data, and mass spectra data are similar and are not shown here.

4. Conclusions and Future Work

Dimensionality and data distribution play an important role in distance-based retrieval. The anytime k-nearest neighbor search method enables resource-bounded search that optimizes the trade-off among accuracy of the results, search cost, and scalability. We identify three types of database applications that can benefit from this search strategy. Experimental results show our method achieves the same accuracy at a fraction of the cost of a full range search or an exact k nearest neighbor search.

The main advantage of an anytime k nearest neighbor search is that it enables an easy switch between exact and approximate search. For database applications with distance-based retrieval, search performance varies greatly among individual queries. The anytime k nearest neighbor search allows a query engine to determine when to terminate the search process. Determining the timing for early termination is central to the effectiveness of the approximate results and remains an open issue. The conditions detailed in this paper depend on confirmation of the search spaces explored and work well with the tested applications. Other early termination conditions can be explored to meet the specific needs of future applications. Reminiscent of numerical methods, the rate of convergence of the error term is an obvious candidate. Furthermore, it is plausible that, similar to analysis methods for approximate nearest neighbor search, termination could be determined by a dynamically computed confidence level [17].

Figure 11 Computational costs of different search methods for mass spectra databases of different sizes.

Figure 12 The average number of distance computations when the search is stopped at different conditions: all results found (solid line), 3rd smallest distance (dotted line), and 2nd smallest distance (broken line), for database sizes ranging from 200,000 to 1,800,000.

While our results confirm that proximity search performance approaches that of a linear scan as the data dimension increases, they also present a new aspect of the usefulness of an index structure for proximity search. Shaft and Ramakrishnan proved a formal condition under which nearest neighbor search is outperformed by a linear scan [15]. Central to their proof is that the computed lower bound can always be smaller than the distance of the nearest neighbor, in which case all nodes must be examined during the search. Our anytime k nearest neighbor search always follows the node with the lowest lower bound, starting from the root, whose lower bound is zero. The results indicate that, due to the use of an index structure, the k nearest neighbors can on average be found without examining the entire database. Although we do not show direct results, we do not expect similar behavior for a linear scan, where the data set appears to be random.

Our search results also illustrate a "superscalable" search behavior, where the search cost decreases as the database size increases. Similar behavior has been noticed in parallel depth-first search and in other nearest neighbor search methods, where speed-up is observed with increasing values of k [23, 44]. It is our conjecture that the superscalable behavior we are seeing is a related phenomenon. Our applications concern finite metric spaces; an increase in database size does not change the definition of the space, but it does increase the density of points in the space. This "superscalable" behavior has considerable potential for managing searches over large-scale data.

Our approach and other approximate nearest neighbor approaches share the goal of overcoming the "curse of dimensionality." While our approach also generates approximate results, it differs substantially from other approximate approaches in its intended applications. Approximation approaches often concern relatively small values of k, usually 1 to 10. The distance of the kth result grows as k increases, and for a large k the ε value is no longer an accurate measure of the degree of overall approximation in a (1+ε) k nearest neighbor search. Our search approach targets k in the range of hundreds. Such large k values are very common in scientific database applications, such as bioinformatics, where the distance function is not a perfect model and the goal of similarity retrieval is to determine a candidate set for further processing. For this class of applications, an exact solution to the search may not actually be any better than the slightly different solutions returned by an approximate search.

This research is funded in part by the National Science Foundation grant DBI-0640923.

5. References

[1] S. Chaudhuri and L. Gravano, "Evaluating Top-k Selection Queries," in Proceedings of the 25th International Conference on Very Large Data Bases. Edinburgh, Scotland: Morgan Kaufmann Publishers Inc., 1999.
[2] R. Agrawal and E. L. Wimmers, "A framework for expressing and combining preferences," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. Dallas, Texas, United States: ACM Press, 2000.
[3] I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid, "Supporting top-k join queries in relational databases," The VLDB Journal, vol. 13, pp. 207-221, 2004.
[4] S. Guha, N. Koudas, A. Marathe, and D. Srivastava, "Merging the results of approximate match operations," in VLDB. Toronto, Canada: Morgan Kaufmann Publishers, 2004, pp. 636-647.
[5] A. Kini, S. Shankar, J. F. Naughton, and D. J. DeWitt, "Database support for matching: limitations and opportunities," in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. Chicago, IL, USA: ACM Press, 2006.
[6] N. Roussopoulos, S. Kelley, and F. Vincent, "Nearest neighbor queries," in SIGMOD. San Jose, USA: ACM Press, 1995, pp. 71-79.
[7] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, pp. 509-517, 1975.
[8] A. Guttman, "R-trees: a dynamic index structure for spatial searching," in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data. Boston, Massachusetts: ACM Press, 1984.
[9] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: an efficient and robust access method for points and rectangles," in Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data. Atlantic City, New Jersey, United States: ACM Press, 1990.
[10] H. Samet, Foundations of Multidimensional and Metric Data Structures. San Francisco, CA, USA: Morgan Kaufmann, 2006.
[11] R. Krauthgamer and J. R. Lee, "The black-box complexity of nearest-neighbor search," Theor. Comput. Sci., vol. 348, pp. 262-276, 2005.
[12] P. Ciaccia, M. Patella, and P. Zezula, "M-tree: An Efficient Access Method for Similarity Search in Metric Spaces," in Proceedings of the 23rd International Conference on Very Large Data Bases. Athens, Greece: Morgan Kaufmann Publishers Inc., 1997.
[13] D. Cantone, A. Ferro, A. Pulvirenti, D. R. Recupero, and D. Shasha, "Antipole Tree Indexing to Support Range Search and K-Nearest Neighbor Search in Metric Spaces," IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 535-550, 2005.
[14] R. Weber, H. J. Schek, and S. Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," in International Conference on Very Large Data Bases. New York City, New York, USA, 1998, pp. 194-205.
[15] U. Shaft and R. Ramakrishnan, "Theory of nearest neighbors indexability," ACM Trans. Database Syst., vol. 31, pp. 814-838, 2006.
[16] S. Brin, "Near Neighbor Search in Large Metric Spaces," in Proceedings of the 21st International Conference on Very Large Data Bases. Zurich, Switzerland: Morgan Kaufmann Publishers Inc., 1995.
[17] P. N. Yianilos, "Locally lifting the curse of dimensionality for nearest neighbor search (extended abstract)," in Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms. San Francisco, California, United States: Society for Industrial and Applied Mathematics, 2000.
[18] K. L. Clarkson, "Nearest neighbor queries in metric spaces," in Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. El Paso, Texas, United States: ACM Press, 1997.
[19] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. Dallas, Texas, United States: ACM Press, 1998.
[20] E. Kushilevitz, R. Ostrovsky, and Y. Rabani, "Efficient search for approximate nearest neighbor in high dimensional spaces," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. Dallas, Texas, United States: ACM Press, 1998.
[21] G. Shakhnarovich, P. Viola, and T. Darrell, "Fast Pose Estimation with Parameter-Sensitive Hashing," in Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2. Nice, France: IEEE Computer Society, 2003.
[22] P. Zezula, P. Savino, G. Amato, and F. Rabitti, "Approximate similarity retrieval with M-trees," The VLDB Journal, vol. 7, pp. 275-293, 1998.
[23] P. Ciaccia and M. Patella, "PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces," in Proceedings of the 16th International Conference on Data Engineering. San Diego, California, USA: IEEE Computer Society, 2000.
[24] V. Athitsos, M. Potamias, P. Papapetrou, and G. Kollios, "Nearest Neighbor Retrieval Using Distance-Based Hashing," in IEEE International Conference on Data Engineering (ICDE), to appear. Cancun, Mexico: IEEE Computer Society, 2008.
[25] B. Bustos and G. Navarro, "Probabilistic proximity searching algorithms based on compact partitions," J. of Discrete Algorithms, vol. 2, pp. 115-134, 2002.
[26] E. Chávez, K. Figueroa, and G. Navarro, "Proximity searching in high dimensional spaces with a proximity preserving order," in 4th Mexican International Conference on Artificial Intelligence, vol. 3789. Monterrey, Mexico: LNAI, 2005, pp. 405-414.
[27] E. Chávez and G. Navarro, "A compact space decomposition for effective metric indexing," Pattern Recogn. Lett., vol. 26, pp. 1363-1376, 2005.
[28] E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín, "Searching in metric spaces," ACM Computing Surveys (CSUR), vol. 33, pp. 273-321, 2001.
[29] G. Navarro, "Searching in metric spaces by spatial approximation," The VLDB Journal, vol. 11, pp. 28-46, 2002.
[30] G. Hjaltason and H. Samet, "Incremental similarity search in multimedia databases," Department of Computer Science, University of Maryland, Technical Report TR 4199, 2000.
[31] G. Hjaltason and H. Samet, "Distance browsing in spatial databases," ACM Trans. Database Syst., vol. 24, pp. 265-318, 1999.
[32] R. Mao, W. Xu, N. Singh, and D. P. Miranker, "An Assessment of a Metric Space Database Index to Support Sequence Homology," International Journal on Artificial Intelligence Tools (IJAIT), 2005.
[33] S. R. Ramakrishnan, R. Mao, A. A. Nakorchevskiy, J. T. Prince, W. S. Willard, W. Xu, E. M. Marcotte, and D. P. Miranker, "A fast coarse filtering method for peptide identification by mass spectrometry," Bioinformatics, vol. 22, pp. 1524-1531, 2006.
[34] T. P. Mann and W. S. Noble, "Efficient identification of DNA hybridization partners in a sequence database," Bioinformatics, vol. 22, pp. e350-e358, 2006.
[35] W. Chen and K. Aberer, "Efficient Querying on Genomic Databases," in 8th Int. Workshop on Database and Expert System Applications. Toulouse, France: IEEE Computer Society Press, 1997.
[36] W. J. Kent, "BLAT - The BLAST-Like Alignment Tool," Genome Res., vol. 12, pp. 656-664, 2002.
[37] J. Buhler, "Efficient large-scale sequence comparison by locality-sensitive hashing," Bioinformatics, vol. 17, pp. 419-428, 2001.
[38] T. Bozkaya and M. Ozsoyoglu, "Indexing large metric spaces for similarity search queries," ACM Trans. Database Syst., vol. 24, pp. 361-404, 1999.
[39] Q. Iqbal and J. K. Aggarwal, "Image Retrieval Via Isotropic and Anisotropic Mappings," in Proceedings of the 1st International Workshop on Pattern Recognition in Information Systems: in conjunction with ICEIS 2001. Setúbal, Portugal: ICEIS Press, 2001.
[40] http://aug.csres.utexas.edu/mobios-workload/
[41] A. A. Schäffer, L. Aravind, T. L. Madden, S. Shavirin, J. L. Spouge, Y. I. Wolf, E. V. Koonin, and S. F. Altschul, "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucl. Acids Res., vol. 29, pp. 2994-3005, 2001.
[42] W. Xu and D. P. Miranker, "A Metric Model of Amino Acid Substitution," Bioinformatics, vol. 20, pp. 1214-1221, 2004.
[43] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[44] V. N. Rao and V. Kumar, "Superlinear Speedup in Parallel State-Space Search," in 8th Conference on Foundations of Software Technology and Theoretical Computer Science. Pune, India: Springer-Verlag, 1988, pp. 161-174.
