

Antipole Tree Indexing to Support Range Search and K-Nearest Neighbor Search in Metric Spaces

Domenico Cantone, Alfredo Ferro, Alfredo Pulvirenti, Diego Reforgiato Recupero, and Dennis Shasha

Abstract: Range and k-nearest neighbor searching are core problems in pattern recognition. Given a database S of objects in a metric space M and a query object q in M, in a range searching problem the goal is to find the objects of S within some threshold distance to q, whereas in a k-nearest neighbor searching problem, the k elements of S closest to q must be produced. These problems can obviously be solved with a linear number of distance calculations, by comparing the query object against every object in the database. However, the goal is to solve such problems much faster. We combine and extend ideas from the M-Tree, the Multivantage Point structure, and the FQ-Tree to create a new structure in the bisector tree class, called the Antipole Tree. Bisection is based on the proximity to an Antipole pair of elements generated by a suitable linear randomized tournament. The final winners (a, b) of such a tournament are far enough apart to approximate the diameter of the splitting set. If dist(a, b) is larger than the chosen cluster diameter threshold, then the cluster is split. The proposed data structure is an indexing scheme suitable for (exact and approximate) best match searching in generic metric spaces. The Antipole Tree outperforms existing structures such as List of Clusters, M-Trees, and others by a factor of approximately two and, in many cases, achieves better clustering properties.

Index Terms: Indexing methods, similarity measures, information search and retrieval.

    1 INTRODUCTION

SEARCHING is a basic problem in metric spaces. Hence, much effort has been spent both on clustering algorithms, which are often included in the searching process as a preliminary step (see BIRCH [53], DBSCAN [24], CLIQUE [3], BIRCH* [27], WaveClusters [46], CURE [32], and CLARANS [41]), and on the development of new indexing techniques (see, for instance, MVP-Tree [9], M-Tree [22], SLIM-Tree [48], FQ-Tree [4], List of Clusters [16], and SAT [40]; the reader is also referred to [18] for a survey of this subject). For the special case of Euclidean spaces, see [2], [29], [8], X-Tree [7], and CHILMA [47].

We combine and extend ideas from the M-Tree, MVP-Tree, and FQ-Tree structures, together with randomized techniques coming from the approximation algorithms community [6], to design a simple and efficient indexing scheme called the Antipole Tree. This data structure is able to support range queries and k-nearest neighbor queries in generic metric spaces.

The Antipole Tree belongs to the class of bisector trees [18], [13], [42], which are binary trees whose nodes represent sets of elements to be clustered. Its construction begins by first allocating a root r and then selecting two splitting points c1, c2 in the input set, which become the children of r. Subsequently, the points in the input set are partitioned according to their proximity to the points c1, c2. Then, one recursively constructs the tree rooted in ci, associated with the partition set of the elements closer to ci, for i = 1, 2.

A good choice for the pair (c1, c2) of splitting points consists of maximizing their distance. For this purpose, we propose a simple approximate algorithm based on tournaments of the type described in [6]. Our tournament is played as follows: At each round, the winners of the previous round are randomly partitioned into subsets of a fixed size τ and their 1-medians¹ are discarded. Rounds are played until one is left with fewer than 2τ elements. The farthest pair of points in the final set is our Antipole pair of elements.
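To make the tournament concrete, the following is a minimal Python sketch of this selection procedure. It is not the authors' pseudocode: the Euclidean dist, the tuple representation of points, and the default τ = 3 are illustrative assumptions.

```python
import math
import random

def dist(x, y):
    # Illustrative metric: Euclidean distance between points given as tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def exact_one_median(points):
    # The 1-median: the element minimizing its total distance to the set.
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

def antipole_pair(points, tau=3):
    # Randomized tournament: each round partitions the survivors into subsets
    # of size tau and discards the local 1-median of every subset.
    winners = list(points)              # assumes len(points) >= 2
    while len(winners) >= 2 * tau:
        random.shuffle(winners)
        survivors = []
        for i in range(0, len(winners), tau):
            subset = winners[i:i + tau]
            if len(subset) >= 2:
                subset.remove(exact_one_median(subset))
            survivors.extend(subset)
        winners = survivors
    # The farthest pair among the last survivors is the Antipole pair.
    return max(((a, b) for a in winners for b in winners if a < b),
               key=lambda ab: dist(*ab))
```

Each full round discards about one element in τ, so the total number of distance computations stays linear in the input size, which is what makes the tournament cheap enough to run at every split.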

The paper is organized as follows: In the next section, we give the basic definitions of range search and k-nearest neighbor queries in general metric spaces, and we briefly review relevant previous work, with special emphasis on those structures which have been shown to be the most effective, such as List of Clusters [16], M-Trees [22], and MVP-Trees [9]. The Antipole Tree is described in Section 3. Techniques to compute the approximate 1-median and the approximate diameter of a subset of a generic metric space are illustrated, respectively, in Sections 3.1 and 3.2. In Section 4, we present a procedure for range searching on the Antipole Tree. Section 5 presents an algorithm for the exact k-nearest neighbor problem. The Antipole Tree is experimentally compared with List of Clusters, the M-Tree, and the MVP-Tree in Section 6; in particular, cluster diameter threshold tuning is discussed there. An approximate k-nearest neighbor algorithm is introduced in Section 7, and a comparison with the approximate-search version of List of Clusters [12] is given through a precision-recall analysis. In Section 8, we deal with the problem of the curse of dimensionality.


D. Cantone, A. Ferro, A. Pulvirenti, and D.R. Reforgiato are with the Dipartimento di Matematica e Informatica, Università degli Studi di Catania, Viale Andrea Doria n. 6, 95125 Catania, Italy. E-mail: {cantone, ferro, apulvirenti, diegoref}@dmi.unict.it.

D. Shasha is with the Computer Science Department, New York University, 251 Mercer Street, New York, NY 10012. E-mail: [email protected].

Manuscript received 27 Aug. 2003; revised 9 Apr. 2004; accepted 14 Sept. 2004; published online 17 Feb. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0160-0803.

1. We recall that the 1-median of a set of points S in a metric space is an element of S whose average distance from all points of S is minimal.



Indeed, in high dimension, linear scan over uniform data sets may become competitive with the best searching algorithms. However, most real-world data sets are nonuniform. We successfully compare our algorithm with linear scan on nonuniform data sets in very high-dimensional Euclidean spaces. We draw our conclusions in Section 9. Finally, the Appendix, which can be found on the Computer Society Digital Library at http://computer.org/tkde/archives.htm, proposes an efficient approximation scheme for the diameter computation in the Euclidean case.

    2 BASIC DEFINITIONS AND RELATED WORKS

Let M be a nonempty set of objects and let dist : M × M → ℝ be a function such that the following properties hold:

1. ∀x, y ∈ M: dist(x, y) ≥ 0 (positiveness);
2. ∀x, y ∈ M: dist(x, y) = dist(y, x) (symmetry);
3. ∀x ∈ M: dist(x, x) = 0 (reflexivity) and ∀x, y ∈ M: x ≠ y ⟹ dist(x, y) > 0 (strict positiveness);
4. ∀x, y, z ∈ M: dist(x, y) ≤ dist(x, z) + dist(z, y) (triangle inequality);

then the pair (M, dist) is called a metric space and dist is called its metric function. Well-known metric functions include the Manhattan distance, the Euclidean distance, the string edit distance, and the shortest path distance through a graph. Our goal is to build a low-cost data structure for the range search problem and k-nearest neighbor searching in metric spaces.

Definition 2.1 (Range query). Given a query object q, a database S, and a threshold t, the Range Search Problem is to find all objects {o ∈ S | dist(o, q) ≤ t}.

Definition 2.2 (k-Nearest Neighbor query). Given a query object q and an integer k > 0, the k-Nearest Neighbor Problem is to retrieve the k closest elements to q in S.

Our basic cost measure is the number of distance calculations, since these are often expensive in metric spaces, e.g., when computing the edit distance between strings.
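As a baseline for this cost measure, here is a minimal sketch (the function names are ours) of the trivial linear-scan solutions to Definitions 2.1 and 2.2; each performs exactly |S| distance computations, the figure any index must beat.

```python
import heapq

def range_search_scan(S, q, t, dist):
    # Definition 2.1 by brute force: one distance computation per object of S.
    return [o for o in S if dist(o, q) <= t]

def knn_scan(S, q, k, dist):
    # Definition 2.2 by brute force: also |S| distance computations.
    return heapq.nsmallest(k, S, key=lambda o: dist(o, q))
```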

Three main sources of ideas have contributed to our work. The FQ-Tree [4], an example of a structure using pivots (see [18] for an extended survey), organizes the items of a collection ranging over a metric space into the leaves of a tree data structure. Viewed abstractly, FQ-Trees consist of a vector of reference objects r1, ..., rk and a distance vector v_o associated with each object o, such that v_o[i] = dist(o, r_i). For a query object q, one computes the distance to each reference object, thus obtaining v_q. Object o cannot be within a threshold distance t from q if, for some i, |v_q[i] - v_o[i]| > t: by the triangle inequality, dist(q, o) ≥ |dist(q, r_i) - dist(o, r_i)| for every i, so a single large gap certifies that o is out of range.

    We use a similar idea except that our reference objectsare the centroids of clusters.
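A minimal sketch of the pivot idea just described follows (our illustrative code, with the reference objects chosen beforehand); the precomputed table lets the triangle inequality reject candidates without new distance computations.

```python
def build_pivot_table(S, pivots, dist):
    # Precompute v_o[i] = dist(o, r_i) for every object o, row-aligned with S.
    return [[dist(o, r) for r in pivots] for o in S]

def range_search_with_pivots(S, table, pivots, q, t, dist):
    vq = [dist(q, r) for r in pivots]        # one distance per reference object
    result = []
    for o, vo in zip(S, table):
        # Triangle inequality: dist(q, o) >= |vq[i] - vo[i]| for every i, so a
        # single gap larger than t rules o out without touching the metric.
        if any(abs(a - b) > t for a, b in zip(vq, vo)):
            continue
        if dist(o, q) <= t:                  # unavoidable final verification
            result.append(o)
    return result
```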

M-Trees [22], [20] are dynamically balanced trees. Nodes of an M-Tree store several items of the collection, provided that they are close to each other and not too numerous. If one of these conditions is violated, the node is split and a suitable subtree originating in the node is recursively constructed. In the M-Tree, each parent node corresponds to a cluster with a radius, and every child of that node corresponds to a subcluster with a smaller radius. If a centroid x has distance dist(x, q) from the query object and the radius of the cluster is r, then the entire cluster corresponding to x can be discarded if dist(x, q) > t + r.
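This pruning rule fits in a few lines. The sketch below is ours and flattens the M-Tree hierarchy into a plain list of clusters, but it shows how a single distance to a centroid can discard an entire ball of objects.

```python
def range_search_over_clusters(clusters, q, t, dist):
    # clusters: iterable of (centroid, radius, members) triples.
    result = []
    for x, r, members in clusters:
        if dist(x, q) > t + r:
            continue          # the ball B(x, r) cannot meet B(q, t): skip it
        result.extend(o for o in members if dist(o, q) <= t)
    return result
```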

From the M-Tree we take the idea that a parent node corresponds to a cluster and its children correspond to subclusters of that parent cluster. The main differences between our algorithm and the M-Tree are the construction method, the fact that clusters in the M-Tree must have a limited number of elements, and the search strategy, since our algorithm produces a binary tree data structure.

VP-Trees ([49], [52]) organize items coming from a metric space into a binary tree. The items are stored both in the leaves and in the internal nodes of the tree; the items stored in the internal nodes are the vantage points. Processing a query requires computing the distance between the query point and some of the vantage points. The construction of a VP-Tree partitions a data set according to the distances that the objects have with respect to a reference point. The median value of these distances is used as a separator to partition the objects into two balanced subsets (those as close as or closer than the median, and those farther than the median). The same procedure can be applied recursively to each of the two subsets.
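A compact sketch of this median-split construction (our code, with random vantage points; not the original VP-Tree implementation):

```python
import random
from statistics import median

class VPNode:
    def __init__(self, vantage, mu, inside, outside):
        self.vantage = vantage    # the vantage point stored at this node
        self.mu = mu              # median distance used as the separator
        self.inside = inside      # subtree of points with dist <= mu
        self.outside = outside    # subtree of points with dist > mu

def build_vp_tree(points, dist):
    if not points:
        return None
    vp = random.choice(points)
    rest = [p for p in points if p is not vp]
    if not rest:
        return VPNode(vp, 0.0, None, None)
    d = [(dist(vp, p), p) for p in rest]      # one distance per point
    mu = median(dp for dp, _ in d)            # balanced split value
    inside = [p for dp, p in d if dp <= mu]
    outside = [p for dp, p in d if dp > mu]
    return VPNode(vp, mu, build_vp_tree(inside, dist),
                  build_vp_tree(outside, dist))
```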

The Multi-Vantage-Point tree [9] is an intellectual descendant of the vantage point tree and the GNAT [10] structure. The MVP-Tree appears to be superior to the previous methods. The fundamental idea is that, given a point p, one can partition all objects into m partitions based on their distances from p, where the first partition consists of those points within distance d1 from p, the second consists of those points whose distance is greater than d1 and less than or equal to d2, and so on. Given two points, pa and pb, the partitions a1, ..., am based on pa and the partitions b1, ..., bm based on pb can be created. One can then intersect all possible a- and b-partitions (i.e., ai intersected with bj, for 1 ≤ i ≤ m and 1 ≤ j ≤ m) to get m² partitions. In an MVP-Tree, each node corresponds to two objects (vantage points) and m² children, where m is a parameter of the construction algorithm and each child corresponds to a partition. When searching for objects within distance t of a query point q, the algorithm does the following: Given a parent node having vantage points pa and pb, if some partition Z has the property that, for every object z ∈ Z, dist(z, pa) < d_Z, and dist(q, pa) > d_Z + t, then Z can be discarded. There are other reasons for discarding clusters, also based on the triangle inequality. Using multiple vantage points together with precomputed distances reduces the number of distance computations at query time. Like the MVP-Tree, our structure makes aggressive use of the triangle inequality.
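Computing the m² intersected partitions is straightforward; here is a small sketch (ours), assuming the cutoff distances for each vantage point are given, e.g., as quantiles of the observed distances.

```python
import bisect

def mvp_cells(points, pa, pb, a_cuts, b_cuts, dist):
    # a_cuts, b_cuts: sorted cutoffs d_1 <= ... <= d_{m-1} for pa and pb;
    # shell 0 holds distances <= d_1, shell i holds those in (d_i, d_{i+1}].
    m = len(a_cuts) + 1
    cells = [[] for _ in range(m * m)]
    for p in points:
        i = bisect.bisect_left(a_cuts, dist(p, pa))
        j = bisect.bisect_left(b_cuts, dist(p, pb))
        cells[i * m + j].append(p)   # cell (i, j) = a_i intersected with b_j
    return cells
```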

Another relevant recent work, due to Chávez and Navarro [16], proposes a structure called List of Clusters. Such a list is constructed in the following way: Starting from a random point, a cluster with bounded diameter (or a limited number of objects) centered at that point is constructed. The process is then iterated by selecting a new point, for example the farthest from the previous one, and constructing another cluster around it. The process terminates when no more points are left. The authors experimentally show that their structure outperforms other existing methods when its parameters are suitably chosen.
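A hedged sketch of this construction (our code, using fixed-radius clusters and the farthest-from-the-previous-center heuristic, one of the options studied in [16]):

```python
def build_list_of_clusters(points, radius, dist):
    remaining = list(points)          # assumes points is nonempty
    clusters = []
    center = remaining.pop(0)         # an arbitrary starting point
    while True:
        members, rest = [], []
        for p in remaining:           # one distance per remaining point
            (members if dist(center, p) <= radius else rest).append(p)
        clusters.append((center, members))
        if not rest:
            return clusters
        # The next center is the point farthest from the current one.
        nxt = max(rest, key=lambda p: dist(center, p))
        rest.remove(nxt)
        center, remaining = nxt, rest
```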

Other sources of inspiration include [11], [23], [26], [30], [45], [44], [48], [40].

    3 THE ANTIPOLE TREE

Let (M, dist) be a finite metric space, let S be a subset of M, and suppose that we aim to split it into the minimum possible number of clusters whose radii should not exceed a given threshold.


This problem has been studied by Hochbaum and Maass [35] for Euclidean spaces. Their approximation algorithm has been improved by Gonzalez in [31]. Similar ideas are used by Feder and Greene [25] (see [43] for an extended survey on clustering methods in Euclidean spaces).

The Antipole clustering of bounded radius is performed by a recursive top-down procedure starting from the given finite set of points S and checking at each step whether a given splitting condition is satisfied. If it is not, then splitting is not performed, the given subset is a cluster, and a centroid whose distance from every other node in the cluster is approximately less than the cluster radius threshold is computed by the procedure described in Section 3.1. Otherwise, if the splitting condition is satisfied, then a pair of points {A, B} of S, called the Antipole pair, is generated by the algorithm described in Section 3.2 and is used to split S into two subsets S_A and S_B, obtained by assigning each point p of S to the subset containing the endpoint of the Antipole pair {A, B} closest to p. The splitting condition states that dist(A, B) is greater than the cluster diameter threshold, corrected by the error coming from the Euclidean-case analysis described in the Appendix, which can be found on the Computer Society Digital Library at http://computer.org/tkde/archives.htm. Indeed, the diameter threshold is based on a statistical analysis of the pairwise distances of the input set (see Section 6.2), which can be used to evaluate the intrinsic dimension [18] of the metric space. The tree obtained by the above procedure is called an Antipole Tree. All nodes are annotated with the Antipole endpoints and the corresponding cluster radius; each leaf also contains the 1-median of the corresponding final cluster. The implementation is described in Section 3.3.
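Putting the pieces together, here is a minimal sketch of the recursive construction. It is our code, not the authors' implementation: antipole_pair and dist are the routines sketched in Section 1, approx_one_median is sketched in Section 3.1 below, the dictionary node layout is ours, and the Euclidean-case error correction is assumed to be folded into the threshold.

```python
def build_antipole_tree(S, diam_threshold, tau=3):
    if len(S) < 2:
        # Degenerate cluster: nothing to split.
        return {"leaf": True, "cluster": S,
                "median": S[0] if S else None, "radius": 0.0}
    A, B = antipole_pair(S, tau)
    if dist(A, B) <= diam_threshold:
        # Splitting condition fails: S is a final cluster; keep its 1-median.
        c = approx_one_median(S, tau)
        return {"leaf": True, "cluster": S, "median": c,
                "radius": max(dist(c, p) for p in S)}
    # Split S by proximity to the Antipole endpoints (ties go to A).
    SA = [p for p in S if dist(p, A) <= dist(p, B)]
    SB = [p for p in S if dist(p, B) < dist(p, A)]
    return {"leaf": False, "antipole": (A, B),
            "left": build_antipole_tree(SA, diam_threshold, tau),
            "right": build_antipole_tree(SB, diam_threshold, tau)}
```

Since A always falls in SA and B in SB, both subsets are nonempty and strictly smaller than S, so the recursion terminates. The per-node radius annotations used by the search procedures of Sections 4 and 5 are omitted here for brevity.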

    3.1 1-Median

In this section, we review a randomized algorithm for the approximate 1-median selection [14], an important subroutine in our Antipole Tree construction. It is based on a tournament played among the elements of the input set S. At each round, the elements which passed the preceding turn are randomly partitioned into subsets, say X1, ..., Xk. Then, each subset Xi is locally processed by a procedure which computes its exact 1-median xi. The elements x1, ..., xk move to the next round. The tournament terminates when we are left with a single element x, the final winner, which approximates the exact 1-median of S. Fig. 1 contains the pseudocode of this algorithm; the local optimization procedure 1-MEDIAN(X) returns the exact 1-median of X. A running time analysis (see [14] for details) shows that the above procedure takes time (τ/2) · n + o(n) in the worst case.
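A minimal sketch of this subroutine (ours, not the pseudocode of Fig. 1), reusing the module-level dist from the earlier sketch and the tournament size τ = 3 used later in the paper:

```python
import random

def approx_one_median(S, tau=3):
    # Tournament of [14]: local exact 1-medians advance; the last one wins.
    players = list(S)                  # assumes S is nonempty
    while len(players) > 1:
        random.shuffle(players)
        winners = []
        for i in range(0, len(players), tau):
            subset = players[i:i + tau]
            # Exact local 1-median: tau * (tau - 1) / 2 distance computations.
            winners.append(min(subset,
                               key=lambda p: sum(dist(p, q) for q in subset)))
        players = winners
    return players[0]
```

Each round keeps one element per subset, so the number of survivors shrinks by a factor of about τ per round and the total work is linear in |S|.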

    3.2 The Diameter (Antipole) Computation

Let (M, dist) be a metric space with distance function dist : M × M → ℝ and let S be a finite subset of M. The diameter computation problem, or furthest pair problem, is to find a pair of points (A, B) in S such that dist(A, B) ≥ dist(x, y) for all x, y ∈ S. As observed in [36], one can construct a metric space where all distances among objects are set to 1 except for one, randomly chosen, which is set to 2. In this case, any algorithm that tries to guarantee an approximation factor greater than 1/2 must examine all pairs, so a randomized algorithm will not necessarily find that pair.

Nevertheless, we expect a good outcome in nearly all cases. Here, we introduce a randomized algorithm inspired by the one proposed for the 1-median computation [14] and reviewed in the preceding section. In this case, each subset Xi is locally processed by a procedure LOCAL_WINNER, which computes the exact 1-median xi of Xi and then returns the set obtained by removing xi from Xi. The elements of all these returned sets are used in the subsequent step. The tournament terminates when we are left with a single set X, from which we extract the final winners (A, B) as the furthest pair of points in X. The pair (A, B) is called the Antipole pair, and its distance represents the approximate diameter of the set S.

The pseudocode of the Antipole algorithm APPROX_ANTIPOLE, similar to that of the 1-Median algorithm, is given in Fig. 1. A faster (but less accurate) variant of APPROX_ANTIPOLE can be used. This variant, called FAST_APPROX_ANTIPOLE, consists of letting each subset Xi survive only as its farthest pair; its pseudocode can therefore be obtained simply by replacing each call to LOCAL_WINNER in APPROX_ANTIPOLE with a call to FIND_ANTIPOLE. In the next section, we will prove that both variants have a running time linear in the number of elements. We will also show that FAST_APPROX_ANTIPOLE is also linear in the tournament size τ, whereas APPROX_ANTIPOLE is quadratic with respect to τ.
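Under the same assumptions as the earlier sketches (our names, the module-level dist), the faster variant can be written as follows; FIND_ANTIPOLE is simply the exhaustive farthest-pair scan of a small set.

```python
import itertools
import random

def find_antipole(X):
    # Exhaustive farthest pair of a small set: |X| * (|X| - 1) / 2 distances.
    return max(itertools.combinations(X, 2), key=lambda ab: dist(*ab))

def fast_approx_antipole(S, tau=3):
    # Like APPROX_ANTIPOLE, but every subset survives only as its local
    # farthest pair instead of losing just its 1-median.
    threshold = max(2, min(2 * tau - 1, int(len(S) ** 0.5)))
    players = list(S)                  # assumes len(S) >= 2
    while len(players) > threshold:
        random.shuffle(players)
        survivors = []
        for i in range(0, len(players), tau):
            subset = players[i:i + tau]
            survivors.extend(find_antipole(subset) if len(subset) > 2
                             else subset)
        players = survivors
    return find_antipole(players)
```

Each subset still costs τ(τ - 1)/2 distances, but it now discards τ - 2 elements at once rather than one, so the cost per discarded element is linear rather than quadratic in τ.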

For tournaments of size 3, both variants plainly coincide. Thus, since in the rest of the paper only tournaments of size 3 will be considered, no accuracy is lost by referring to the faster variant.


    Fig. 1. The 1-Median algorithm.


    3.2.1 Running Time Analysis of Antipole Computation

Two fundamental parameters of the algorithm reported in Fig. 2 (and also present in Fig. 1), namely the splitting factor τ (also referred to as the tournament size) and the parameter threshold, need to be tuned.

The splitting factor τ sets the size of each subset X processed by procedure LOCAL_WINNER, with the only exceptions of one subset per round of the tournament (whose size is at most 2τ - 1) and of the argument of the last call to FIND_ANTIPOLE (whose size is at most equal to threshold). Clearly, the larger τ is, the better the output quality and the higher the computational cost. In many cases, a satisfying output quality can be obtained even with small values of τ.

A good trade-off between output quality and computational cost is obtained by choosing as the value of τ one unit more than the dimension that characterizes the investigated metric space [18]. This suggestion rests on intuitive grounds developed in the case of a Euclidean metric space ℝ^m and is largely confirmed by the experiments reported in [14]. The parameter threshold controls the termination of the tournament. Again, larger values of threshold ensure better output quality, though at increasing cost. Observe that the value 2τ - 1 for threshold forces the last set of elements, from which the final winner is selected, to contain at least τ elements, provided that |S| ≥ τ. Moreover, in order to ensure a linear computational complexity of the algorithm, the threshold value needs to be O(√|S|). Consequently, a good choice is threshold = min(2τ - 1, √|S|).

The algorithm APPROX_ANTIPOLE given in Fig. 2 is characterized by its simplicity and, hence, is expected to be very efficient from the computational point of view, at least when the parameters τ and threshold are taken small enough. In fact, we will show below that our algorithm has a worst-case complexity of (τ(τ - 1)/2) · n + o(n) in the input size n, provided that threshold is o(√n).

Plainly, the complexity of the algorithm APPROX_ANTIPOLE is dominated by the number of distances computed within its calls to procedure LOCAL_WINNER. We estimate this number below. Let W(n, τ, ϑ) be the number of calls to procedure LOCAL_WINNER made within the while-loops by APPROX_ANTIPOLE, with an input of size n and parameters τ ≥ 3 and threshold ϑ ≥ 1. Plainly, W(n, τ, ϑ) ≤ W(n, τ, 1) for any ϑ ≥ 1, thus it will suffice to find an upper bound for W(n, τ, 1). For notational convenience, let us put W₁(n) = W(n, τ, 1), where τ has been fixed. It can easily be seen that W₁(n) satisfies the following recurrence relation:

W₁(n) = 0, if 0 ≤ n ≤ 2;
W₁(n) = 1, if 3 ≤ n < 2τ;
W₁(n) = ⌊n/τ⌋ + W₁(n - ⌊n/τ⌋), if n ≥ 2τ.
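The recurrence is easy to evaluate mechanically. The small check below (our code) confirms the linearity claim: every pass of the loop adds ⌊n/τ⌋ calls while discarding the same number of elements, so W₁(n) < n for every n ≥ 3.

```python
def w1(n, tau=3):
    # Iterative evaluation of the recurrence for W_1(n) given above.
    calls = 0
    while n >= 2 * tau:
        calls += n // tau     # LOCAL_WINNER calls made in this round
        n -= n // tau         # each call discards exactly one element
    return calls + (1 if n >= 3 else 0)

assert all(w1(n) < n for n in range(3, 10**5))
```

Multiplying by the τ(τ - 1)/2 distances per call then yields the τ(τ - 1)/2 · n + o(n) bound claimed above.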