Filtering Search: A New Approach to Query-Answering

SIAM J. COMPUT.
Vol. 15, No. 3, August 1986

© 1986 Society for Industrial and Applied Mathematics

FILTERING SEARCH: A NEW APPROACH TO QUERY-ANSWERING*

BERNARD CHAZELLE†

Abstract. We introduce a new technique for solving problems of the following form: preprocess a set of objects so that those satisfying a given property with respect to a query object can be listed very effectively. Well-known problems that fall into this category include range search, point enclosure, intersection, and near-neighbor problems. The approach which we take is very general and rests on a new concept called filtering search. We show on a number of examples how it can be used to improve the complexity of known algorithms and simplify their implementations as well. In particular, filtering search allows us to improve on the worst-case complexity of the best algorithms known so far for solving the problems mentioned above.

Key words. computational geometry, database, data structures, filtering search, retrieval problems

AMS (MOS) subject classifications. CR categories 5.25, 3.74, 5.39

1. Introduction. A considerable amount of attention has been recently devoted to the problem of preprocessing a set S of objects so that all the elements of S that satisfy a given property with respect to a query object can be listed effectively. In such a problem (traditionally known as a retrieval problem), we assume that queries are to be made in a repetitive fashion, so preprocessing is likely to be a worthwhile investment. Well-known retrieval problems include range search, point enclosure, intersection, and near-neighbor problems. In this paper we introduce a new approach for solving retrieval problems, which we call filtering search. This new technique is based on the observation that the complexity of the search and the report parts of the algorithms should be made dependent upon each other--a feature absent from most of the algorithms available in the literature. We show that the notion of filtering search is versatile enough to provide significant improvements to a wide range of problems with seemingly unrelated solutions. More specifically, we present improved algorithms for the following retrieval problems:

1. Interval Overlap: Given a set S of n intervals and a query interval q, report the intervals of S that intersect q [8], [17], [30], [31].

2. Segment Intersection: Given a set S of n segments in the plane and a query segment q, report the segments of S that intersect q [19], [34].

3. Point Enclosure: Given a set S of n d-ranges and a query point q in R^d, report the d-ranges of S that contain q (a d-range is the Cartesian product of d intervals) [31], [33].

4. Orthogonal Range Search: Given a set S of n points in R^d and a query d-range q, report the points of S that lie within q [1], [2], [3], [5], [7], [9], [21], [22], [23], [26], [29], [31], [35], [36].

5. k-Nearest-Neighbors: Given a set S of n points in the Euclidean plane E² and a query pair (q, k), with q ∈ E², k ≤ n, report the k points of S closest to q [11], [20], [25].

6. Circular Range Search: Given a set S of n points in the Euclidean plane and a query disk q, report the points of S that lie within q [4], [10], [11], [12], [16], [37].

* Received by the editors September 15, 1983, and in final revised form April 20, 1985. This research was supported in part by the National Science Foundation under grant MCS 83-03925, and the Office of Naval Research and the Defense Advanced Research Projects Agency under contract N00014-83-K-0146 and ARPA order no. 4786.

† Department of Computer Science, Brown University, Providence, Rhode Island 02912.


For all these problems we are able to reduce the worst-case space × time complexity of the best algorithms currently known. A summary of our main results appears in a table at the end of this paper, in the form of complexity pairs (storage, query time).

2. Filtering search. Before introducing the basic idea underlying the concept of filtering search, let us say a few words on the applicability of the approach. In order to make filtering search feasible, it is crucial that the problems specifically require the exhaustive enumeration of the objects satisfying the query. More formally put, the class of problems amenable to filtering search treatment involves a finite set of objects S, a (finite or infinite) query domain Q, and a predicate P defined for each pair in S × Q. The question is then to preprocess S so that the function g defined as follows can be computed efficiently:

g: Q → 2^S, with g(q) = {v ∈ S | P(v, q) is true}.

By computing g, we mean reporting each object in the set g(q) exactly once. We call algorithms for solving such problems reporting algorithms. Database management

and computational geometry are two areas where retrieval problems frequently arise. Numerous applications can also be found in graphics, circuit design, or statistics, just to pick a few items from the abundant literature on the subject. Note that a related, yet for our purposes here fundamentally different, class of problems calls for computing a single-valued function of the objects that satisfy the predicate. Counting instead of reporting the objects is a typical example of this other class.

On a theoretical level it is interesting to note the dual aspect that characterizes the problems under consideration: preprocessing resources vs. query resources. Although the term "resources" encompasses both space and time, there are many good reasons in practice to concentrate mostly on space and query time. One reason to worry more about storage than preprocessing time is that the former cost is permanent whereas the latter is temporary. Another is to see the preprocessing time as being amortized over the queries to be made later on. Note that in a parallel environment, processors may also be accounted as resources, but for the purpose of this paper, we will assume the standard sequential RAM with infinite arithmetic as our only model of computation.

The worst-case query time of a reporting algorithm is a function of n, the input size, and k, the number of objects to be reported. For all the problems discussed in this paper, this function will be of the form O(k + f(n)), where f is a slow-growing function of n. For notational convenience, we will say that a retrieval problem admits of an (s(n), f(n))-algorithm if there exists an O(s(n)) space data structure that can be used to answer any query in time O(k + f(n)). The hybrid form of the query time expression distinguishes the search component of the algorithm (i.e. f(n)) from the report part (i.e. k). In general it is rarely the case that, chronologically, the first part totally precedes the latter. Rather, the two parts are most often intermixed. Informally speaking, the computation usually resembles a partial traversal of a graph which contains the objects of S in separate nodes. The computation appears as a sequence of two-fold steps: (search; report).

The purpose of this work is to exploit the hybrid nature of this type of computation and introduce an alternative scheme for designing reporting algorithms. We will show on a number of examples how this new approach can be used to improve the complexity of known algorithms and simplify their implementations. We wish to emphasize the fact that this new approach is absolutely general and is not a priori restricted to any particular class of retrieval problems. The traditional approach taken in designing reporting algorithms has two basic shortcomings:

1. The search technique is independent of the number of objects to be reported. In particular, there is no effort to balance k and f(n).

2. The search attempts to locate the objects to be reported and only those.

The first shortcoming is perhaps most apparent when the entire set S is to be reported, in which case no search at all should be required. In general, it is clear that one could take great advantage of an oracle indicating a lower bound on how many objects were to be reported. Indeed, this would allow us to restrict the use of a sophisticated structure only to small values of k. Note in particular that if the oracle indicates that n/k = O(1) we may simply check all the objects naively against the query, which takes O(n) = O(k) time, and is asymptotically optimal. We can now introduce the notion of filtering search.

A reporting algorithm is said to use filtering search if it attempts to match searching and reporting times, i.e. f(n) and k. This feature has the effect of making the search procedure adaptively efficient. By letting the search become slower for large values of k, it will often be possible to reduce the size of the data structure substantially, following the general precept: the larger the output, the more naive the search.

Trying to achieve this goal points to one of the most effective lines of attack for filtering search, also thereby justifying its name: this is the prescription to report more objects than necessary, the excess number being at most proportional to the actual number of objects to be reported. This suggests including a postprocessing phase in order to filter out the extraneous objects. Often, of course, no "clean-up" will be apparent.

What we are saying is that O(k + f(n)) extra reports are permitted (as long, of course, as any object can be determined to be good or bad in constant time). Often, when k is much smaller than f(n), it will be very handy to remember that O(f(n)) extra reports are allowed. This simple-minded observation, which is a particular case of what we just said, will often lead to striking simplifications in the design of reporting algorithms. The main difficulty in trying to implement filtering search is simulating the oracle which indicates a lower bound on k beforehand. Typically, this will be achieved in successive steps, by guessing higher and higher lower bounds, and checking their validity as we go along. The time allowed for checking will of course increase as the lower bounds grow larger.
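The flavor of this guess-and-check process can be conveyed by a small example that is not taken from the paper: reporting every element of a sorted array that is at most a threshold q. Probing positions 1, 2, 4, ... plays the role of guessing larger and larger lower bounds on the output size k, so the search cost is O(log k) rather than O(log n); the helper name report_leq is ours.

import bisect

def report_leq(sorted_vals, q):
    """Report every element <= q from a sorted list.

    Galloping (doubling) probes guess larger and larger lower bounds on the
    output size k; the final binary search is confined to the last gap, so
    the total search cost is O(log k) -- it tracks the answer, not n."""
    n = len(sorted_vals)
    if n == 0 or sorted_vals[0] > q:
        return []
    hi = 1
    while hi < n and sorted_vals[hi] <= q:   # the guess k >= hi was validated; double it
        hi *= 2
    # the cutoff now lies between positions hi // 2 and min(hi, n)
    cut = bisect.bisect_right(sorted_vals, q, hi // 2, min(hi, n))
    return sorted_vals[:cut]

print(report_leq([1, 3, 4, 7, 9, 12, 15], 8))   # [1, 3, 4, 7]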

The idea of filtering search is not limited to algorithms which admit of O(k + f(n)) query times. It is general and applies to any retrieval problem with nonconstant output size. Put in a different way, the underlying idea is the following: let h(q) be the time "devoted" to searching for the answer g(q) and let k(q) be the output size; the reporting algorithm should attempt to match the functions h and k as closely as possible. Note that this induces a highly nonuniform cost function over the query domain.

To summarize our discussion so far, we have sketched a data structuring tool, filtering search; we have stated its underlying philosophy, mentioned one means to realize it, and indicated one useful implementation technique.

1. The philosophy is to make the data structure cost-adaptive, i.e. slow and economical whenever it can afford to be so.

2. The means is to build an oracle to guess tighter and tighter lower bounds on the size of the output (as we will see, this step can sometimes be avoided).

3. The implementation technique is to compute a coarse superset of the output and filter out the undesirable elements. In doing so, one must make sure that the set computed is at most proportional in size to the actual output.
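The scoop-then-filter pattern of point 3 can be illustrated with a uniform grid for circular range search (again a toy example of ours, not the structure developed in §7): every point in a cell meeting the disk's bounding square is scooped, and the misses are discarded with a constant-time distance test. Whether the scooped superset is really proportional to the output depends on the point distribution, so the grid only illustrates the mechanics.

from collections import defaultdict
from math import floor, hypot

def build_grid(points, cell):
    """Bucket points into a uniform grid with square cells of side `cell`."""
    grid = defaultdict(list)
    for (x, y) in points:
        grid[(floor(x / cell), floor(y / cell))].append((x, y))
    return grid

def circular_range_query(grid, cell, center, r):
    """Scoop every point stored in a cell that meets the disk's bounding
    square, then filter each candidate with an exact distance test."""
    cx, cy = center
    out = []
    for gx in range(floor((cx - r) / cell), floor((cx + r) / cell) + 1):
        for gy in range(floor((cy - r) / cell), floor((cy + r) / cell) + 1):
            for (x, y) in grid.get((gx, gy), []):
                if hypot(x - cx, y - cy) <= r:   # constant-time filter
                    out.append((x, y))
    return out

g = build_grid([(0, 0), (1, 1), (3, 0), (5, 5)], cell=2.0)
print(circular_range_query(g, 2.0, (1, 0), 1.5))   # [(0, 0), (1, 1)]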

W(S) induce a partition of [a_1, a_{2n}] into contiguous intervals I_1, ..., I_m and their endpoints.

Let us assume for the time being that it is possible to define a window-list which satisfies the following density-condition (δ is a constant > 1):

    Σ_{1 ≤ j ≤ m} |W_j| ≤ 2δn,

and

    ∀x ∈ I_j, S(x) ⊆ W_j and 0 < |W_j| ≤ δ · max(1, |S(x)|).

found so far in W_j. The variable "cur" is only needed to update "low". It denotes the number of intervals overlapping the current position. Whenever the condition is no longer satisfied, the algorithm initializes a new window (Clear) with the intervals already overlapping in it.

A few words must be said about the treatment of endpoints with equal value. Since ties are broken arbitrarily, the windows of null aperture which will result from shared endpoints may carry only partial information. The easiest way to handle this problem is to make sure that the pointer from any shared endpoint (except for a_{2n}) leads to the first window to the right with nonnull aperture. If the query falls right on the shared endpoint, both its left and right windows with nonnull aperture are to be examined; the windows in between are ignored, and for this reason, need not be stored. By construction, the last two parts of the density-condition, ∀x ∈ I_j, S(x) ⊆ W_j and 0 < |W_j| ≤ δ · max(1, |S(x)|), are satisfied.
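Since the construction itself is only summarized above, the sketch below is a simplified stand-in rather than the paper's exact algorithm: a new window is opened whenever the number of live intervals exceeds δ times the smallest count seen in the current window, and a query performs one binary search followed by a scan-and-filter over the candidate windows. Only the correctness of the reported set is claimed here; the space and excess-report guarantees of the true density-condition are not.

import bisect
from collections import namedtuple

Interval = namedtuple("Interval", "lo hi")

def build_window_list(intervals, delta=2.0):
    """Simplified window-list: sweep the endpoints from left to right and
    dump the currently active intervals into the current window; open a
    fresh window when the active count exceeds delta times the smallest
    count seen since the window was opened."""
    events = sorted({x for iv in intervals for x in (iv.lo, iv.hi)})
    by_lo = sorted(intervals, key=lambda iv: iv.lo)
    boundaries, windows = [events[0]], [set()]
    idx, active, low = 0, set(), 0
    for x in events:
        while idx < len(by_lo) and by_lo[idx].lo <= x:   # intervals starting at x
            active.add(by_lo[idx]); idx += 1
        active = {iv for iv in active if iv.hi >= x}      # drop intervals ending before x
        cur = len(active)
        if low and cur > delta * low:                     # density heuristic violated
            boundaries.append(x)
            windows.append(set())
            low = cur
        low = cur if low == 0 else min(low, cur)
        windows[-1] |= active
    return boundaries, windows

def overlap_query(boundaries, windows, q):
    """Report every stored interval that overlaps the query interval q."""
    j = max(bisect.bisect_right(boundaries, q.lo) - 1, 0)
    out = set()
    while j < len(windows) and boundaries[j] <= q.hi:
        for iv in windows[j]:
            if iv.lo <= q.hi and iv.hi >= q.lo:           # filter out false candidates
                out.add(iv)
        j += 1
    return out

ivs = [Interval(0, 3), Interval(2, 5), Interval(4, 9), Interval(7, 8)]
b, w = build_window_list(ivs)
print(sorted(overlap_query(b, w, Interval(3, 4))))   # the three intervals meeting [3, 4]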

THEOREM 1. There exist an (n, log n)-algorithm, based on window-lists, for solving the interval overlap problem, and an (n, 1)-algorithm for solving the discrete version of the problem. Both algorithms are optimal.

4. Segment intersection problems. Given a set S of n segments in the Euclidean plane and a query segment q in arbitrary position, report all the segments in S that intersect q.

For simplicity, we will assume (in §§4.1, 4.2, 4.3) that the n segments may intersect only at their endpoints. This is sometimes directly satisfied (think of the edges of a planar subdivision), but at any rate we can always ensure this condition by breaking up intersecting segments into their separate parts. Previous work on this problem includes an (n log n, log n)-algorithm for solving the orthogonal version of the problem in two dimensions, i.e. when query and data set segments are mutually orthogonal [34], and an (n³, log n)-algorithm for the general problem [19]. We show here how to improve both of these results, using filtering search. We first give improved solutions for the orthogonal problem (§4.1) and generalize our results to a wider class of intersection problems (§4.2). Finally we turn our attention to the general problem (§4.3) and also look at the special case where the query is an infinite line (§4.4).

4.1. The hive-graph. We assume here that the set S consists of horizontal segments and the query segment is vertical. We present an optimal (n, log n)-algorithm for this problem, which represents an improvement of a factor of log n in the storage requirement of the best method previously known [34]. The algorithm relies on a new data structure, which we call a hive-graph.

To build our underlying data structure, we begin by constructing a planar graph G, called the vertical adjacency map. This graph, first introduced by Lipski and Preparata [27], is a natural extension of the set S. It is obtained by adding the following lines to the original set of segments: for each endpoint M, draw the longest vertical line passing through M that does not intersect any other segment, except possibly at an endpoint. Note that this "line" is always a segment, a half-line, or an infinite line. It is easy to construct an adjacency-list representation of G in O(n log n) time by applying a standard sweep-line algorithm along the x-axis; we omit the details. Note that G trivially requires O(n) storage.

Next, we suggest a tentative algorithm for solving our intersection problem: preprocess G so that any point can be efficiently located in its containing region. This planar point location problem can be solved for general planar subdivisions in O(log n) query time, using O(n) space [15], [18], [24], [28]. We can now locate, say, the lowest endpoint, (x₁, y₁), of the query segment q, and proceed to "walk" along the edges of G, following the direction given by q. Without specifying the details, it is easy to see that this method will indeed take us from one endpoint to the other, while passing through all the segments of G to be reported. One fatal drawback is that many edges traversed may not contribute any item to the output. Actually, the query time may be linear in n, even if few intersections are to be reported (Fig. 2).

To remedy this shortcoming we introduce a new structure, the hive-graph of G, denoted H(G). H(G) is a refinement of G supplied with a crucial neighboring property. Like G, H(G) is a planar subdivision with O(n) vertices, whose bounded faces are rectangles parallel to the axes. It is a "supergraph" of G, in the sense that it can be constructed by adding only vertical edges to G. Furthermore, H(G) has the important property that each of its faces is a rectangle (possibly unbounded) with at most two extra vertices, one on each horizontal edge, in addition to its 4 (or fewer) corners.



    FIG. 2

It is easy to see that the existence of H(G) brings about an easy fix to our earlier difficulty. Since each face traversed while following the query's direction contains an edge supported by a segment of S to be reported, the presence of at most 6 edges per face ensures an O(k) traversal time, hence an O(k + log n) query time.

A nice feature of this approach is that it reports the intersections in sorted order. Its "on-line" nature allows questions of the sort: report the first k intersections with a query half-line. We now come back to our earlier claim.

LEMMA 2. There exists a hive-graph of G; it requires O(n) storage, and can be computed in O(n log n) time.

Proof. We say that a rectangle in G has an upper (resp. lower) anomaly if its upper (resp. lower) side consists of at least three edges. In Fig. 3, for example, the upper anomalies are on s_3, s_7, and the lower anomalies on s_1, s_2, s_4. In order to produce H(G), we augment the graph G in two passes, one pass to remove each type of anomaly. If G is given a standard adjacency-list representation and the segments of S are available in decreasing y-order, s_1, ..., s_n, the graph H(G) can be computed in O(n) time. Note that ensuring these conditions can be done in O(n log n) steps.

The two passes are very similar, so we may restrict our investigation to the first one. Wlog, we assume that all y-coordinates in S are distinct.


    FIG. 3


The idea is to sweep an infinite horizontal line L from top to bottom, stopping at each segment in order to remove the upper anomalies. Associated with L we keep a data structure X(L), which exactly reflects a cross-section of G as it will appear after the first pass.

X(L) can be most simply (but not economically) implemented as a doubly linked list. Each entry is a pointer to an old vertical edge of G or to a new vertical edge. We can assume the existence of flags to distinguish between these two types. Note the distinction we make between edges and segments. The notion of edges is related to G, so for example, each segment s_i is in general made of several edges. Similarly, vertical segments adjacent to endpoints of segments of S are made of two edges. Informally, L performs a "combing" operation: it scans across G from top to bottom, dragging along vertical edges to be added into G. The addition of these new segments results in H(G).

Initially L lies totally above G and each entry in X(L) is old (in Fig. 3, X(L) starts out with three entries). In the following, we will say that a vertical edge is above (resp. below) s_i if it is adjacent to it and lies completely above (resp. below) s_i. For i = 1, ..., n, perform the following steps.

1. Identify relevant entries. Let a_1 be the leftmost vertical edge above s_i. From a scan along X(L), retrieve, in order from left to right, all the edges A = {a_1, a_2, ..., a_m} above s_i.

2. Update adjacency lists. Insert into G the vertices formed by the intersections of s_i and the new edges of A. Note that if i = 1, there are no such edges.

3. Update X(L). Let B = {b_1, ..., b_p} be the edges below s_i, in order from left to right (note that a_1 and b_1 are collinear). Let F be a set of edges, initially empty. For each j = 1, ..., p-1, retrieve the edges of A that lie strictly between b_j and b_{j+1}. By this, we mean the set {a_u, a_{u+1}, ..., a_v} such that the x-coordinate of each a_k (u ≤ k ≤ v)

It is not difficult to see that G, in its final state, is free of upper anomalies. For every s_i, in turn, the algorithm considers the upper sides of each rectangle attached to s_i below, and subdivides these rectangles (if necessary) to ensure the presence of at most one extra vertex per upper side. Note that the creation of new edges may result in new anomalies at lower levels. This propagation of anomalies does not make it obvious that the number of edges in G should remain linear in n. This will be shown in the next paragraph.

The second pass consists of pulling L back up, applying the same algorithm with respect to lower sides. It is clear that the new graph produced, H(G), has no anomalies, since the second pass cannot add new upper anomalies. To see that the transformation does not add too many edges, we can imagine that we start with O(n) objects which have the power to "regenerate" themselves (which is different from "duplicate", i.e. the total number of objects alive at any given time is O(n) by definition). Since we extend only every other anomaly, each regeneration implies the "freezing" of at least another object. This limits the maximum number of regenerations to O(n), and each pass can at most double the number of vertical edges. This proves that |H(G)| = O(n), which completes the proof. □

THEOREM 2. It is possible to preprocess n horizontal segments in O(n log n) time and O(n) space, so that computing their intersections with an arbitrary vertical query segment can be done in O(k + log n) time, where k is the number of intersections to be reported. The algorithm is optimal.

Note that if the query segment intersects the x-axis we can start the traversal from the intersection point, which will save us the complication of performing planar point location. Indeed, it will suffice to store the rectangles intersecting the x-axis in sorted order to be able to perform the location with a simple binary search. This gives us the following result, which will be instrumental in §5.

COROLLARY 1. Given a set S of n horizontal segments in the plane and a vertical query segment q intersecting a fixed horizontal line, it is possible to report all k segments of S that intersect q in O(k + log n) time, using O(n) space. The algorithm involves a binary search in a sorted list (O(log n) time), followed by a traversal in a graph (O(k) time).

We wish to mention another application of hive-graphs, the iterative search of a database consisting of a collection of sorted lists S_1, ..., S_m. The goal is to preprocess the database so that for any triplet (q, i, j), the test value q can be efficiently looked up in each of the lists S_i, S_{i+1}, ..., S_j (assuming i ≤ j). If n is the size of the database, this can be done in O((j - i + 1) log n) time by performing a binary search in each of the relevant lists. We can propose a more efficient method, however, by viewing each list S_i as a chain of horizontal segments, with y-coordinates equal to i. In this way, the computation can be identified with the computation of all intersections between the segment [(q, i), (q, j)] and the set of segments. We can apply the hive-graph technique to this problem.

COROLLARY 2. Let C be a collection of sorted lists S_1, ..., S_m, of total size n, with elements chosen from a totally ordered domain U. There exists an O(n) size data structure so that for any q ∈ U and i, j (1 ≤ i ≤ j ≤ m)

situation, as we proceed to show. Let J be the planar subdivision defined as follows: for each endpoint p in S, draw the longest line collinear with the origin O, passing through p, that does not intersect any other segment except possibly at an endpoint (Fig. 5). We ensure that this "line" actually stops at point O if passing through it, so that it is always a segment or a half-line.

    FIG. 5

It is easy to construct the adjacency-list of J in O(n log n) time, by using a standard sweep-line technique. The sweep-line is a "radar-beam" centered at O. At any instant, the segments intersecting the beam are kept in sorted order in a dynamic balanced search tree, and segments are either inserted or deleted depending on the status (first endpoint or last endpoint) of the vertex currently scanned. We omit the details. In the following, we will say that a segment is centered if its supporting line passes through O. As before, the hive-graph H is a planar subdivision with O(n) vertices built on top of J. It has the property that each of its faces is a quadrilateral or a triangle (possibly unbounded) with two centered edges and at most two extra vertices on the noncentered edges (Fig. 6). As before, H can be easily used to solve the intersection problem at hand; we omit the details. We can construct H efficiently, proceeding as in the last section. All we need is a new partial order among the segments of S. We say that s_i



FIG. 6

Lemma 3 allows us to embed the relation ≺


4.3. The algorithm for the general case. Consider the set of all lines in the Euclidean plane. It is easy to see that the n segments of S induce a partition of this set into connected regions. A mechanical analogy will help to understand this notion. Assume that a line L, placed in arbitrary position, is free to move continuously anywhere so long as it does not cross any endpoint of a segment in S. The range of motion can be seen as a region in a "space of lines", i.e. a dual space. To make this notion more formal, we introduce a well-known geometric transform, T, defined as follows: a point p = (a, b) is mapped to the line T_p: y = ax + b in the dual space, and a line L: y = kx + d is mapped to the point T_L = (-k, d). It is easy to see that a point p lies above (resp. on) a line L if and only if the point T_L lies below (resp. on) the line T_p. Note that the mapping T excludes vertical lines. Redefining a similar mapping in the projective plane allows us to get around this discrepancy. Conceptually simpler, but less elegant, is the solution of deciding on a proper choice of axes so that no segment in S is vertical. This can be easily done in O(n log n) time.

The transformation T can be applied to segments as well, and we can easily see that a segment s is mapped into a double wedge, as illustrated in Fig. 8. Observe that since a double wedge cannot contain vertical rays, there is no ambiguity in the definition of the double wedge once the two lines are known. We can now express the intersection of segments by means of a dual property. Let W(s) and C(s) denote, respectively, the double wedge of segment s and the intersection point thereof (Fig. 8).

    FIG. 8

LEMMA 5. A segment s intersects a segment t if and only if C(s) lies in W(t) and C(t) lies in W(s).

Proof. Omitted. □

This result allows us to formalize our earlier observation. Let G be the subdivision of the dual plane created by the n double wedges W(s_i), with S = {s_1, ..., s_n}. The faces of G are precisely the dual versions of the regions of lines mentioned above. More formally, each region r in G is a convex (possibly unbounded) polygon, whose points are images of lines that all cross the same subset of segments, denoted S(r). Since the segments of S are nonintersecting, except possibly at their endpoints, the set S(r) can be totally ordered, and this order is independent of the particular point in r.
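Lemma 5 is easy to exercise numerically. The short sketch below (the helper names are ours; vertical segments and collinear overlaps are excluded) applies the transform T literally: C(s) is the dual point of the line supporting s, and membership in a double wedge W(t) amounts to lying weakly above one endpoint's dual line and weakly below the other's.

def above(pt, line):
    """Is pt = (x, y) on or above the line y = k*x + d, given as (k, d)?"""
    k, d = line
    return pt[1] >= k * pt[0] + d

def below(pt, line):
    k, d = line
    return pt[1] <= k * pt[0] + d

def dual_point(line):
    """T maps the line y = k*x + d to the point (-k, d)."""
    k, d = line
    return (-k, d)

def dual_line(p):
    """T maps the point (a, b) to the line y = a*x + b, i.e. the pair (a, b)."""
    return p

def supporting_line(p1, p2):
    """The line y = k*x + d through two points with distinct x-coordinates."""
    k = (p2[1] - p1[1]) / (p2[0] - p1[0])
    return (k, p1[1] - k * p1[0])

def in_double_wedge(x, seg):
    """x lies in W(seg) iff the line dual to x weakly separates seg's endpoints,
    i.e. x is above one endpoint's dual line and below the other's."""
    p1, p2 = seg
    return (above(x, dual_line(p1)) and below(x, dual_line(p2))) or \
           (above(x, dual_line(p2)) and below(x, dual_line(p1)))

def intersect_by_duality(s, t):
    """Lemma 5: s and t intersect iff C(s) is in W(t) and C(t) is in W(s)."""
    cs, ct = dual_point(supporting_line(*s)), dual_point(supporting_line(*t))
    return in_double_wedge(cs, t) and in_double_wedge(ct, s)

s = ((0.0, 0.0), (4.0, 4.0))
t = ((0.0, 4.0), (4.0, 0.0))   # crosses s at (2, 2)
u = ((3.0, 0.0), (4.0, 1.0))   # parallel to s, no intersection
print(intersect_by_duality(s, t), intersect_by_duality(s, u))   # True False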

Our solution for saving storage over the (n³, log n)-algorithm given in [19] relies on the following idea. Recall that q is the query segment. Let p(q) denote the vertical projection of C(q) on the first line encountered below in G. If there is no such line, no segment of S can intersect q (or even the line supporting q), since double wedges cannot totally contain any vertical ray. Since the algorithm is intimately based on the fast computation of p(q), we will devote the next few lines to it before proceeding with the main part of the algorithm.

From [14], [20], we know how to compute the planar subdivision formed by n lines in O(n²) time. This clearly allows us to obtain G in O(n²) time and space. The next step is to organize G into an optimal planar point location structure [15], [18], [24], [28]. With this preprocessing in hand, we are able to locate which region contains a query point in O(log n) steps. In order to compute p(q) efficiently, we notice that since each face of G is a convex polygon, we can represent it by storing its two chains from the leftmost vertex to the rightmost one. In this way, it is possible to determine p(q) in O(log n) time by binary search on the x-coordinate of C(q).

We are now ready to describe the second phase of the algorithm. Let H designate the line T^{-1}(p(q)), and let K denote the line supporting the query segment q (note that K = T^{-1}(C(q))). The lines H and K are parallel and intersect exactly the same segments of S, say s_{i_1}, ..., s_{i_k}, and in the same order. Let s_{i_a}, s_{i_{a+1}}, ..., s_{i_b} be the list of segments intersecting q = αβ (Fig. 9). Let p be the point whose transform T_p is the line of G passing through p(q) (note that p is an endpoint of some segment in S). It is easy to see that 1) if p lies between s_{i_a} and s_{i_b}, the segments to be reported are exactly the segments of S that intersect either pα or pβ (Fig. 9-A). Otherwise, 2) s_{i_a}, ..., s_{i_b} are the first segments of S intersected by pα or pβ starting at α or β, respectively (Fig. 9-B). Assume that we have an efficient method for enumerating the segments in S intersected by pα (resp. pβ), in the order they appear from α (resp. β). We can use it to report the intersections of S with pα and pβ in case 1), as well as the relevant subsequence of intersections of S with pα or pβ in case 2). Note that in the latter case it takes constant time to decide which of pα or pβ should be considered. Our problem is now essentially solved, since we can precompute a generalized hive-graph centered at each endpoint in S, as described in §4.2, and apply it to report the intersections of S with pα and pβ.

THEOREM 4. It is possible to preprocess n segments in O(n² log n) time and O(n²) space, so that computing their intersections with an arbitrary query segment can be done in O(k + log n) time, where k is the number of intersections to be reported. It is assumed that the interiors of the segments are pairwise disjoint.

FIG. 9

4.4. The special case of a query line. We will show that if the query q is an infinite line, we can reduce the preprocessing time to O(n²) and greatly simplify the algorithm as well. At the same time, we can relax the assumption that the segments of S should be disjoint. In other words, the segments in S may now intersect. All we have to do is compute the set R of double wedges that contain the face F where C(q) (= T_q) lies. To do so, we reduce the problem to one-dimensional point enclosure. We need two data structures for each line L in G. Let us orient each of the

5. Point enclosure problems. Given a set S of n d-ranges and a query point q in R^d, report the d-ranges of S that contain q (d ≥ 1). Note that if d = 1 we have a special case of the interval overlap problem treated in §3.

We look at the case d = 2 first. S consists of n rectangles whose edges are parallel to the axes. Let L be an infinite vertical line with |S| vertical rectangle edges on each side. The line L partitions S into three subsets: S_l (resp. S_r) contains the rectangles completely to the left (resp. right) of L, and S_m contains the rectangles intersecting L. We construct a binary tree T of height O(log n) by associating with the root the set R(root) = S_m, and defining the left (resp. right) subtree recursively with respect to S_l (resp. S_r).

Let v be an arbitrary node of T. Solving the point enclosure problem with respect to the set R(v) can be reduced to a generalized version of a one-dimensional point enclosure problem. Wlog, assume that the query q = (x, y) is on the left-hand side of the partitioning line L(v) associated with node v. Let h be the ray [(-∞, y), (x, y)]; the rectangles of R(v) to be reported are exactly those whose left vertical edge intersects h. These can be found in optimal time, using the hive-graph of §4.1. Note that by carrying out the traversal of the graph from (-∞, y) to (x, y) we avoid the difficulties of planar point location (Corollary 1). Indeed, locating (-∞, y) in the hive-graph can be done by a simple binary search in a sorted list of y-coordinates, denoted Y_l(v). Dealing with query points to the right of L(v) leads to another hive-graph and another set of y-coordinates, Y_r(v).

The data structure outlined above leads to a straightforward (n, log² n)-algorithm. Note that answering a query involves tracing a search path in T from the root to one of its leaves. This method can be improved by embedding the concept of hive-graph into a tree structure.

To speed up the searches, we augment Y_l(v) and Y_r(v) with sufficient information to provide constant time access into these lists. The idea is to be able to perform a single binary search in Y_l(root) ∪ Y_r(root), and then gain access to each subsequent list in a constant number of steps. Let Y(v) be the union of Y_l(v) and Y_r(v). By endowing Y(v) with two pointers per entry (one for Y_l(v) and the other for Y_r(v)), searching either of these lists can be reduced to searching Y(v). Let w_1 and w_2 be the two children of v, and let W_1 (resp. W_2) be the list obtained by discarding every other element in Y(w_1) (resp. Y(w_2)). Augment Y(v) by merging into it both lists W_1 and W_2. By adding appropriate pointers from Y(v) to Y(w_1) and Y(w_2), it is possible to look up a value y in Y(w_1) or Y(w_2) in constant time, as long as this look-up has already been performed in Y(v).

We carry out this preprocessing for each node v of T, proceeding bottom-up. Note that the cross-section of the new data structure obtained by considering the lists on any given path is similar to a hive-graph defined with only one refining pass (with differences of minor importance). The same counting argument shows that the storage used is still O(n). This leads to an (n, log n)-algorithm for point enclosure in two dimensions. We wish to mention that the idea of using a hive-graph within a tree structure was suggested to us by R. Cole and H. Edelsbrunner, independently, in the context of planar point location. We conclude:

THEOREM 6. There exists an (n, log n)-algorithm for solving the point enclosure problem in R².
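The constant-time hand-off from a parent list to a child list can be isolated in a few lines. In the sketch below (the names augment and locate are ours, and list values are assumed distinct), every other element of the child is merged into the parent together with a bridge pointer; after one binary search in the augmented parent, the position of the query in the child is recovered with O(1) extra comparisons.

import bisect

def augment(parent, child):
    """Merge every other element of `child` into `parent`, and record for each
    element of the augmented list the position it maps to in `child`."""
    merged = sorted(set(parent) | set(child[::2]))
    bridge = [bisect.bisect_left(child, v) for v in merged]
    return merged, bridge

def locate(merged, bridge, child, q):
    """Return bisect_left(child, q) using one search in `merged` plus O(1) work:
    because `merged` contains every other child element, the bridged guess is
    off by at most one slot."""
    i = bisect.bisect_left(merged, q)
    j = bridge[i] if i < len(bridge) else len(child)
    while j > 0 and child[j - 1] >= q:
        j -= 1
    while j < len(child) and child[j] < q:
        j += 1
    return j

parent, child = [10, 40, 70], [5, 15, 25, 35, 45, 55]
merged, bridge = augment(parent, child)
print(locate(merged, bridge, child, 30), bisect.bisect_left(child, 30))   # 3 3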

The algorithm can be easily generalized to handle d-ranges. The data structure T_d(S) is defined recursively as follows: project the n d-ranges on one of the axes, referred to as the x-axis, and organize the projections into a segment tree [8]. Recall that this involves constructing a complete binary tree with 2n-1 leaves, each leaf representing one of the intervals delimited by the endpoints of the projections. With this representation each node v of the segment-tree "spans" the interval I(v), formed by the union of the intervals associated with the leaves of the subtree rooted at v. Let w be an internal node of the tree and let v be one of its children. We define R(v) as the set of d-ranges in S whose projections cover I(v) but not I(w). We observe that the problem of reporting which of the d-ranges of R(v) contain a query point q whose x-coordinate lies in I(v) is really a (d-1)-dimensional point enclosure problem, which we can thus solve recursively. For this reason, we will store in v a pointer to the structure T_{d-1}(V), where V is the projection of R(v) on the hyperplane normal to the x-axis. Note that, of course, we should stop the recursion at d = 2 and then use the data structure of Theorem 6. A query is answered by searching for the x-coordinate in the segment tree, and then applying the algorithm recursively with respect to each structure T_{d-1}(v) encountered. A simple analysis shows that the query time is O(log^{d-1} n + output size) and the storage required, O(n log^{d-2} n).

THEOREM 7. There exists an (n log^{d-2} n, log^{d-1} n)-algorithm for solving the point enclosure problem in R^d (d > 1).
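For the segment-tree step it may help to see the canonical-node assignment and the stabbing query in isolation, i.e. the d = 1 case in which each visited node simply reports its whole bucket. The sketch below is ours (it is not the full recursive T_d structure); intervals are assumed non-degenerate, and a set is used so that an interval stored on two boundary paths is still reported exactly once.

class Node:
    __slots__ = ("l", "r", "left", "right", "bucket")
    def __init__(self, l, r):
        self.l, self.r = l, r          # node spans the elementary slabs l .. r-1
        self.left = self.right = None
        self.bucket = []               # intervals covering this span but not the parent's

def _build(l, r):
    node = Node(l, r)
    if r - l > 1:
        m = (l + r) // 2
        node.left, node.right = _build(l, m), _build(m, r)
    return node

def _insert(node, xs, lo, hi, iv):
    """Assign interval iv = [lo, hi] to the canonical nodes whose span it covers."""
    if hi < xs[node.l] or lo > xs[node.r]:
        return                                        # disjoint from the node's span
    if lo <= xs[node.l] and xs[node.r] <= hi:
        node.bucket.append(iv)                        # fully covered: store here
        return
    if node.left is not None:
        _insert(node.left, xs, lo, hi, iv)
        _insert(node.right, xs, lo, hi, iv)

def build_segment_tree(intervals):
    xs = sorted({x for iv in intervals for x in iv})
    root = _build(0, len(xs) - 1)
    for lo, hi in intervals:
        _insert(root, xs, lo, hi, (lo, hi))
    return root, xs

def stab(node, xs, q, out):
    """Collect every stored interval containing the point q: every node whose
    span contains q contributes its entire bucket."""
    out.update(node.bucket)
    for child in (node.left, node.right):
        if child is not None and xs[child.l] <= q <= xs[child.r]:
            stab(child, xs, q, out)

root, xs = build_segment_tree([(1, 4), (2, 7), (5, 9), (6, 8)])
hits = set()
if xs[0] <= 6 <= xs[-1]:
    stab(root, xs, 6, hits)
print(sorted(hits))   # [(2, 7), (5, 9), (6, 8)]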

6. Orthogonal range search problems.

6.1. Two-dimensional range search. A considerable amount of work has been recently devoted to this problem [1], [2], [3], [5], [7], [9], [21], [22], [23], [26], [29], [31], [35], [36]. We propose to improve upon the (n log^{d-1} n, log^{d-1} n)-algorithm of Willard [35], by cutting down the storage by a factor of log log n while retaining the same time complexity. The interest of our method is three-fold: theoretically, it shows that, say, in two dimensions, logarithmic query time can be achieved with o(n log n) storage, an interesting fact in the context of lower bounds. More practically, the algorithm has the advantage of being conceptually quite simple, as it is made of several independent building blocks. Finally, it has the feature of being "parametrized", hence offering the possibility of trade-offs between time and space.

When the query rectangle is constrained to have one of its sides lying on one of the axes, say the x-axis, we are faced with the grounded 2-range search problem, for which we have an efficient data structure, the priority search tree of McCreight [31]. We will see how filtering search can be used to derive other optimal solutions to this problem (§§6.3, 6.4). But first of all, let us turn back to the original problem. We begin by describing the algorithm in two dimensions, and then generalize it to arbitrary dimensions.

In preprocessing, we sort the n points of S by x-order. We construct a vertical slabs which partition S into a equal-size subsets (called blocks), denoted B_1, ..., B_a, from left to right. Each block contains n/a points, except for possibly B_a (we assume 1 < a


3. for each point found in step 2, scanning the corresponding block upwards as long as the y-coordinate does not exceed c, and reporting all the points visited, including the starting ones;

4. checking all the points in B_i and B_j, and reporting those satisfying the query.

Note that, since the blocks B_i correspond to fixed slabs, their cardinality may vary from log n to 0 in the course of dynamic operations. We assume here that the candidate points are known in advance, so that slabs can be defined beforehand. This is the case if we use this data structure for solving, say, the all-rectangle-intersection problem (see [30] for the correspondence between the two problems). This new algorithm clearly has the same time and space complexity as the previous one, but it should require much fewer dynamic operations on the priority search tree, on the average. The idea is that to insert/delete a point, it suffices to locate its enclosing slab, and then find where it fits in the corresponding block. Since a block has at most log n points, this can be done by sequential search. Note that access to the priority search tree is necessary only when the point happens to be the lowest in its block, an event which we should expect to be reasonably rare.

6.4. Other optimal schemes. In light of our previous discussion on the hive-graph, it is fairly straightforward to devise another optimal algorithm for the grounded 2-range search problem. Simply extend a vertical ray upwards starting at each point of S. The problem is now reduced to intersecting a query segment [(a, c), (b, c)] with a set of nonintersecting rays; this can be solved optimally with the use of a hive-graph (§4).

Incidentally, it is interesting to see how, in the static case, filtering search allows us to reduce the space requirement of one of the earliest data structures for the ECDF (empirical cumulative distribution function) [7] and the grounded 2-range search problems, taking it from O(n log n) to O(n). Coming back to the set of points (α_1, β_1), ..., (α_p, β_p), we construct a complete binary tree T, as follows: the root of the tree is associated with the y-sorted list of all the p points. Its left (resp. right) child is associated with the y-sorted list {(α_1, β_1), ..., (α_{⌊p/2⌋}, β_{⌊p/2⌋})} (resp. {(α_{⌊p/2⌋+1}, β_{⌊p/2⌋+1}), ..., (α_p, β_p)}). The other lists are defined recursively in the same way, and hence so is the tree T. It is clear that searching in the tree for a and b induces a canonical partition of the segment [a, b] into at most 2⌈log₂ p⌉ parts, and for each of them all the points lower than c can be obtained in time proportional to the output size. This allows us to use the procedure outlined above, replacing step 2 by a call to the new structure. The latter takes O(p log p) = O(n) space, which gives us yet another optimal (n, log n)-algorithm for the grounded 2-range search problem.
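A minimal sketch of such a tree (the function names are ours, and plain point pairs stand in for the α, β pairs of the text): each node stores its points in y-order, a query [a, b] x (-∞, c] is decomposed into canonical nodes, and every canonical list is scanned only as far as the answer extends, so the work per node is a constant plus the number of points it reports.

def build(points):
    """points must be sorted by x; each node keeps its points sorted by y."""
    if not points:
        return None
    node = {"xmin": points[0][0], "xmax": points[-1][0],
            "by_y": sorted(points, key=lambda p: p[1]),
            "left": None, "right": None}
    if len(points) > 1:
        m = len(points) // 2
        node["left"], node["right"] = build(points[:m]), build(points[m:])
    return node

def grounded_query(node, a, b, c, out):
    """Report every stored point with a <= x <= b and y <= c."""
    if node is None or node["xmax"] < a or node["xmin"] > b:
        return
    if a <= node["xmin"] and node["xmax"] <= b:
        # canonical node: scan its y-sorted list only as far as the answer goes
        for p in node["by_y"]:
            if p[1] > c:
                break
            out.append(p)
        return
    grounded_query(node["left"], a, b, c, out)
    grounded_query(node["right"], a, b, c, out)

root = build(sorted([(1, 5), (2, 1), (3, 7), (4, 2), (6, 3), (7, 9)]))
hits = []
grounded_query(root, 2, 6, 4, hits)
print(sorted(hits))   # [(2, 1), (4, 2), (6, 3)]

Each point appears in O(log p) node lists and a query visits O(log p) canonical nodes, matching the 2⌈log₂ p⌉ parts mentioned above.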

7. Near-neighbor problems. We will give a final illustration of the effectiveness of filtering search by considering the k-nearest-neighbors and the circular range search problems. Voronoi diagrams are--at least in theory--prime candidates for solving neighbor problems. The Voronoi diagram of order k is especially tailored for the former problem, since it is defined as a partition of the plane into regions all of whose points have the same k nearest neighbors. More precisely, Vor_k(S) is a set of regions R_i, each associated with a subset S_i of k points in S which are exactly the k closest neighbors of any point in R_i. It is easy to prove that Vor_k(S) forms a straight-line subdivision of the plane, and D. T. Lee has shown how to compute this diagram effectively [25]. Unfortunately, his method requires the computation of all Voronoi diagrams of order ≤ k, which entails a fairly expensive O(k²n log n) running time, as well as O(k²(n-k)) storage. However, the rewards are worth considering, since Vor_k(S) can be preprocessed to be searched in logarithmic time, using efficient planar point location methods [15], [18], [24], [28]. This means that, given any query point, we can find the region R_i where it lies in O(log n) time, which immediately gives access to the list of its k neighbors. Note that the O(k²(n-k)) storage mentioned above accounts for the list of neighbors in each region; the size of Vor_k(S) alone is O(k(n-k)). A simple analysis shows that this technique will produce an (n⁴, log n)-algorithm for the k-nearest-neighbors problem. Recall that this problem involves preprocessing a set S of n points in E², so that for any query (q, k) (q ∈ E², k ≤ n)

the general scheme of scooping a superset of candidates and then filtering out the spurious items. To demonstrate the usefulness of filtering search, we have shown its aptitude at reducing the complexity of several algorithms. The following chart summarizes our main results; the left column indicates the best results previously known, and the right the new complexity achieved in this paper.

problem | previous complexity | filtering search
discrete interval overlap | (n, log n) [17], [30], [31] | (n, 1)
segment intersection | (n³, log n) [19] | (n², log n)
segment intersection† | (n log n, log n) [34] | (n, log n)
point enclosure in R^d (d > 1) | (n log^d n, log^{d-1} n) [33] | (n log^{d-2} n, log^{d-1} n)
orthogonal range query in R^d (d > 1) | (n log^{d-1} n, log^{d-1} n) [35] | (n log^{d-1} n / log log n, log^{d-1} n)
k-nearest-neighbors | (n³, log n) [20] | (n (log n log log n)², log n) [11]
circular range query | (n³, log n log log n) [4] | (n (log n log log n)², log n) [11]

† With supporting line of query segment passing through a fixed point, or with fixed-slope query segment.

In addition to these applications, we believe that filtering search is general enough to serve as a stepping-stone for a host of other retrieval problems as well. Investigating and classifying problems that lend themselves to filtering search might lead to interesting new results. One subject which we have barely touched upon in this paper, and which definitely deserves attention for its practical relevance, concerns the dynamic treatment of retrieval problems. There have been a number of interesting advances in the area of dynamization lately [6], [32], [34], and investigating how these new techniques can take advantage of filtering search appears very useful. Also, the study of upper or lower bounds for orthogonal range search in two dimensions is important, yet still quite open. What are the conditions on storage under which logarithmic query time (or a polynomial thereof) is possible? We have shown in this paper how to achieve o(n log n). Can O(n) be realized?

Aside from filtering search, this paper has introduced, in the concept of the hive-graph, a novel technique for batching together binary searches by propagating fractional samples of the data to neighboring structures. We refer the interested reader to [13] for further developments and applications of this technique.

Acknowledgments. I wish to thank Jon Bentley, Herbert Edelsbrunner, Janet Incerpi, and the anonymous referees for many valuable comments.

REFERENCES

[1] Z. AVID AND E. SHAMIR, A direct solution to range search and related problems for product regions, Proc. 22nd Annual IEEE Symposium on Foundations of Computer Science, 1981, pp. 123-216.
[2] J. L. BENTLEY, Multidimensional divide-and-conquer, Comm. ACM, 23 (1980), pp. 214-229.
[3] J. L. BENTLEY AND J. H. FRIEDMAN, Data structures for range searching, Comput. Surveys, 11 (1979), pp. 397-409.
[4] J. L. BENTLEY AND H. A. MAURER, A note on Euclidean near neighbor searching in the plane, Inform. Proc. Lett., 8 (1979), pp. 133-136.
[5] J. L. BENTLEY AND H. A. MAURER, Efficient worst-case data structures for range searching, Acta Inform., 13 (1980), pp. 155-168.



[6] J. L. BENTLEY AND J. B. SAXE, Decomposable searching problems. I. Static-to-dynamic transformation, J. Algorithms, 1 (1980), pp. 301-358.
[7] J. L. BENTLEY AND M. I. SHAMOS, A problem in multivariate statistics: Algorithm, data structure and applications, Proc. 15th Allerton Conference on Communication, Control and Computing, 1977, pp. 193-201.
[8] J. L. BENTLEY AND D. WOOD, An optimal worst-case algorithm for reporting intersections of rectangles, IEEE Trans. Comput., C-29 (1980), pp. 571-577.
[9] A. BOLOUR, Optimal retrieval algorithms for small region queries, this Journal, 10 (1981), pp. 721-741.
[10] B. CHAZELLE, An improved algorithm for the fixed-radius neighbor problem, Inform. Proc. Lett., 16 (1983), pp. 193-198.
[11] B. CHAZELLE, R. COLE, F. P. PREPARATA AND C. K. YAP, New upper bounds for neighbor searching, Tech. Rep. CS-84-11, Brown Univ., Providence, RI, 1984.
[12] B. CHAZELLE AND H. EDELSBRUNNER, Optimal solutions for a class of point retrieval problems, J. Symb. Comput. (1985), to appear; also in Proc. 12th ICALP, 1985.
[13] B. CHAZELLE AND L. J. GUIBAS, Fractional cascading: a data structuring technique with geometric applications, Proc. 12th ICALP, 1985.
[14] B. CHAZELLE, L. J. GUIBAS AND D. T. LEE, The power of geometric duality, Proc. 24th Annual IEEE Symposium on Foundations of Computer Science, Nov. 1983, pp. 217-225.
[15] R. COLE, Searching and storing similar lists, Tech. Rep. No. 88, Computer Science Dept., New York Univ., New York, Oct. 1983.
[16] R. COLE AND C. K. YAP, Geometric retrieval problems, Proc. 24th Annual IEEE Symposium on Foundations of Computer Science, Nov. 1983, pp. 112-121.
[17] H. EDELSBRUNNER, A time- and space-optimal solution for the planar all-intersecting-rectangles problem, Tech. Rep. 50, IIG Technische Univ. Graz, April 1980.
[18] H. EDELSBRUNNER, L. GUIBAS AND J. STOLFI, Optimal point location in a monotone subdivision, this Journal, 15 (1986), pp. 317-340.
[19] H. EDELSBRUNNER, D. G. KIRKPATRICK AND H. A. MAURER, Polygonal intersection searching, Inform. Proc. Lett., 14 (1982), pp. 74-79.
[20] H. EDELSBRUNNER, J. O'ROURKE AND R. SEIDEL, Constructing arrangements of lines and hyperplanes with applications, Proc. 24th Annual IEEE Symposium on Foundations of Computer Science, Nov. 1983, pp. 83-91.
[21] R. A. FINKEL AND J. L. BENTLEY, Quad-trees: a data structure for retrieval on composite keys, Acta Informat., 4 (1974), pp. 1-9.
[22] M. L. FREDMAN, A lower bound on the complexity of orthogonal range queries, J. Assoc. Comput. Mach., 28 (1981), pp. 696-705.
[23] M. L. FREDMAN, Lower bounds on the complexity of some optimal data structures, this Journal, 10 (1981), pp. 1-10.
[24] D. G. KIRKPATRICK, Optimal search in planar subdivisions, this Journal, 12 (1983), pp. 28-35.
[25] D. T. LEE, On k-nearest neighbor Voronoi diagrams in the plane, IEEE Trans. Comput., C-31 (1982), pp. 478-487.
[26] D. T. LEE AND C. K. WONG, Quintary trees: A file structure for multidimensional database systems, ACM Trans. Database Syst., 5 (1980), pp. 339-353.
[27] W. LIPSKI AND F. P. PREPARATA, Segments, rectangles, contours, J. Algorithms, 2 (1981), pp. 63-76.
[28] R. J. LIPTON AND R. E. TARJAN, Applications of a planar separator theorem, this Journal, 9 (1980), pp. 615-627.
[29] G. LUEKER, A data structure for orthogonal range queries, Proc. 19th Annual IEEE Symposium on Foundations of Computer Science, 1978, pp. 28-34.
[30] E. M. MCCREIGHT, Efficient algorithms for enumerating intersecting intervals and rectangles, Tech. Rep. CSL-80-9, Xerox PARC, 1980.
[31] E. M. MCCREIGHT, Priority search trees, Tech. Rep. CSL-81-5, Xerox PARC, 1981; this Journal, 14 (1985), pp. 257-276.
[32] M. H. OVERMARS, The design of dynamic data structures, Ph.D. thesis, Univ. Utrecht, 1983.
[33] V. K. VAISHNAVI, Computing point enclosures, IEEE Trans. Comput., C-31 (1982), pp. 22-29.
[34] V. K. VAISHNAVI AND D. WOOD, Rectilinear line segment intersection, layered segment trees and dynamization, J. Algorithms, 3 (1982), pp. 160-176.

[35] D. E. WILLARD, New data structures for orthogonal range queries, this Journal, 14 (1985), pp. 232-253.
[36] A. C. YAO, Space-time trade-off for answering range queries, Proc. 14th Annual ACM Symposium on Theory of Computing, 1982, pp. 128-136.
[37] F. F. YAO, A 3-space partition and its applications, Proc. 15th Annual ACM Symposium on Theory of Computing, April 1983, pp. 258-263.