COMP4905 Honours Project
Winter 2010
Finding the Closest Pair in the Query Region
Author: Yan Gao Student#: 100750466
Email: [email protected] Supervisor: Dr. Michiel Smid
Department: School of Computer Science Date: April 15, 2010
ABSTRACT
This project presents a C program that constructs a data structure for finding the closest
pair of points in a query region. To achieve this goal, it combines an algorithm for finding
the closest pair of points in the plane with the range tree data structure. The project has
two parts: the first builds the data structure of candidate closest pairs, and the second
searches it for a given query region. We then measure the running time of each part to
evaluate the data structure's performance.
Acknowledgment
This honours project was done under the supervision of Dr. Michiel Smid. I would like to extend
special thanks to him for all of the help he has given me, and all I have learned from him while
working with him.
Thanks also to Hsin-Yi Chiang for help with the project programming and to Benjamin C. Yan
for help with the writing of the report.
Contents
List of figures
List of tables
1. Introduction
2. Algorithms
    2.1 Converting the edges to the tree
    2.2 Range Search Tree
        2.2.1 Build the Range Search Tree
        2.2.2 1-Dimensional Range Searching
        2.2.3 2-Dimensional Range Searching
    2.3 Compute the graph
3. Program
    3.1 Main function implementation
    3.2 Important functions
4. Running time
    4.1 Convert the graph
    4.2 Build the range search tree
    4.3 Build the data structure
    4.4 Query time
5. Conclusion
6. References
List of figures
Figure 1 Finding the closest pair in the query region
Figure 2 The relationship between edges and the query region
Figure 3 Edge convert rule
Figure 4 Convert the edges to weighted points
Figure 5 Define the range on 1D
Figure 6 1D range tree
Figure 7 1D-tree range search
Figure 8 2D range tree
Figure 9 2D-tree range search
Figure 10 Grid the plane
Figure 11 The relation between cells and query points
Figure 12 Compute the min-edges
Figure 13 Main process of the program
Figure 14 Data structures used in the program
Figure 15 Convert the tree to an array
Figure 16 Function: treeRoot()
Figure 17 Relations between the indexes and the tree nodes
Figure 18 Find the left and right children of each internal node
Figure 19 Function: min_weight()
Figure 20 Running time of converting the graph
Figure 21 Running time of building y-trees
Figure 22 Running time of building tree
Figure 23 Running time of building the data structure
Figure 24 Running time of 1000 query points
Figure 25 2 query points region searching
List of tables
Table 1 Running time of converting the graph
Table 2 Running time of building tree
Table 3 Running time of building the data structure
Table 4 Running time of 1000 query points
1. Introduction
Finding the closest pair of points is a well-known problem in computational geometry: given
n points in a k-dimensional space, find the two points with the smallest distance between
them. [1]
A range tree is a data structure for organizing points, and is useful in multidimensional
key-searching problems such as range searching. It is a special kind of binary search
tree. [2]
This project addresses a problem in 2-dimensional space. We are given a plane with bounded x-
and y-coordinate ranges, n points in the plane with known coordinates, and a query point
chosen at random in the plane; the query region is the area above and to the right of the
query point. The task is to find the closest pair of points inside the query region. In
Fig. 1, the light blue area is the query region and the red line is the shortest edge, which
should be output. The project builds a data structure such that, when a query region is
given, the data structure can be searched to output the answer.
Figure 1 Finding the closest pair in the query region
To build the data structure, the problem is partitioned into two parts: first, convert the
original point set into the set of all edges that can be the shortest edge for some query
region; second, build a data structure over these edges for query searching. For the first
part, we compute the candidate edges based on the planar-case closest-pair algorithm [1],
which runs in O(n log n) time on a single point set, better than the O(n^2) running time of
the brute-force algorithm. For the second part, we use a 2D range tree to store the candidate
edges (each being the smallest distance between two points). The data structure can then be
searched by x- and y-coordinates.
Based on these algorithms, a C program was written to implement the data structure, measure
the running time of the algorithms in a real program, and compare the measured times with the
predicted running times. This analysis of the data structure is helpful for query searching.
In this paper, we introduce the algorithms chosen for the project, discuss some important
functions used in the program, and then analyze the running time of building the data
structure and of searching for a query.
2. Algorithms
2.1 Converting the edges to the tree
To tackle the problem of finding the closest pair, the first step is to convert the edges
into points that can be stored in a tree. Assume we have a graph of edges in the plane, and
that for each edge the coordinates of its two endpoints are known. Given a query region, the
shortest edge inside it should be output. For example, let the plane S be x ∈ [0, RANGE],
y ∈ [0, RANGE], where RANGE is a positive integer. Generate a random query point
q = (xq, yq); the query region is the area with x ∈ [xq, RANGE], y ∈ [yq, RANGE].
Figure 2 The relationship between edges and the query region
In Fig. 2, the edges are categorized into three cases based on the border of the query
region: inside, outside and crossing. Only the inside edges qualify. To determine an edge's
position relative to the query region, we need to check whether both endpoints of the edge
are inside the query region. This means that, when the data structure is built, both
endpoints of each edge must be stored in memory. If the data structure is used for just one
query region, the time to check the two points is not too long. But when the data structure
is used for many random query regions, storing and checking two points per edge doubles the
running time and the memory. Therefore, we use a better method to get the edges we want.
Any edge pq, as shown in Fig. 3, falls into either case (a) or case (b). Given p = (px, py)
and q = (qx, qy), define the point r = (rx, ry) with rx = min(px, qx) and ry = min(py, qy).
In case (a), r is the right-angle vertex of the triangle; in case (b), r is the bottom-left
endpoint of the edge. Because the query region extends up and to the right of the query
point, the edge pq lies inside the query region exactly when r does. Therefore, we can
convert each edge to a point whose weight equals the length of the edge. In this way Fig. 2
is transformed into Fig. 4.
Figure 3 Edge convert rule
Figure 4 Convert the edges to weighted points
Therefore, the problem of finding the minimum edge length becomes the problem of finding the
minimum weight. For the latter problem, we can use the range search tree algorithm.
2.2 Range Search Tree
The range search tree is used to find the minimum among all candidate distances. It is based
on the binary search tree data structure and is implemented in two dimensions. When a query
arrives, we use it as the range for searching.
2.2.1 Build the Range Search Tree
Assume we have a set of points p1, p2, p3, ⋯, pn in one dimension, where each point has a
value called its weight. Given a region [u, v], we want to report the minimum weight of the
points in the region.
Figure 5 Define the range on 1D
A brute-force scan finds the answer in O(n) time per query, but it does not extend
efficiently to higher dimensions. So we choose the binary search tree data structure: the
leaves of the tree T store the points of P, and the internal nodes of T store splitting
values to guide the search. We denote the splitting value stored at a node v by x_v. We
assume that the left subtree of a node v contains all the points smaller than or equal to
x_v, and that the right subtree contains all the points strictly greater than x_v. [1]
Assume we have a set of points P, where each element has a coordinate value and a weight, and
the elements are sorted by coordinate value in increasing order. For example, if
P := {(1:20), (2:13), (3:15), (4:7), (5:32), (6:11), (7:8), (8:22)},
then we get the binary search tree shown in Fig. 6:
Figure 6 1D range tree
Each leaf is an original point of the set P: the upper number is the coordinate value, and
the lower number is the weight of the point. For the internal nodes, the upper number is the
splitting value, and the lower number is the minimum weight in the node's subtree. Fig. 6
also shows that all splitting values are taken from the original points.
2.2.2 1-Dimensional Range Searching
This section is based on the algorithm of de Berg et al. [3]. Given the tree and a query
region [u, ∞), we want to report the minimum weight of the points in the region.
Figure 7 1D-tree range search
First, we search for the split node, which is the root of the smallest subtree that contains
all points in [u, ∞).
FINDSPLITNODE(T, u)
Input: a tree T and a query value u
Output: the node v where the paths to u and to the largest leaf split
1. v ← root(T)
2. if u > the largest point stored in T
3.     return error
4. while v is not a leaf and u > x_v
5.     do v ← rightChild(v)
6. return v
If the query value is within the range of the tree, there is a path from the split node to
the leaf where the search for u ends. At each internal node on the path we check the
direction of the next step, until we reach the end of the path, and then report the minimum
of all collected weights. For example, whenever the path goes to the left child, we collect
the minimum weight of the right child, because the right child's whole subtree lies in the
query region.
1DRANGESEARCH(T, u)
Input: a tree T, a query value u
Output: the minimum weight in the query region
1. v ← FINDSPLITNODE(T, u)
2. if v is a leaf
3.     then report weight(v)
4.     else (* Follow the path to u *)
5.         while v is not a leaf
6.             do if x_v ≥ u
7.                 then report minWeight(rightChild(v))
8.                      v ← leftChild(v)
9.                 else v ← rightChild(v)
10.        report weight(v)
11. return the minimum of all reported weights
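The search can be sketched in C on the example set P. This sketch makes several assumptions that are not taken from the original program: the tree is stored in an array in in-order layout (leaf i at index 2i, a copy of its key as splitting value at index 2i+1), the number of points is a power of two so that the middle index of every range is an internal node, and the `Node` type and function names are hypothetical. The walk starts at the root rather than at the split node, which collects the same weights.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef struct { int key; int minWeight; } Node;   /* hypothetical node layout */

/* Fill minWeight for the subtree stored in t[lo..hi] and return it. */
static int fill_min(Node *t, int lo, int hi) {
    if (lo == hi) return t[lo].minWeight;           /* leaf: its own weight */
    int mid = (lo + hi) / 2;
    int l = fill_min(t, lo, mid - 1);
    int r = fill_min(t, mid + 1, hi);
    t[mid].minWeight = l < r ? l : r;
    return t[mid].minWeight;
}

/* Lay out n sorted points in t[0..2n-2]: leaf i at 2i, splitting value at 2i+1. */
void build_1d(Node *t, const int *keys, const int *weights, int n) {
    assert(t != NULL && n > 0 && (n & (n - 1)) == 0);   /* power of two assumed */
    for (int i = 0; i < n; i++) {
        t[2 * i].key = keys[i];
        t[2 * i].minWeight = weights[i];
        if (i < n - 1) t[2 * i + 1].key = keys[i];  /* max key of the left subtree */
    }
    fill_min(t, 0, 2 * n - 2);
}

/* Minimum weight among points with key >= u, or INT_MAX if there is none. */
int range_min_1d(const Node *t, int n, int u) {
    int lo = 0, hi = 2 * n - 2, best = INT_MAX;
    while (lo < hi) {
        int mid = (lo + hi) / 2;                    /* root of t[lo..hi] */
        if (t[mid].key >= u) {
            /* The whole right subtree is in [u, inf): collect its minimum. */
            int r = t[(mid + 1 + hi) / 2].minWeight;
            if (r < best) best = r;
            hi = mid - 1;                           /* continue to the left */
        } else {
            lo = mid + 1;                           /* left subtree is below u */
        }
    }
    if (t[lo].key >= u && t[lo].minWeight < best) best = t[lo].minWeight;
    return best;
}
```

With the set P above, a query with u = 5 collects the weights 8 (subtree of the points 7 and 8), 11 and 32, and returns 8.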
2.2.3 2-Dimensional Range Searching
This section is also based on the algorithm of de Berg et al. [3]. When the points lie in the
plane and we want the shortest distance in a query region, we need a 2-dimensional range
search tree.
Let P be the set of all points in the plane. The main tree is a range search tree T built on
the x-coordinates of the points in P. For any node v in T, the subset P(v) consists of the
points stored in the subtree rooted at v. For each node v we build a range search tree
T_assoc(v) (also called the y-tree of v) on the y-coordinates of the points in P(v), and v
stores a reference to the root of T_assoc(v).
Figure 8 2D range tree
BUILD2DRANGETREE(P)
Input: a set P of points in the plane
Output: the root of a 2D range search tree
1. Build a range tree T on the x-coordinates of P
2. for each node v in T
3.     P(v) ← the points stored in the subtree of v
4.     build the range tree T_assoc(v) on the y-coordinates of P(v)
5. return the root of T
Once the 2D range tree is built, we can use the algorithm 2DRangeSearch to find the minimum
weight in the rectangular query region. First, we search the main tree for the points whose
x-coordinates are greater than or equal to q_x, which gives the split node and the path to
q_x. Then we walk down from the split node: whenever the next step goes to the left, we take
the right child's y-tree and run a 1D range search on it with q_y. This continues until the
leaf at the end of the path is reached.
Figure 9 2D-tree range search
2DRangeSearch(T, q)
Input: a 2D tree T, a query point q = (q_x, q_y)
Output: the minimum weight in the query region
1. v ← FINDSPLITNODE(T, q_x)
2. if v is a leaf
3.     then if v_y ≥ q_y report weight(v)
4.     else (* Follow the path to q_x *)
5.         while v is not a leaf
6.             do if x_v ≥ q_x
7.                 then w ← rightChild(v)
8.                      report 1DRangeSearch(T_assoc(w), q_y)
9.                      v ← leftChild(v)
10.                else v ← rightChild(v)
11.        if v_y ≥ q_y report weight(v)
12. return the minimum of all reported weights
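A compact C sketch of the two-level search is given below. It is not the project's implementation. To stay short it swaps in a simpler associated structure: instead of a y-tree, each x-tree node keeps its points' y-values in sorted order together with suffix minima of the weights, and the 1D search becomes a binary search. All type and function names are assumptions, and memory is never freed.

```c
#include <assert.h>
#include <float.h>
#include <stdlib.h>

typedef struct { double x, y, w; } WPoint;   /* converted edge: location and weight */

/* x-tree node covering pts[lo..hi): sorted y-values plus suffix minima of weights. */
typedef struct XNode {
    int lo, hi;
    double *ys, *sufmin;            /* sufmin[j] = min weight over ys[j..] */
    struct XNode *left, *right;     /* NULL for a leaf */
} XNode;

static int cmp(double a, double b) { return (a > b) - (a < b); }
static int by_x(const void *a, const void *b) {
    return cmp(((const WPoint *)a)->x, ((const WPoint *)b)->x);
}
static int by_y(const void *a, const void *b) {
    return cmp(((const WPoint *)a)->y, ((const WPoint *)b)->y);
}

/* Build the node for pts[lo..hi); pts must already be sorted by x. */
static XNode *build(const WPoint *pts, int lo, int hi) {
    int n = hi - lo;
    XNode *v = malloc(sizeof *v);
    WPoint *tmp = malloc((size_t)n * sizeof *tmp);
    assert(v != NULL && tmp != NULL);
    v->lo = lo; v->hi = hi;
    v->ys = malloc((size_t)n * sizeof *v->ys);
    v->sufmin = malloc((size_t)n * sizeof *v->sufmin);
    assert(v->ys != NULL && v->sufmin != NULL);
    for (int j = 0; j < n; j++) tmp[j] = pts[lo + j];
    qsort(tmp, (size_t)n, sizeof *tmp, by_y);
    for (int j = n - 1; j >= 0; j--) {
        double next = (j + 1 < n) ? v->sufmin[j + 1] : DBL_MAX;
        v->ys[j] = tmp[j].y;
        v->sufmin[j] = tmp[j].w < next ? tmp[j].w : next;
    }
    free(tmp);
    v->left = v->right = NULL;
    if (n > 1) {
        int mid = lo + n / 2;
        v->left = build(pts, lo, mid);
        v->right = build(pts, mid, hi);
    }
    return v;
}

/* Sort pts by x and build the whole structure (n >= 1). */
XNode *build_2d(WPoint *pts, int n) {
    assert(pts != NULL && n >= 1);
    qsort(pts, (size_t)n, sizeof *pts, by_x);
    return build(pts, 0, n);
}

/* Minimum weight in node v among points with y >= qy, or DBL_MAX if none. */
static double y_query(const XNode *v, double qy) {
    int a = 0, b = v->hi - v->lo;
    while (a < b) {                          /* find the first j with ys[j] >= qy */
        int m = (a + b) / 2;
        if (v->ys[m] >= qy) b = m; else a = m + 1;
    }
    return a < v->hi - v->lo ? v->sufmin[a] : DBL_MAX;
}

/* Minimum weight among points with x >= qx and y >= qy, or DBL_MAX if none. */
double query_2d(const XNode *v, const WPoint *pts, double qx, double qy) {
    double best = DBL_MAX, r;
    while (v->left != NULL) {
        if (pts[v->right->lo - 1].x >= qx) { /* whole right subtree has x >= qx */
            r = y_query(v->right, qy);
            if (r < best) best = r;
            v = v->left;
        } else {
            v = v->right;                    /* whole left subtree has x < qx */
        }
    }
    if (pts[v->lo].x >= qx && (r = y_query(v, qy)) < best) best = r;
    return best;
}
```

The walk visits O(log n) x-tree nodes and does one O(log n) binary search at each, which matches the O(log^2 n) query bound of the y-tree version. Sorting every node's points by y during construction costs O(n log^2 n) in total.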
2.3 Compute the graph
Based on the algorithms above, we can compute the shortest edge in the query region in three
steps:
1. Convert the set of edges E into a set of weighted points P (weight = distance between the
two original points).
2. Build the 2D range tree T for P.
3. Search T for the minimum-weight point in the query region.
This section describes how to compute the graph of edges.
First, we have a set P of all points in the plane S; each point has an x-coordinate and a
y-coordinate. We draw a horizontal line and a vertical line through each point, which gives
an n × n grid (n is the number of points). Every query point lies in the plane, so it must
be in one of the n^2 cells.
Figure 10 Grid the plane
Consider the cell that contains the query point. As long as the query point lies in the
shaded area of the cell (excluding its left and bottom sides), the query region contains the
same set of points. That is, if query u1 gets S1', query u2 gets S2', and query u3 gets S3',
then S1' = S2' = S3'.
Figure 11 The relation between cells and query points
Fig. 11 shows the shaded area of a grid cell; if the query point is in the cell, the result
is always the same as for the left-bottom corner of the grid. So there are at most n^2
distinct results.
Suppose the points are sorted by x-coordinate, and we compute the minimum distance from right
to left. Assume we already know the minimum distance d among the points with indices
[i+1, n-1]. When we move to the next point p on the left, we only need to check the points
whose x-coordinate differs from that of p by less than d, and take the smaller distance as
the new d.
Figure 12 Compute the min-edges
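MINEDGE(P, d)
Input: a set of points P sorted by x-coordinate, where p_0 is the newly added leftmost point;
       the minimum edge length d among the remaining points of P
Output: the minimum edge length in P
1. if size(P) < 2 or p_1.x − p_0.x ≥ d
2.     return d
3. i ← 1
4. while i < size(P) and p_i.x − p_0.x < d
5.     d' ← distance(p_i, p_0)
6.     if d' < d
7.         then d ← d'
8.     i ← i + 1
9. return d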
Once we can find the minimum edge in an x-sorted array, we can compute the graph using both
the x- and y-coordinates. First sort the original points by y-coordinate and take the subsets
from the end of the set towards the beginning. Then sort each subset by x-coordinate and take
its subsets from the right side to the left. Compute the minimum edge of each subset and add
it to an array, removing duplicates at the same time.
COMPUTEGRAPH(P)
Input: a set of points P
Output: the array E of all possible edges
1. Sort P by y-coordinate
2. for i = n − 2 to 0
3.     get the subset P(i) ← {p_i, p_(i+1), ⋯, p_(n−1)}
4.     L ← Sort P(i) by x-coordinate
5.     for k = n − i − 1 to 0
6.         get the subset L(k) ← {l_k, l_(k+1), ⋯, l_(n−i−1)}
7.         min ← findMinEdge(L(k))
8.         add min to E
3. Program
The program is written in the C programming language and was tested on the Linux operating
system.
3.1 Main function implementation
The main function implements the data structure for finding the closest pair in the query
region. It works in 4 steps (Fig. 13).
STEP 1: Assume the plane is a square with an integer RANGE as the maximum x- and y-coordinate
value. The user inputs the total number of points and the plane range, so the plane is a
RANGE × RANGE square. Then the program generates the points: for each point, the “rand()”
function produces two integers, which are divided by RANGE. The remainders are the x- and
y-coordinates of the point; this ensures that all points are in the plane. All points are
stored in the array “*a_h” of type “point”.
STEP 2: Compute the graph. The points in the array “a_h” are processed by the algorithm in
Chapter 2.3, which yields all possible edges. These edges are stored in an array named “a_e”
of type “edge”. In this step we also convert the edges to points, as described in
Chapter 2.1.
STEP 3: Based on the edges, the 2D range search tree is built. The tree is stored as an array
called “tree_a” of type “treeNode”. Now we have the complete data structure.
STEP 4: Using the data structure, we generate query points in the same way as the original
points (described in STEP 1) and search the tree for the minimum weight, which is equivalent
to finding the closest pair in the query region. The answers are output. Because a query
point may lie outside the range of the original points, the data structure may not find an
answer; in that case the answer is “out of range”.
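STEP 1 can be sketched as below. The names “point” and “a_h” follow the text; the function name, its parameters and the explicit seed are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int x, y; } point;   /* mirrors the "point" type described in the text */

/* Fill a_h[0..n) with pseudo-random points whose coordinates lie in [0, range). */
void generate_points(point *a_h, int n, int range, unsigned seed) {
    assert(a_h != NULL && n >= 0 && range > 0);
    srand(seed);
    for (int i = 0; i < n; i++) {
        a_h[i].x = rand() % range;    /* remainder keeps the value inside the plane */
        a_h[i].y = rand() % range;
    }
}
```

Taking the remainder of `rand()` is simple but slightly non-uniform when `RAND_MAX + 1` is not a multiple of the range; for the timing experiments reported later this bias should not matter.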
Figure 13 Main process of the program
The program defines three data types: point, edge, and treeNode:
Figure 14 Data structures used in the program
point
A point represents an original point. It has two integer variables, x and y, which hold the
x-coordinate and y-coordinate respectively.
edge
An edge represents the relation between two points. The variables x1, y1, x2 and y2 hold the
two endpoints of the edge; x3 = min{x1, x2} and y3 = min{y1, y2} give the point obtained by
converting the edge (the algorithm of Chapter 2.1); and the variable weight holds the
distance between the two endpoints, which is also the length of the edge. Strictly speaking,
x1, y1, x2 and y2 are not necessary for the data structure; they are used for checking
whether the algorithm is correct.
treeNode
This data structure is used in the range search tree. The variables x and y store the
location of the converted edge; minWeight is the minimum weight in the node's subtree. If the
node is a leaf, minWeight is its own weight. The variable “leaf” is used as a Boolean: when
it is 1, the node is a leaf of the tree; when it is 0, the node is an internal node. The
variables “left”, “right”, and “parent” hold the indices of the node's left child, right
child and parent. A leaf has no children. The root of the tree has no parent; in this case
we set its parent to -1. The pointer “*y_t” points to the corresponding y-tree. When the
“treeNode” structure is used for a y-tree node, this pointer is not assigned. The variables
“root” and “size” store the root index and size of the node's y-tree. They are provided to
make searching more convenient.
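The three types might be declared as in the following sketch. The field names are taken from the descriptions above, but the field types, the use of -1 for the missing children of a leaf, and the helper `init_leaf` are assumptions, since the original declarations appear only in Fig. 14.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int x, y;                 /* coordinates of an original point */
} point;

typedef struct {
    int x1, y1, x2, y2;       /* the two endpoints, kept only for checking */
    int x3, y3;               /* converted point: (min{x1,x2}, min{y1,y2}) */
    double weight;            /* length of the edge */
} edge;

typedef struct treeNode {
    int x, y;                 /* location of the converted edge */
    double minWeight;         /* minimum weight in this node's subtree */
    int leaf;                 /* 1 for a leaf, 0 for an internal node */
    int left, right, parent;  /* array indices; parent is -1 for the root */
    struct treeNode *y_t;     /* associated y-tree (unused in y-tree nodes) */
    int root, size;           /* root index and size of the y-tree */
} treeNode;

/* Initialise a leaf from a converted edge (hypothetical helper). */
void init_leaf(treeNode *t, const edge *e) {
    assert(t != NULL && e != NULL);
    t->x = e->x3;
    t->y = e->y3;
    t->minWeight = e->weight;
    t->leaf = 1;
    t->left = t->right = -1;  /* assumption: -1 marks "no child" */
    t->parent = -1;           /* set later when the tree is linked */
    t->y_t = NULL;
    t->root = -1;
    t->size = 0;
}
```

Storing indices instead of child pointers lets leaves and internal nodes share one type and one array, which is the design choice the text motivates in the next section.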
3.2 Important functions
Merge sort [4]
Merge sort is used twice in this program: once to sort the original points, and once to sort
the edges. It is an O(n log n) sorting algorithm.
MergeSort(L, s)
Input: an array L and its size s
Output: the sorted array L
if s ≤ 1
    return L;
m = ⌊s/2⌋
left ← L[0, m − 1];
right ← L[m, s − 1];
left = MergeSort(left, m);
right = MergeSort(right, s − m);
L = Merge(left, right);
return L;

Merge(left, right)
Input: two sorted arrays left and right
Output: a sorted array result that contains left ∪ right
Array result;
while size(left) > 0 and size(right) > 0
    if left(first) < right(first)
        append left(first) to result
        left ← left(rest)
    else
        append right(first) to result
        right ← right(rest)
if size(left) > 0
    append left to result
else append right to result
return result
Based on this algorithm, we can sort an array by x-coordinate or by y-coordinate. When
sorting by x-coordinate, two elements with the same x-coordinate are compared by their
y-coordinates; when sorting by y-coordinate, ties are broken by the x-coordinates.
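A C version of the sort by x-coordinate, with ties broken by y, could look like the sketch below. It uses one scratch buffer instead of allocating new arrays at each level; the function names and the `point` layout are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int x, y; } point;

/* Order by x, breaking ties by y. */
static int less_xy(const point *a, const point *b) {
    return a->x < b->x || (a->x == b->x && a->y < b->y);
}

/* Merge the sorted runs a[lo..mid) and a[mid..hi) using tmp as scratch space. */
static void merge(point *a, point *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = less_xy(&a[j], &a[i]) ? a[j++] : a[i++];   /* keeps equal items in order */
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi) tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof *a);
}

static void sort_range(point *a, point *tmp, int lo, int hi) {
    if (hi - lo <= 1) return;             /* zero or one element: already sorted */
    int mid = lo + (hi - lo) / 2;
    sort_range(a, tmp, lo, mid);
    sort_range(a, tmp, mid, hi);
    merge(a, tmp, lo, mid, hi);
}

/* Sort a[0..n) by x-coordinate, then by y-coordinate. */
void merge_sort_by_x(point *a, int n) {
    if (n <= 1) return;
    point *tmp = malloc((size_t)n * sizeof *tmp);
    assert(tmp != NULL);
    sort_range(a, tmp, 0, n);
    free(tmp);
}
```

Sorting by y-coordinate only requires swapping the roles of x and y in `less_xy`.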
Build binary search tree
This function converts the edge array into the treeNode array. Since a child of an internal
node may be either a leaf or an internal node, a pointer-based structure would need two kinds
of data types. Therefore, we use an array to store all tree nodes, and use the array index to
identify each one.
For example, consider the range search tree in Fig. 15(a). If we move the internal nodes down
vertically to the level of the leaves (Fig. 15(b)), we get the array shown in the red
rectangle. If the number of edges is n, the size of the tree array is 2n-1, since the last
edge is not copied twice. So we copy each of the first n-1 elements of the edge array twice
into the tree array, and the last edge once. Every node in the tree must be assigned a value,
and each assignment takes O(1) time, so the running time of building the tree array is O(n).
Figure 15 Convert the tree to an array
Find the root of the tree
This function links the internal nodes of the tree and returns the index of the root node. We
divide the array into two equal parts; the middle node is the root of the tree. The function
then runs recursively to find the root of each subtree.
Figure 16 Function: treeRoot()
Figure 17 Relations between the indexes and the tree nodes
For example (Fig. 17), consider a tree of size 15 stored in the array. The root of the tree
is the node with index 7 = (0 + 14)/2. The internal node “2”, which has index 3, has a
subtree covering the indices 0 to 6. Likewise, the subtree of the internal node “6” contains
the nodes with indices 8 to 14. That means the index of the root of a tree or subtree is the
middle of the indices of its nodes. Therefore, we call the smallest index in a subtree
“left”, the largest index “right”, and the index of the root of this subtree “mid”, with
mid = (left + right)/2.
When the root of a tree is found, the tree is divided into two subtrees. We can find their
roots in the same way: the root of the left subtree is the left child of the root, and the
root of the right subtree is the right child of the root. When a subtree has size 1, it
cannot be divided any further, which means the node is a leaf of the tree (Fig. 18).
Figure 18 Find the left and right children of each internal node
Fig. 18 shows how the function works. The small blue squares are the roots of the tree and of
its subtrees. The root is at the middle of the indices of its subtree's nodes. Once it is
found, the tree is divided into two subtrees, whose roots are found in turn and returned to
the original root as its left child and right child. When the left and right pointers point
to the same node, the node is a leaf and its index is returned to its parent.
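The recursion can be sketched in C as follows. The function is modelled on the description of treeRoot(), but its signature and the reduced `treeNode` type (link fields only) are guesses, and the sketch assumes the number of leaves is a power of two so that every range has odd size.

```c
#include <assert.h>
#include <stddef.h>

/* Only the link fields of the node type are needed here (hypothetical layout). */
typedef struct { int leaf, left, right, parent; } treeNode;

/* Link the subtree stored in t[lo..hi] and return the index of its root. */
int tree_root(treeNode *t, int lo, int hi, int parent) {
    assert(t != NULL && lo <= hi);
    int mid = (lo + hi) / 2;              /* the middle index is the subtree root */
    t[mid].parent = parent;
    if (lo == hi) {                       /* size 1: the node is a leaf */
        t[mid].leaf = 1;
        t[mid].left = t[mid].right = -1;
        return mid;
    }
    t[mid].leaf = 0;
    t[mid].left = tree_root(t, lo, mid - 1, mid);
    t[mid].right = tree_root(t, mid + 1, hi, mid);
    return mid;
}
```

For the array of size 15 in Fig. 17, the call `tree_root(t, 0, 14, -1)` returns 7 and sets the children of node 7 to the indices 3 and 11. Each node is visited once, so the linking takes O(n) time.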
Find the minimum weight of the subtree
In the range search tree data structure, each node has a minWeight value which holds the
minimum weight in its subtree. When the node is a leaf, its subtree is only itself, and
minWeight is its original weight. If the node is an internal node, it has 2 children, and its
minWeight is the smaller of its children's minWeight values. Therefore, we can find the
minWeight of each node recursively.
The function takes the tree array and the root index as input. If the root node is a leaf, it
returns its weight; otherwise it computes the minWeight of its two children and returns the
smaller one. In effect, the function runs from the leaf level up to the root. Each leaf
reports its weight to its parent, the parent takes the smaller of its two children's values
as its minWeight and reports it to its own parent, and so on until the root gets the
minWeight of the whole tree.
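A C sketch of this recursion is shown below. It follows the description of min_weight(), but the exact signature and the reduced `treeNode` layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced node type for this sketch (hypothetical field order). */
typedef struct { int leaf, left, right; double minWeight; } treeNode;

/* Set minWeight for every node in the subtree of `root` and return the root's value. */
double min_weight(treeNode *t, int root) {
    assert(t != NULL && root >= 0);
    if (t[root].leaf) return t[root].minWeight;     /* a leaf reports its own weight */
    double l = min_weight(t, t[root].left);
    double r = min_weight(t, t[root].right);
    t[root].minWeight = l < r ? l : r;              /* smaller of the two children */
    return t[root].minWeight;
}
```

On a tree whose leaves have the weights 20, 13, 15 and 7, the two internal nodes above the leaves receive 13 and 7, and the root receives 7.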
Figure 19 Function: min_weight()
4. Running time
The running times were measured on a computer with an Intel Atom N450 1.66GHz processor and
1GB of RAM, running Ubuntu Linux with kernel version 2.6.19.
4.1 Convert the graph
Converting the graph means going from the original point set to all possible query edges. It
consists of 3 steps. Assuming there are n points in the plane, the running time of each step
is as follows:
1. Generate points
To generate the points, each element of the point array is assigned x and y values. For each
value, we first generate a random integer and then divide it by the plane's RANGE, which
equals the maximum coordinate in the plane; the remainder is the value. All these operations
run in constant time, so each point needs O(1) time, and the total running time is O(n).
2. Merge sort the points by y-coordinate
As discussed in Chapter 3.2, the running time of merge sort is O(n log n).
3. Compute the graph
Based on the algorithm in Chapter 2.3, the pseudocode for computing the graph is:
COMPUTEGRAPH(P)
Input: a set of points P
Output: the array E of all possible edges
1. Sort P by y-coordinate                                       O(n log n)
2. for i = n − 2 to 0                                           O(n)
3.     get the subset P(i) ← {p_i, p_(i+1), ⋯, p_(n−1)}         O(n − i + 1)
4.     L ← Sort P(i) by x-coordinate                            O((n − i) log(n − i))
5.     for k = n − i − 1 to 0                                   O(n − i)
6.         get the subset L(k) ← {l_k, l_(k+1), ⋯, l_(n−i−1)}   O(n − k − i)
7.         min ← findMinEdge(L(k))                              O(n − k − i)
8.         add min to E                                         O(n)
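\[
\begin{aligned}
T(\text{ComputeGraph}) &= O(n \log n) + O(n) \times \{\, O(n - i + 1) + O((n - i)\log(n - i)) \\
&\qquad + O(n - i) \times [\, O(n - k - i) + O(n - k - i) + O(n) \,] \,\} \\
&= O(n \log n) + O(n) \times [\, O(n \log n) + O(n^2) \,] \\
&= O(n \log n) + O(n^2 \log n) + O(n^3) \\
&= O(n^3)
\end{aligned}
\]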
Therefore, the total running time of converting the graph is the sum of the running times of
the three steps:
\[ O(n) + O(n \log n) + O(n^3) = O(n^3) \]
We ran the program with the point counts listed below; each count was tested 10 times and the
average running time was taken. The results are shown in Tab. 1.
Number of points    Running time (seconds)
100                 0.000000
200                 0.029000
500                 0.157000
1000                0.620000
2000                2.708000
5000                18.685000
8000                58.112000
Table 1 Running time of converting the graph
Figure 20 Running time of converting the graph
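The averaging procedure used for these measurements can be sketched with the standard `clock()` function. The report does not say which timer the program used, so this is an assumption, as are the function names and the placeholder workload.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Average CPU time in seconds of `runs` calls to work(arg). */
double average_seconds(void (*work)(void *), void *arg, int runs) {
    assert(work != NULL && runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; i++) work(arg);
    return (double)(clock() - start) / CLOCKS_PER_SEC / runs;
}

/* Placeholder workload: sums the integers below *arg (stands in for a build step). */
void example_work(void *arg) {
    volatile long sum = 0;
    for (int i = 0; i < *(int *)arg; i++) sum += i;
}
```

A call such as `average_seconds(example_work, &n, 10)` mirrors the "10 runs, then average" scheme. A timer of this kind has limited resolution, which may be why the smallest input in Tab. 1 is reported as 0.000000 seconds.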
4.2 Build the range search tree
As described previously, building the 2D range search tree takes 4 steps. Assuming we have n
edges to build the tree from, the running times of the steps are:
1. Generate the edges and sort them by x-coordinate using merge sort: O(n log n)
2. Convert the edges to tree nodes and build the tree array: O(2n) = O(n)
3. Find the tree root and assign the minimum weight to each node in the x-tree: O(n log n)
4. Build the y-tree for each node in the x-tree
Figure 21 Running time of building y-trees
In Fig. 21, the triangle represents the x-tree of size n. The node on the first level (the
root) builds a y-tree of size n, which takes O(n log n) time. The second level has 2 nodes,
each with a subtree of size n/2; the total time to build the second-level y-trees is
\[ 2 \times O\!\left(\frac{n}{2}\log\frac{n}{2}\right) = O\!\left(n\log\frac{n}{2}\right) \le O(n\log n). \]
The third level has 4 nodes, each with a subtree of size n/4; the total time to build the
third-level y-trees is
\[ 4 \times O\!\left(\frac{n}{4}\log\frac{n}{4}\right) = O\!\left(n\log\frac{n}{4}\right) \le O(n\log n). \]
For the i-th level, the number of nodes is 2^{i-1} and each subtree has size n/2^{i-1}, so
the total time to build the y-trees of the i-th level is (number of nodes) × (time to build
one node's y-tree), which is
\[ 2^{i-1} \times O\!\left(\frac{n}{2^{i-1}}\log\frac{n}{2^{i-1}}\right) = O\!\left(n\log\frac{n}{2^{i-1}}\right) \le O(n\log n). \]
Therefore, the running time of each level is O(n log n), and there are log n levels, so the
total running time to build the corresponding y-trees is
\[ O(n\log n) \times \log n = O(n\log^2 n). \]
Number of edges     Running time (seconds)
1000                0.015000
5000                0.083000
10000               0.166000
50000               0.953000
100000              2.073000
500000              14.562000
1000000             37.652000
Table 2 Running time of building tree
Figure 22 Running time of building tree
4.3 Build the data structure
The total time of building the data structure is measured from the original points until the
tree is completely built. Chapter 4.2 shows that building a tree may take a long time when
the number of edges is large. In practice, however, the number of possible edges computed
from an original plane is not that large. Usually the number of edges is less than 100, even
when the number of original points reaches its upper bound. Therefore, the running time of
building the tree is negligible, and the running time of building the data structure is close
to the time of computing the graph. For each number of points, we ran the code 10 times,
recorded the number of edges and the running time, and calculated the averages.
Number of points    Number of edges    Running time (seconds)
100                 23.4               0.004
200                 29.1               0.025
500                 35.4               0.151
1000                46.5               0.629
2000                51.5               2.663
5000                61.7               19.340
8000                63.0               57.948
Table 3 Running time of building the data structure
Figure 23 Running time of building the data structure
Tab. 3 shows that, as the number of points increases, the number of edges increases slowly,
and the running time is close to the values in Tab. 1.
4.4 Query time
The query time follows from the algorithms in Chapter 2.2:
1DRANGESEARCH(T, u)
Input: a tree T, a query value u
Output: the minimum weight in the query region
1. v ← FINDSPLITNODE(T, u)                                O(log n)
2. if v is a leaf
3.     then report weight(v)                              O(1)
4.     else (* Follow the path to u *)
5.         while v is not a leaf                          O(log n)
6.             do if x_v ≥ u
7.                 then report minWeight(rightChild(v))   O(1)
8.                      v ← leftChild(v)                  O(1)
9.                 else v ← rightChild(v)                 O(1)
10.        report weight(v)                               O(1)
11. return the minimum of all reported weights            O(1)
Therefore, the running time of 1DRangeSearch is O(log n).
2DRangeSearch(T, q)
Input: a 2D tree T, a query point q = (q_x, q_y)
Output: the minimum weight in the query region
1. v ← FINDSPLITNODE(T, q_x)                                    O(log n)
2. if v is a leaf
3.     then if v_y ≥ q_y report weight(v)                       O(1)
4.     else (* Follow the path to q_x *)
5.         while v is not a leaf                                O(log n)
6.             do if x_v ≥ q_x
7.                 then w ← rightChild(v)                       O(1)
8.                      report 1DRangeSearch(T_assoc(w), q_y)   O(log n)
9.                      v ← leftChild(v)                        O(1)
10.                else v ← rightChild(v)                       O(1)
11.        if v_y ≥ q_y report weight(v)                        O(1)
12. return the minimum of all reported weights                  O(1)
The loop runs O(log n) times and each iteration performs at most one O(log n) search of a
y-tree. Therefore, the running time of 2DRangeSearch is O(log^2 n).
We built 2D range search trees with 1000, 5000, 10000, 50000, 100000, 500000 and 1000000
edges. For each test, 1000 query points were generated and the total time to answer them was
measured. Each edge count was tested 10 times, and the average answer time was taken.
Number of edges     Running time of 1000 query points (seconds)
1000                0.009
5000                0.013
10000               0.014
50000               0.027
100000              0.029
500000              0.292
1000000             1.022
Table 4 Running time of 1000 query points
Figure 24 Running time of 1000 query points
5. Conclusion
This project implements algorithms for finding the closest pair in a query region.
In this project, only a lower bound was used when searching the binary search tree, but the
algorithms can also be used to search a range that has both lower and upper bounds. In that
case, the query region may be defined by two query points, which are the two ends of a
diagonal of the rectangular query region (Fig. 25).
Figure 25 2 query points region searching
According to the running time analysis, building the data structure takes a long time, but
searching the data structure for a query is very fast. On the whole, the measured running
time curves are close to the theoretical predictions; the differences are caused by memory
access, hardware access, thread management and other machine operations. In this project, we
generated the points randomly, and every search time was measured on a different point set,
so the overall process appears inefficient. In practice, the data structure may be applied to
a specific given graph (such as a database or a map). In that setting, although building the
data structure for a large input takes a long time, the searching process does not.
6. References
[1] Closest pair of points problem [online].
http://en.wikipedia.org/wiki/Closest_pair_of_points_problem
[2] Range tree [online]. http://en.wikipedia.org/wiki/Range_tree
[3] M. de Berg, O. Cheong, M. van Kreveld, M. Overmars. “Orthogonal Range Searching:
Querying a Database”, Computational Geometry: Algorithms and Applications.
Springer-Verlag Berlin Heidelberg, 3rd edition, 2008, pp 96-108.
[4] Merge sort [online]. http://en.wikipedia.org/wiki/Merge_sort