
CLUSTERING SPATIAL DATA IN THE PRESENCE OF OBSTACLES

XIN WANG and HOWARD J. HAMILTON

Department of Computer Science

University of Regina, Regina, Sask, Canada S4S 0A2

{wangx, hamilton}@cs.uregina.ca

Dealing with constraints due to obstacles is an important topic in constraint-based spatial clustering. In this paper, we propose the DBRS_O method to identify clusters in the presence of intersecting obstacles. Without doing any preprocessing, DBRS_O processes the constraints during clustering. DBRS_O also avoids unnecessary computations when obstacles do not affect the clustering result. As well, DBRS_O can find clusters of arbitrary shapes and varying densities, deal with significant non-spatial attributes, and handle large datasets.

Keywords: Spatial Clustering; Constraint-based Clustering; Obstacle.

1. Introduction

Clustering large amounts of data with two spatial dimensions to find hidden patterns or meaningful sub-groups has many applications, such as resource planning, marketing, image processing, animal population migration analysis, and disease diffusion analysis. Dealing with constraints due to obstacles is an important topic in constraint-based spatial clustering. For example, when we analyze cougar population migration, rivers might be barriers to migration. Handling these constraints can lead to effective and fruitful data mining by capturing application semantics [1, 2].

Typically, a clustering task consists of separating a set of objects into different groups according to a measure of goodness, which may vary according to the application. A common measure of goodness is Euclidean distance (i.e., straight-line distance). However, in many applications, the use of Euclidean distance has a weakness, as illustrated by the following example.

Suppose a bank planner is to identify appropriate locations for bank machines for the area shown in the map in Figure 1(a). In the map, a small oval represents a residence.


Each highway acts as an obstacle that separates nearby residences into two groups corresponding to the two sides of the highway. The highways intersect each other. If we do not consider highways as obstacles, simple Euclidean distance can be used to measure the distance between the residences, which gives the four clusters shown in Figure 1(b). However, for pedestrians, highways are obstacles. Since obstacles exist in the area and should not be ignored, simple Euclidean distances among the objects are not appropriate for measuring user convenience when planning locations for bank machines. If obstacles are considered, clustering should give the seven clusters shown in Figure 1(c).


Fig. 1. (a) Map for ATM placements (b) Clusters without considering highways (c) Clusters with highways considered as obstacles

In this paper, we are interested in clustering two-dimensional spatial datasets with the following features. First, obstacles, such as highways, fences, and rivers, may exist in the data. Obstacles are modeled as polygons (either convex or concave) and do not occur densely. Second, obstacles may intersect each other. For example, highways or rivers often cross each other. Third, the clusters may be of widely varying shapes and densities. For example, services are clustered along major streets or highways leading into cities. Fourth, the points, i.e., two- or three-dimensional locations, may have significant non-spatial attributes. Finally, the dataset may be very large, say more than 100 000 points.

We extend the density-based clustering method DBRS [3] to handle obstacles and call the extended method DBRS_O. In the presence of obstacles, DBRS_O first randomly picks one of the points from the dataset and retrieves its neighborhood without considering obstacles. Then it determines whether any obstacles in the neighborhood need to be considered. An obstacle may occur within a neighborhood without affecting the reachability of points from the center point of the neighborhood; in that case, DBRS_O ignores the obstacle. However, if an obstacle separates the neighborhood into multiple connected regions, DBRS_O removes from the neighborhood all points that are not in the same region as the central point. When multiple obstacles appear in a neighborhood, DBRS_O first finds all obstacles whose combination with other obstacles may separate the


neighborhood. For each such obstacle, DBRS_O removes all points whose connection with the center point intersects this obstacle. If the remaining neighborhood is dense enough or intersects an existing cluster, the neighborhood is clustered. Otherwise, the center point is classified as noise. These steps are iterated until all points have been clustered or classified as noise.
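The neighborhood-filtering step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes obstacles are given as line segments, and it uses a standard orientation-based segment-intersection test to drop every neighbor whose straight-line connection to the central point crosses an obstacle. The function names are ours.

```python
import math

def seg_intersect(p1, p2, p3, p4):
    # True iff segments p1p2 and p3p4 properly cross (orientation test)
    def orient(a, b, c):
        v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (v > 0) - (v < 0)
    return (orient(p1, p2, p3) != orient(p1, p2, p4)
            and orient(p3, p4, p1) != orient(p3, p4, p2))

def unobstructed_neighborhood(points, q, eps, obstacle_segments):
    # neighbors of q within Eps whose straight connection to q
    # crosses no obstacle segment
    return [p for p in points
            if math.dist(p, q) <= eps
            and not any(seg_intersect(q, p, a, b)
                        for a, b in obstacle_segments)]
```

With a vertical wall between q = (0, 0) and (1, 0), the point behind the wall is removed from the neighborhood while the unobstructed point survives.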

The remainder of this paper is organized as follows. In Section 2, we discuss three related approaches. In Section 3, the DBRS_O algorithm is introduced and its complexity analysis is given. Experimental results are analyzed in Section 4. Conclusions and suggestions for future research are given in Section 5.

2. Related Work

In this section, we briefly survey previous research on spatial clustering in the presence of obstacles, specifically the COD_CLARANS, AUTOCLUST+, and DBCLuC methods.

2.1. COD_CLARANS

COD_CLARANS [2] was the first partitioning clustering method that could handle obstacle constraints. It is a modified version of the CLARANS partitioning algorithm [5] adapted for clustering in the presence of obstacles. The main idea is to replace the Euclidean distance function between two points with the "obstructed distance," which is the length of the shortest Euclidean path between two points that does not intersect any obstacles. The calculation of the obstructed distance involves three preprocessing steps. The first step is to build a visibility graph, an undirected graph whose vertices correspond to the vertices of the obstacles and in which an edge joins two vertices if and only if the line segment connecting them does not intersect any obstacle. The second preprocessing step is micro-clustering. Instead of representing each point in a micro-cluster individually, COD_CLARANS represents the points by their center and a count of the number of points in the micro-cluster. The third preprocessing step materializes three spatial join indexes: the VV index, the MV index, and the MM index. Materializing the VV index computes the all-pairs shortest paths in the visibility graph; that is, the VV index gives the shortest path, avoiding all obstacles, between every pair of obstacle vertices. Materializing the MV index computes an index entry for every pair of a micro-cluster and an obstacle vertex. The MM index is materialized by computing the obstructed distance between every pair of micro-clusters. However, the MM index may be too large to materialize [2]. These three indexes are used whenever it is necessary to calculate the distance between any two points. The computational cost of the preprocessing steps was ignored in the performance evaluation [2].
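The visibility-graph step can be illustrated with a small stdlib-only sketch. For simplicity it models each obstacle as a set of line segments rather than full polygons, and it treats segments that merely share an endpoint as mutually visible; the function names are ours, not COD_CLARANS's.

```python
import math

def properly_crosses(p1, p2, a, b):
    # True iff segments p1p2 and ab cross; shared endpoints do not count
    def orient(u, v, w):
        x = (v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0])
        return (x > 0) - (x < 0)
    if {p1, p2} & {a, b}:
        return False
    return (orient(p1, p2, a) != orient(p1, p2, b)
            and orient(a, b, p1) != orient(a, b, p2))

def visibility_graph(nodes, obstacle_segments):
    # weighted adjacency map: an edge joins two nodes whose straight
    # connection crosses no obstacle segment
    g = {n: {} for n in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if not any(properly_crosses(u, v, a, b)
                       for a, b in obstacle_segments):
                w = math.dist(u, v)
                g[u][v] = g[v][u] = w
    return g
```

Running this on two query points separated by a wall shows the direct edge missing, while detour edges through the wall's endpoints are present.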

After preprocessing, COD_CLARANS works efficiently on a large number of obstacles. However, for the types of datasets we described in Section 1, the algorithm may not be suitable. First, if the dataset has varying densities, COD_CLARANS's micro-clustering approach may not be suitable for the sparse clusters. For the clusters shown in


Figure 2, the left cluster is much denser than the other two clusters. If we use a small radius (as shown by the larger circles in Figure 2) to micro-cluster the three clusters, the micro-clustering process has little effect on the majority of the points in the two clusters on the right side. However, if we were to pick a larger radius, the micro-clusters would mistakenly merge the two clusters on the right side into one cluster. Secondly, as given, COD_CLARANS was not designed to handle intersecting obstacles, such as those present in our datasets (e.g., the highways shown in Figure 1), because the model used in preprocessing for determining visibility and building the spatial join indexes would need to be changed significantly.

Fig. 2. Micro-clustering is not applicable for clusters of varying densities

2.2. AUTOCLUST+

AUTOCLUST+ [6] is an enhanced version of AUTOCLUST [7] that handles obstacles. AUTOCLUST+ is a graph-based clustering algorithm with which the user does not need to supply parameter values. First, AUTOCLUST+ constructs a Delaunay diagram. A Delaunay diagram is the dual of a Voronoi diagram [8]. A Voronoi diagram is a partitioning of a plane with n points into n convex polygons such that each polygon contains exactly one point and every point in a given polygon is closer to its central point than to any other. In a Delaunay diagram, an edge (called a Delaunay edge) is present between two points if and only if their corresponding Voronoi regions share a boundary. Second, for each point, the standard deviation of the lengths of the edges directly connected (incident) to the point is calculated. A global variation indicator, the average of these standard deviations, is then calculated before considering any obstacles. Third, any edge that intersects any obstacle is deleted. Fourth, AUTOCLUST is applied to the planar graph resulting from the previous steps.
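The per-point variation indicator and the obstructed-edge deletion can be sketched as follows. This is an illustrative sketch that assumes the Delaunay edges are already available as index pairs (e.g., from a computational-geometry library); the helper names are ours, not AUTOCLUST+'s.

```python
import math
from statistics import mean, pstdev

def _crosses(p1, p2, a, b):
    # True iff segments p1p2 and ab properly cross (orientation test)
    def orient(u, v, w):
        x = (v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0])
        return (x > 0) - (x < 0)
    return (orient(p1, p2, a) != orient(p1, p2, b)
            and orient(a, b, p1) != orient(a, b, p2))

def edge_length_variability(points, edges):
    # per-point standard deviation of incident edge lengths, and the
    # global indicator (the mean of those standard deviations)
    incident = {i: [] for i in range(len(points))}
    for i, j in edges:
        d = math.dist(points[i], points[j])
        incident[i].append(d)
        incident[j].append(d)
    local = {i: pstdev(ls) if len(ls) > 1 else 0.0
             for i, ls in incident.items()}
    return local, mean(local.values())

def drop_obstructed_edges(points, edges, obstacle_segments):
    # delete every Delaunay edge that intersects an obstacle
    return [(i, j) for i, j in edges
            if not any(_crosses(points[i], points[j], a, b)
                       for a, b in obstacle_segments)]
```

On a small triangle with a wall cutting one side, only the obstructed edge is deleted.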

When a Delaunay edge traverses an obstacle, the distance between the two end-points of the edge is approximated by a detour path between the two points. The dotted lines in Figure 3 represent Delaunay edges obstructed by the obstacle. The thick, solid lines in Figure 3(a) and (b) illustrate the detour distance and the detour path between point 2 and point 4. However, a detour path cannot always be used to approximate the detour distance: the distance is not defined if no detour path exists between the obstructed points. In the case shown in Figure 3(c), the detour distance between point 2


and point 4 is the same as in case (a). However, no corresponding detour path can be found. Ignoring the detour distance between such points can decrease the quality of the clustering results.


Fig. 3. (a) Detour distance (b) Detour path (c) No detour path

2.3. DBCLuC

DBCLuC [9, 10], which is based on DBSCAN [11], is a density-based clustering algorithm that can handle obstacle constraints. Instead of finding the shortest path between two objects by traversing the edges of the obstacles, DBCLuC determines visibility using obstruction lines. In preprocessing, a convexity test is first applied to all obstacle polygons. The convexity test includes two steps. First, for each vertex of the polygon, an assessment edge is constructed. An assessment edge of a vertex is a line segment whose two end vertices lie on the two edges adjacent to the vertex and which does not intersect the polygon. Second, each vertex is labeled as either a convex vertex or a concave vertex. A vertex is a convex vertex if its assessment edge is interior to the polygon; otherwise, it is a concave vertex. If there is a concave vertex in a polygon, the polygon is a concave polygon; otherwise, it is a convex polygon. For each polygon, the convex vertices are bi-partitioned (divided into two groups) according to an enumeration order, such as clockwise from a specified vertex. For convex polygons, an obstruction line is drawn between each convex vertex in one partition and the corresponding convex vertex in the other partition. To construct a set of obstruction lines for a concave polygon, an admission test must be performed on each candidate obstruction line. If a candidate is not admissible (i.e., it lies totally outside the obstacle polygon or intersects it), it is replaced by a set of obstruction lines generated by finding the shortest path between the two vertices, and the admission test is then repeated on each line in the set.
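The convex/concave labeling can equivalently be done with a standard cross-product orientation test instead of explicit assessment edges. The sketch below is our illustration, not DBCLuC's code, and it assumes the polygon vertices are listed in counter-clockwise order.

```python
def classify_vertices(poly):
    # poly: polygon vertices in counter-clockwise order
    # a positive cross product at a vertex means a left turn, i.e. convex
    labels = []
    n = len(poly)
    for i in range(n):
        a, b, c = poly[i - 1], poly[i], poly[(i + 1) % n]
        cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        labels.append('convex' if cross > 0 else 'concave')
    return labels

def is_convex_polygon(poly):
    # a polygon is convex iff it has no concave vertex
    return all(l == 'convex' for l in classify_vertices(poly))
```

A square is classified as convex, while an arrow-shaped polygon is flagged concave at its notch vertex.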

After preprocessing, DBCLuC is an effective density-based clustering approach for large datasets containing obstacles with many edges. However, constructing obstruction lines is expensive for concave polygons with many vertices, because the complexity is O(|v|^2), where |v| is the number of convex vertices in the obstacles. As well, if reachability between any two points is defined to mean not intersecting any obstruction line, the algorithm will not work correctly for the example shown in Figure 4. The circle


represents the neighborhood of the center point. Point p and the center are blocked by an obstacle with the obstruction lines shown, but the shortest distance between these points is actually less than the radius. DBCLuC will conclude that the points are not reachable, when in fact they are.

Fig. 4. A case where DBCLuC may fail

To summarize, the above three algorithms all work well for datasets with constant density and no intersecting obstacles. However, for datasets with varying densities, with intersecting obstacles, or with cases similar to those discussed above, they are not appropriate.

3. DBRS in the Presence of Obstacles

In this section, we describe a density-based clustering approach called DBRS_O that considers obstacle constraints. First, in Section 3.1, we briefly review DBRS, on which DBRS_O is based. Then, we define a new distance function for use in the presence of obstacles. The connectedness of a neighborhood and the determination of false neighbors are discussed in Sections 3.3 and 3.4, respectively. The handling of multiple obstacles is discussed in Section 3.5. Finally, the DBRS_O algorithm and its complexity analysis are presented in Section 3.6.

3.1. Density-based spatial clustering with random sampling (DBRS)

DBRS is a density-based clustering method with three parameters, Eps, MinPts, and MinPur [3]. DBRS repeatedly picks an unclassified point at random and examines its neighborhood, i.e., all points within a radius Eps of the chosen point. The purity of the neighborhood is defined as the percentage of the neighbor points with the same property as the central point. A core point is a point whose matching neighborhood is dense enough, i.e., it has at least MinPts points and over MinPur percent of matching neighbors. A border point is a neighbor of a core point that is not a core point itself. Points other than core and border points are noise. If the point is not a core point and its neighborhood is disjoint from all known clusters, the point is classified as noise. Otherwise, if any point in the neighborhood is part of a known cluster, this neighborhood is joined to that cluster, i.e., all points in the neighborhood are classified as being part of the known cluster. If neither of these two possibilities applies, a new cluster is begun with this neighborhood.

Figure 5 presents the DBRS algorithm. D is a set of data points. DBRS starts with an arbitrary point q and finds its matching neighborhood qseeds with


D.matchingNeighbors(q,Eps) in line 4. If the size of qseeds is less than MinPts or the purity is less than MinPur, and qseeds does not intersect any known cluster, then q is noise or a border point, but it is tentatively classified as noise (-1) in line 6. If q is a core point, the algorithm checks whether its neighborhood intersects any known cluster. The clusters are organized in a list called ClusterList. If qseeds intersects a single existing cluster, DBRS merges qseeds into this cluster. If qseeds intersects two or more existing clusters, the algorithm merges qseeds and those clusters together, as shown in lines 11-18. Otherwise, a new cluster is formed from qseeds in lines 19-21.

Algorithm DBRS(D, Eps, MinPts, MinPur)

1 ClusterList = Empty;

2 while (!D.isClassified( ))

3 { Select one unclassified point q from D;

4 qseeds = D.matchingNeighbors(q, Eps);

5 if (((|qseeds| < MinPts) or (qseeds.pur < MinPur)) and

(qseeds.intersectionWith(ClusterList) == Empty))

6 q.clusterID = -1; /*q is noise or a border point */

7 else

8 { isFirstMerge = True;

9 Ci = ClusterList.firstCluster;

/* compare qseeds to all existing clusters */

10 while (Ci != Empty)

11 { if ( hasIntersection(qseeds, Ci) )

12 if (isFirstMerge)

13 { newCi = Ci.merge(qseeds);

14 isFirstMerge = False; }

15 else

16 { newCi = newCi.merge(Ci);

17 ClusterList.deleteCluster(Ci);}

18 Ci = ClusterList.nextCluster;

} // while != Empty

/*No intersection with any existing cluster */

19 if (isFirstMerge)

20 { Create a new cluster Cj from qseeds;

21 ClusterList = ClusterList.addCluster(Cj);

} //if isFirstMerge

} //else

} // while !D.isClassified

Fig. 5. DBRS Algorithm

The region query D.matchingNeighbors(q,Eps) in line 4 is the most time-consuming part of the algorithm. A region query can be answered in O(log n) time using


spatial access methods, such as R*-trees [12] or SR-trees [13]. In the worst case, where all n points in a dataset are noise, DBRS performs n region queries, and its time complexity is O(n log n).
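To make the role of the spatial index concrete, here is a minimal k-d tree sketch of a region query. It is only a stand-in for the R*-tree or SR-tree indexes cited above, meant to show why the query can prune most of the dataset instead of scanning all n points; all names are ours.

```python
import math

def build_kdtree(points, depth=0):
    # simple 2-d tree; the split axis alternates with depth
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {'point': points[mid], 'axis': axis,
            'left': build_kdtree(points[:mid], depth + 1),
            'right': build_kdtree(points[mid + 1:], depth + 1)}

def region_query(node, q, eps):
    # all points within Euclidean distance eps of q
    if node is None:
        return []
    found = []
    p, axis = node['point'], node['axis']
    if math.dist(p, q) <= eps:
        found.append(p)
    # descend only into subtrees that can contain qualifying points
    if q[axis] - eps <= p[axis]:
        found += region_query(node['left'], q, eps)
    if q[axis] + eps >= p[axis]:
        found += region_query(node['right'], q, eps)
    return found
```

A query around (0, 0) with eps = 1.2 visits only the subtrees overlapping the query disk and skips the far-away branch.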

DBRS is an efficient algorithm for very large databases. It can handle clusters of varying shapes and densities. In addition, it also handles non-spatial attributes in the database. Details concerning DBRS are given in [3].

To simplify the discussion, in the following we assume that all points have the same non-spatial property. For simplicity of presentation, only 2D examples are shown.

3.2. The distance function in the presence of obstacles

In DBRS, the distance between two points p and q, denoted dist(p, q), is computed without considering obstacles. But when obstacles appear in a neighborhood, the reachability among points may be blocked by obstacles. So a new distance function, called the unobstructed distance, is defined.

Definition 1: The unobstructed distance between two points p and q in a region R, denoted by distR_ob(p, q), is defined as the length of the shortest path between p and q in R that does not intersect any obstacles. The path is composed of a sequence of points (v0, …, vn) in which v0 = p, vn = q, and each vi (0 < i < n) is a vertex of an obstacle. If no such path between p and q can be found within R, then distR_ob(p, q) is defined as ∞.

If the region R is a neighborhood area, the neighbors can be classified into three different types according to the unobstructed distance.

Definition 2: q is called an obvious neighbor of p in the neighborhood area R of p iff dist(p, q) = distR_ob(p, q).

Definition 3: q is called an unobvious neighbor of p in the neighborhood area R of p iff (1) distR_ob(p, q) ≠ ∞ and (2) dist(p, q) ≠ distR_ob(p, q).

Definition 4: q is called a false neighbor of p in the neighborhood area R of p iff distR_ob(p, q) = ∞.

Informally, an obvious neighbor is a neighbor that can be reached from the central point by an unobstructed straight line segment. An unobvious neighbor is a neighbor that is not an obvious neighbor, but that can be reached without obstruction through the obstacle vertices inside the neighborhood. The distance between an unobvious neighbor and the central point may or may not be less than the radius. A false neighbor is a neighbor that cannot be reached from the central point through the obstacle vertices inside the neighborhood; it is not reachable without leaving the neighborhood. The unobstructed distance between a false neighbor and the central point is therefore always greater than Eps.
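The three neighbor types can be distinguished with a small search over the obstacle vertices that fall inside the neighborhood. The sketch below is ours: it models obstacles as line segments, uses an orientation test for blocking, and performs a plain graph search (reachability only, not shortest distance).

```python
import math

def _orient(a, b, c):
    x = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (x > 0) - (x < 0)

def _crosses(p1, p2, a, b):
    # shared endpoints do not count as a crossing
    if {p1, p2} & {a, b}:
        return False
    return (_orient(p1, p2, a) != _orient(p1, p2, b)
            and _orient(a, b, p1) != _orient(a, b, p2))

def _blocked(u, v, obstacle_edges):
    return any(_crosses(u, v, a, b) for a, b in obstacle_edges)

def classify_neighbor(p, q, eps, obstacle_edges):
    # 'obvious': straight line p-q unblocked
    # 'unobvious': reachable via obstacle vertices inside the neighborhood
    # 'false': not reachable without leaving the neighborhood
    if not _blocked(p, q, obstacle_edges):
        return 'obvious'
    inside = [v for e in obstacle_edges for v in e if math.dist(p, v) < eps]
    nodes = [p] + inside + [q]
    # simple graph search over the visibility graph restricted to the region
    frontier, seen = [p], {p}
    while frontier:
        u = frontier.pop()
        if u == q:
            return 'unobvious'
        for v in nodes:
            if v not in seen and not _blocked(u, v, obstacle_edges):
                seen.add(v)
                frontier.append(v)
    return 'false'
```

With a short wall, a point behind it is unobvious (reachable around the wall's endpoints); when the wall's endpoints lie outside the neighborhood, the same point becomes a false neighbor.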

In Figure 6, p is the central point and q1, q2, q3, q4, and q5 are five points in its neighborhood. However, as shown in Figure 6, part of an obstacle intersects the neighborhood. Since dist(p, q1) and dist(p, q2) are less than the radius and no obstacle intervenes, q1 and q2 are obvious neighbors. Although the Euclidean distances from q3 and q4 to p are less than the radius, the straight paths to them are obstructed by the obstacle. We can reach q3 or q4 from p via one of the vertices of the obstacle without leaving the neighborhood. In the diagram, it appears that q3 is reachable from p in a distance less than the radius, while the


corresponding unobstructed distance from p to q4 may be slightly more than the radius. Detailed calculations are needed to determine whether an unobvious neighbor's unobstructed distance is in fact within the radius. Thus q3 and q4 are unobvious neighbors of p. For q5, the obstacle ensures that it is not reachable without leaving the neighborhood, so it cannot be p's neighbor; we call it a false neighbor of p.


Fig. 6. q1 and q2 are p’s obvious neighbors, q3 and q4 are p's unobvious neighbors, and q5 is p's false neighbor

The unobstructed distance between an unobvious neighbor (or a false neighbor) and the central point consists of the distance between the neighbor and the first connection point, the distances between consecutive connection points in the path, and the distance between the last connection point and the central point. All connection points are vertices of obstacles. There may be more than one path between the neighbor and the central point, so the problem is transformed into one of finding the shortest path between the central point and the neighbor through these connection points. Dijkstra's algorithm [14] for finding the shortest path can be used; its complexity is O(|v|^2), where |v| is the total number of vertices in all obstacles. If the obstacles intersect each other, the visibility among those vertices needs to be determined as well, and the complexity of determining the visibility is also O(|v|^2).
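Given the visibility graph over the obstacle vertices, the shortest unobstructed path is a standard shortest-path computation. A heap-based Dijkstra can be sketched as follows; the adjacency-map representation and names are ours.

```python
import heapq
import math

def dijkstra(graph, src, dst):
    # graph: {node: {neighbor: weight}}
    # returns the shortest-path length from src to dst, or infinity
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return math.inf
```

On a graph where the only route from p to q is a detour through vertex a, the returned length is the sum of the two detour legs.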

In this paper, we approximate the unobstructed distance function between the unobvious neighbors and the center by treating them as obvious neighbors within a neighborhood area, so the unobstructed distance function becomes:

distR_ob(p, q) = dist(p, q)   if q is an obvious or unobvious neighbor of p
distR_ob(p, q) = ∞            otherwise

The distance between an obvious neighbor and the central point is the Euclidean distance dist(p, q), which is less than or equal to Eps. Unobvious neighbors are treated as obvious neighbors for practical reasons. First, in density-based clustering methods, the radius Eps is usually set to a very small value to guarantee the accuracy and significance of the result, so it is reasonable to approximate the internal distance within a connected neighborhood. Secondly, as mentioned before, finding the distance between an unobvious neighbor and the center requires O(|v|^2) time, where |v| is the total number of obstacle vertices, while distinguishing unobvious neighbors from obvious neighbors takes only O(log |v|) time. The approximation thus decreases the complexity from O(log |v| + |v|^2) to O(log |v|). We sacrifice some


accuracy but reduce the complexity significantly. However, if the neighborhood is separated into several connected regions, false neighbors must be removed, so the distance from a false neighbor to the central point is defined as ∞.

Given a dataset D, a symmetric distance function dist, the unobstructed distance function distR_ob in a neighborhood R, parameters Eps, MinPts, and MinPur, and a property prop defined with respect to one or more non-spatial attributes, the definitions of a matching neighborhood and reachability in the presence of obstacles are as follows.

Definition 5: The unobstructed matching neighborhood of a point p, denoted by N'Eps_ob(p), is defined as N'Eps_ob(p) = { q ∈ D | distR_ob(p, q) ≤ Eps and p.prop = q.prop }, and its size is denoted as |N'Eps_ob(p)|.

Definition 6: A point p and a point q are directly purity-density-reachable in the presence of obstacles if (1) p ∈ N'Eps_ob(q), |N'Eps_ob(q)| ≥ MinPts, and |N'Eps_ob(q)| / |NEps(q)| ≥ MinPur, or (2) q ∈ N'Eps_ob(p), |N'Eps_ob(p)| ≥ MinPts, and |N'Eps_ob(p)| / |NEps(p)| ≥ MinPur, where NEps(p) is the matching neighborhood of point p without considering obstacles.

Definition 7: A point p and a point q are purity-density-reachable (PD-reachable) in the presence of obstacles, denoted by PDob(p, q), if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly purity-density-reachable from pi in the presence of obstacles.

To determine reachability in the presence of obstacles, we must first determine whether false neighbors may exist within the neighborhood, i.e., whether or not the region is connected. Secondly, we must identify the false neighbors, i.e., we must find which points are separated from the central point. We will discuss these two problems in the next two sections.

3.3. Connectedness of the neighborhood

We determine whether or not a neighborhood is connected by an analysis of obstacles and the neighborhood. There are eight possible spatial relations between two spatial regions [15], namely disjoint, touch, equal, inside (and its inverse contain), cover (and its inverse covered by), and overlap. Since we assume that no data point occurs within an obstacle, we do not consider equal further. Since we are not using a data retrieval technique based on minimum bounded regions (MBRs), we also do not consider cover and its inverse covered by, which are defined with respect to the MBR. Given a neighborhood containing a single obstacle, eight topologically different cases are shown in Figure 7. For each case, the polygon represents an obstacle, the circle represents the neighborhood, and the black dots represent the points being clustered. The obstacle and the neighborhood are two spatial regions. The obstacle may be disjoint from the neighborhood, as shown in (a). The obstacle may touch the neighborhood with one or more obstacle vertices, as shown in (b), or with one or more obstacle edges, as shown in (c). The obstacle may overlap with the neighborhood region with one edge, as shown in (d), or one or more sequences of consecutive edges, as shown in (e). If the obstacle is small with respect to the neighborhood, it may be inside the neighborhood region, as shown in (f). This relationship corresponds to saying that the neighborhood contains the obstacle. For cases (a)-(f), the neighborhood is connected in spite of the presence of the


obstacle. All neighbors are obvious neighbors or unobvious neighbors, neither of which requires special treatment. For cases with two or more intersecting edges, as shown in (g), or with two or more separate sequences of consecutive edges, as shown in (h), the neighborhood is separated by the obstacle into two parts, so some points may be separated from the central point. That is, false neighbors appear only when the neighborhood region is separated into two or more parts.

(a) (b) (c) (d)

(e) (f) (g) (h)

Fig. 7. Possible topologies for a neighborhood and an obstacle

The following theorem addresses this problem.

Theorem 1: Consider a neighborhood containing exactly one obstacle, with vertex set V and edge set E. Let |v| be the number of vertices of the obstacle inside the neighborhood (i.e., strictly less than Eps from the central point) and let |e| be the number of edges inside or intersecting the neighborhood. If |e| ≤ |v| + 1 holds, then the neighborhood is connected.

Proof. We use induction on |v|. Step 1: When |v| = 0 and |e| = 0, as shown in Figure 7(a) and (b), the neighborhood is connected. When |v| = 0 and |e| = 1, as shown in Figure 7(c) and (d), the neighborhood is connected. Step 2: Assume that for some integer k ≥ 0 the theorem holds, i.e., when |v| = k and |e| ≤ k + 1, the neighborhood is connected. For |v| = k + 1, suppose the new vertex q is inserted between two vertices pm and pm+1. We need to connect two edges (+2) to q, from pm and pm+1 respectively, and we need to delete the old edge between pm and pm+1 (-1). So |e| ≤ k + 1 + 2 - 1 = (k + 1) + 1. Since the new vertex is inside the neighborhood, the new edges cannot cut the neighborhood into separate connected regions. So the theorem holds for |v| = k + 1.

The theorem implies that when there are multiple sequences of line segments intersecting the neighborhood, the neighborhood may be separated into multiple connected regions. It gives a sufficient but not necessary condition for connectedness. Using this theorem avoids many unnecessary calculations.
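As an illustration, the test in Theorem 1 reduces to two counts over the obstacle polygon. The following sketch (our own names and code, not the paper's implementation) counts the vertices strictly inside the circular neighborhood and the edges that come within Eps of the central point, and reports whether the sufficient condition |e| ≤ |v| + 1 holds:

```python
import math

def dist_point_segment(p, a, b):
    """Shortest distance from point p to segment ab."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def maybe_connected(center, eps, obstacle):
    """Sufficient test from Theorem 1: True means the neighborhood is
    certainly connected; False means it *may* be separated.
    `obstacle` is the vertex list of a simple polygon."""
    n = len(obstacle)
    # |v|: obstacle vertices strictly inside the neighborhood.
    v = sum(1 for p in obstacle
            if math.hypot(p[0] - center[0], p[1] - center[1]) < eps)
    # |e|: obstacle edges inside or intersecting the neighborhood.
    e = sum(1 for i in range(n)
            if dist_point_segment(center, obstacle[i], obstacle[(i + 1) % n]) < eps)
    return e <= v + 1
```

Because the condition is sufficient but not necessary, a False result only means the more expensive false-neighbor detection of Section 3.4 must be run.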

3.4. False neighbor detection

After determining the connectedness, we can identify the false neighbors by using virtual polygons for obstacles and neighborhoods. A virtual polygon for an obstacle and a neighborhood is a polygon that (1) includes one sequence of line segments generated by intersecting the obstacle with the neighborhood and (2) connects the two end vertices of that sequence without intersecting the neighborhood. If we imagine one such sequence of line segments as part of the virtual polygon drawn with thick lines in Figure 8, the false neighbor lies inside the polygon while the central point lies outside it.

Fig. 8. A neighborhood with a virtual polygon (showing a false neighbor, the central point, and the virtual polygon)

Therefore, the problem of judging whether a point is separated from the central point is transformed into the problem of determining whether the point is inside the virtual polygon (i.e., in a different region from the central point) or outside (i.e., in the same region).

Several methods are known for testing whether a point lies inside a polygon [16]. We use the grid method, which is relatively fast but more memory intensive. The idea is to cast a ray from the test point to the central point and count the number of times the ray crosses a line segment. If the number of crossings is odd, the point is inside the polygon; if it is even, the point is outside. Figure 9 shows five example rays and their crossing counts.

Since the two ends of the ray, i.e., the neighbor and the central point, are both inside the neighborhood, the line segments being crossed can only be the edges of the obstacle.

Fig. 9. Grid method


Therefore, we do not generate a real polygon for counting the crossings. Instead, we only find the sequences of line segments intersecting the neighborhood.
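The parity rule can be sketched as follows (a minimal illustration with our own names; the paper's implementation works on the sequences of line segments produced by intersecting the obstacle with the neighborhood):

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a)."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def segments_cross(p, q, a, b):
    """True if segments pq and ab properly intersect."""
    return (_orient(p, q, a) != _orient(p, q, b) and
            _orient(a, b, p) != _orient(a, b, q))

def is_false_neighbor(seed, center, seq):
    """Parity test: seed is a false neighbor if the straight line from it
    to the central point crosses the obstacle's line-segment sequence an
    odd number of times. `seq` is a polyline [(x0, y0), (x1, y1), ...]."""
    crossings = sum(segments_cross(seed, center, seq[i], seq[i + 1])
                    for i in range(len(seq) - 1))
    return crossings % 2 == 1
```

Since both endpoints of the tested line lie inside the neighborhood, only the obstacle's edges can contribute crossings, as noted above.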

3.5. Multiple obstacles in a neighborhood

To deal with combinations of multiple obstacles, one approach is to merge all obstacles in a preprocessing stage. However, this approach can complicate the problem unnecessarily. First, the result may be a complex polygon, i.e., a polygon consisting of an outline plus optional holes. For the case shown in Figure 10(a), the combination of the four polygons is a complex polygon with a "#" shape and an internal ring; operations on a complex polygon are usually more costly than on a simple polygon. Second, the resulting polygon may have more vertices than the original polygons, which makes further operations more costly because intersection points must also be treated as vertices. For example, in Figure 10(a), the four original simple polygons have 16 vertices in total, but the complex polygon has 32.

In DBRS_O, when multiple obstacles appear in a neighborhood, we first determine, one by one, whether each is a crossing obstacle, i.e., whether it affects the neighborhood's connectedness, using Theorem 1. For each obstacle that does, we remove the false neighbors it obstructs by counting the corresponding crossings. For the case in Figure 10(a), we can thus remove every false neighbor generated by the different obstacles.

For noncrossing obstacles, combinations with other obstacles may affect the connectedness, as shown by the dark obstacle polygons in Figure 10(b) and (c). To detect these cases, we first determine whether a noncrossing obstacle appears in the neighborhood and intersects other obstacles. Since such obstacles have sequences of line segments, we can determine whether they intersect by checking the line segments one by one.

Fig. 10. Multiple intersecting obstacles

To simplify processing, whenever a noncrossing obstacle intersects another obstacle, we remove any neighbor for which the straight line from it to the central point intersects the noncrossing obstacle. Since a neighborhood is a small area, if a noncrossing obstacle intersects other obstacles in the neighborhood, the neighborhood is likely separated by the combination of obstacles. Even if the neighborhood is not separated, the distance from some unobvious neighbors to the central point may exceed the radius. Since the number of noncrossing obstacles intersecting in a neighborhood is typically small, the additional computation required is small.

3.6. Algorithm and complexity analysis

In Figure 11, we present the DBRS_O algorithm. D is the dataset; Eps, MinPts, and MinPur are global parameters; O is the set of obstacles present in the dataset. DBRS_O starts with an arbitrary point q and finds its matching neighborhood in the presence of obstacles, called qseeds, through the function D.matchingProperNeighbors(q, Eps, O) in line 4, which is the only difference from the DBRS algorithm previously given in Figure 5.

Algorithm DBRS_O(D, Eps, MinPts, MinPur, O)
1  ClusterList = Empty;
2  while (!D.isClassified()) {
3    Select one unclassified point q from D;
4    qseeds = D.matchingProperNeighbors(q, Eps, O);
5    if (((|qseeds| < MinPts) or (qseeds.pur < MinPur)) and
         (qseeds.intersectionWith(ClusterList) == Empty))
6      q.clusterID = -1;  /* q is noise or a border point */
7    else {
8      isFirstMerge = True;
9      Ci = ClusterList.firstCluster;
       /* compare qseeds to all existing clusters */
10     while (Ci != Empty) {
11       if (Ci.hasIntersection(qseeds))
12         if (isFirstMerge) {
13           newCi = Ci.merge(qseeds);
14           isFirstMerge = False; }
15         else {
16           newCi = newCi.merge(Ci);
17           ClusterList.deleteCluster(Ci); }
18       Ci = ClusterList.nextCluster; }
       /* no intersection with any existing cluster */
19     if (isFirstMerge) {
20       Create a new cluster Cj from qseeds;
21       ClusterList = ClusterList.addCluster(Cj); }
     } // else
   } // while !D.isClassified

Fig. 11. The DBRS_O algorithm


SetOfPoint::matchingProperNeighbors(q, Eps, O)
1   SequencesOfLineSegments *solsOfAffectOb, solsOfAllOb;
2   int intersections;
3   qseeds = matchingNeighbors(q, Eps);
4   O.IntersectWithCircle(q, Eps, solsOfAffectOb, solsOfAllOb);
5   for every point seed in qseeds
6     for every sequence of line segments sls in solsOfAffectOb {
7       intersections = sls.IntersectWithLine(seed, q);
8       if (intersections % 2 != 0) {
9         delete seed from qseeds;
10        break; } }
11  for every obstacle o which does not affect connectedness
12    if (o.IntersectionWithOthers(solsOfAllOb))
13      for every point seed in qseeds
14        if (o.isPossibleNeighbor(q, seed))
15          delete seed from qseeds;
16  return(qseeds);

Fig. 12. The matchingProperNeighbors function

The function for finding the matching proper and unobvious neighbors is presented in Figure 12. The matchingNeighbors(q, Eps) function in line 3 retrieves all matching neighbors without considering the obstacles [3]. In line 4, following Theorem 1, O.IntersectWithCircle(q, Eps, solsOfAffectOb, solsOfAllOb) records in solsOfAffectOb all sequences of line segments that separate the neighborhood. For each neighbor seed in qseeds and every sequence of line segments sls in solsOfAffectOb, sls.IntersectWithLine(seed, q) in line 7 counts the intersections of sls with the line segment connecting seed and q. If the count is odd, seed is a false neighbor and is deleted from qseeds in line 9. In line 12, o.IntersectionWithOthers(solsOfAllOb) determines whether an obstacle o that does not affect neighborhood connectedness intersects any other obstacle. In lines 13 to 15, every unobvious neighbor whose straight line to the central point intersects o is removed from qseeds.
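A simplified, self-contained sketch of this function (our own names; it assumes the separating sequences of line segments have already been identified, and it omits the multi-obstacle pass of lines 11-15) might look like:

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a)."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def _cross(p, q, a, b):
    """True if segments pq and ab properly intersect."""
    return (_orient(p, q, a) != _orient(p, q, b) and
            _orient(a, b, p) != _orient(a, b, q))

def matching_proper_neighbors(q, eps, points, crossing_seqs):
    """Keep only neighbors of q that are not cut off by a separating
    obstacle sequence: start from the plain Eps-neighborhood (line 3 of
    Figure 12), then delete every seed whose straight line to q crosses
    a sequence an odd number of times (lines 5-10)."""
    qseeds = [p for p in points
              if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps * eps]
    proper = []
    for seed in qseeds:
        false_neighbor = False
        for seq in crossing_seqs:
            crossings = sum(_cross(seed, q, seq[i], seq[i + 1])
                            for i in range(len(seq) - 1))
            if crossings % 2 != 0:
                false_neighbor = True
                break
        if not false_neighbor:
            proper.append(seed)
    return proper
```

For example, with a vertical "wall" at x = 0.5, a neighbor at (0.8, 0) of the central point (0, 0) is dropped as a false neighbor, while (0.3, 0) is kept.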

The region query matchingNeighbors(q, Eps) in line 3 of Figure 12 is the most time-consuming part of the algorithm. A region query can be answered in O(log n) time using an SR-tree. Similarly, checking the intersection of every obstacle with the neighborhood region via O.IntersectWithCircle(q, Eps) requires O(log |v|) time, where |v| is the total number of obstacle vertices. In the worst case, where all n points in a dataset are noise, DBRS_O performs exactly n region queries and n neighborhood intersection queries. Therefore, the worst-case time complexity of DBRS_O in the presence of obstacles is O(n log n + n log |v|). If any clusters are found, it performs fewer region queries and fewer neighborhood intersection queries.


4. Experimental Results

This section presents our experimental results on synthetic and real datasets. All experiments were run on a 500 MHz PC with 256 MB of memory. To improve the efficiency of neighborhood queries, an SR-tree was implemented as the spatial index.

Each synthetic dataset includes three attributes (the x and y coordinates and one non-spatial property) and 10-30 clusters. The synthetic data were generated using a data generator according to specified values for the density, shape, and location of the clusters and the number of points in each cluster. The obstacle datasets were generated manually. The data generator first generated a data point randomly and then checked whether the point fell inside an obstacle polygon; any point inside an obstacle polygon was not inserted into the dataset. The procedure was repeated until the cluster reached the specified number of points. The synthetic datasets used for our experiments are available from our website [17]. For all experiments on synthetic datasets, Eps was 4, MinPts was 10, and MinPur was 0.75; extensive previous experiments with DBRS showed that these values give representative behavior for the algorithm [3]. The accuracy of clustering results was measured as the ratio of the number of points in the discovered cluster to the number of points in the intended cluster. Time was measured as elapsed time on a dedicated computer. Each reported time is the average of 10 runs.
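The rejection-sampling step of the generator can be sketched as follows (our own names and parameters; the real generator also controls cluster shape and density):

```python
import random

def point_in_polygon(p, poly):
    """Even-odd ray casting: cast a horizontal ray from p and count edge
    crossings; an odd count means p lies inside poly."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses the ray's height y.
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside

def generate_cluster(n_points, bbox, obstacles, rng=random.Random(42)):
    """Draw a random point, discard it if it falls inside any obstacle
    polygon, and repeat until the cluster has the requested size."""
    xmin, ymin, xmax, ymax = bbox
    pts = []
    while len(pts) < n_points:
        p = (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        if not any(point_in_polygon(p, ob) for ob in obstacles):
            pts.append(p)
    return pts
```

This guarantees the generated dataset satisfies the paper's assumption that no data point occurs within an obstacle.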

Figure 13 shows a 150k dataset with 10% noise and the clusters found by DBRS and DBRS_O. The data has clusters of various shapes and densities. Obstacles analogous to 7 highways or rivers and 2 triangle-like lakes split every cluster. Figure 13(a) shows the 8 clusters found by DBRS, which does not consider the presence of obstacles. Figure 13(b) shows the 28 clusters found by DBRS_O, which does.

Table 1 shows the results of scalability experiments for DBRS_O and DBRS. The elapsed time is measured while varying the size of the datasets from 25k to 200k and the number of obstacles from 5 to 20. Each obstacle has 10 vertices, so the number of obstacle vertices varies from 50 to 200. The second column of Table 1 gives the run time for DBRS; the remaining columns list the run times for DBRS_O in the presence of obstacles. The run times for DBRS are slightly less than those for DBRS_O. For DBRS_O, the run time increases slightly with the number of obstacles for most datasets. However, for the datasets with 150k and 200k points and 5 obstacles, the run time is greater than for the same datasets with 10 to 20 obstacles. The run time is affected by several factors: the size of the dataset, the number of region queries, the densities of the neighborhoods that are examined, and the number and location of obstacle vertices. For these two cases, since the other factors are fixed or change only slightly, the density of the neighborhoods may be the major factor increasing the run time. If more points from denser clusters are examined, then more time is required to find the intersections with existing clusters, to merge their neighborhoods into the existing clusters, and to remove the false neighbors.


Fig. 13. Clustering results with and without obstacles: (a) DBRS; (b) DBRS_O

Figure 14 shows the number of clusters discovered by DBRS and DBRS_O in datasets with sizes ranging from 25k to 200k with 20 obstacles (200 obstacle vertices). The number of clusters discovered by DBRS ranges from 7 (in the 75k dataset) to 21 (in the 200k dataset), while the number discovered by DBRS_O ranges from 27 (in the 75k dataset) to 54 (in the 200k dataset). This indicates that when obstacles are present, DBRS is less accurate than DBRS_O, as expected, since it is unaware of the obstacles.

       0 Obstacles  5 Obstacles    10 Obstacles    15 Obstacles    20 Obstacles
       (DBRS)       (50 Vertices)  (100 Vertices)  (150 Vertices)  (200 Vertices)
25k     18.5         20.2           20.9            21.4            21.9
50k     40.4         44.0           44.4            45.1            45.6
75k     54.0         60.5           61.8            62.1            62.6
100k    85.3         95.1           96.4            97.9            99.1
125k    96.3        107.8          108.4           109.5           111.1
150k   117.4        130.1          125.3           126.8           129.3
175k   136.7        151.0          152.8           154.6           157.1
200k   272.4        302.9          291.6           293.6           295.9

Table 1. Elapsed time for DBRS and DBRS_O (seconds)

Fig. 14. Number of clusters discovered (x-axis: size of datasets, 25k-200k; y-axis: number of clusters discovered; series: DBRS, DBRS_O)

We also compared the runtime of DBRS_O, AUTOCLUST+, and DBCLuC*. AUTOCLUST+ was the only implemented system available for comparison. We re-implemented DBCLuC as DBCLuC* because its authors' code was no longer available. Since both DBRS and DBCLuC are descendants of DBSCAN, we were able to reuse portions of our code, including the use of SR-trees, in DBCLuC*. For the same input, the three algorithms found identical clusters. Figure 15 shows that AUTOCLUST+ and DBCLuC* are slower than DBRS_O. In fairness, the transfer time between AUTOCLUST+'s graphical user interface and its algorithm modules may increase its runtime to some degree. Also, although we did our best to implement DBCLuC* efficiently, we may not have matched the performance of the original implementation. Crucially, DBRS_O had no implementation advantage over DBCLuC*, since all low-level functions were the same. We did not compare with COD_CLARANS because its code was unavailable to us, but previous research has demonstrated that AUTOCLUST+ has a faster run time and better clustering results than COD_CLARANS [6].


Fig. 15. Run time comparison among DBRS_O, AUTOCLUST+, and DBCLuC* (x-axis: size of datasets, 25k-200k; y-axis: time in seconds)

Another series of experiments assessed the effectiveness of using Theorem 1 to eliminate computation for neighborhoods that have no crossing obstacles. Table 2 lists the number of separated neighborhoods and the number of obstructed neighborhoods for varying dataset sizes and numbers of obstacle vertices. An obstructed neighborhood is a neighborhood that intersects at least one obstacle; a separated neighborhood is an obstructed neighborhood that is separated into two or more connected regions by obstacles. Table 2 shows that fewer than half of the obstructed neighborhoods in our datasets are separated, and this fraction decreases as the number of obstacles increases. Thus, applying Theorem 1 avoids at least half of the false-neighbor detection effort, and more than half for large numbers of obstacles.

       5 Obstacles          10 Obstacles         15 Obstacles         20 Obstacles
       (50 Vertices)        (100 Vertices)       (150 Vertices)       (200 Vertices)
       Separated Obstructed Separated Obstructed Separated Obstructed Separated Obstructed
25k    125        298       139        445       154        577       162        632
50k    148        504       165        813       174       1002       185       1121
75k    195        527       247        874       257       1068       262       1205
100k   174        643       198       1013       245       1392       257       1531
125k   252        711       307       1157       351       1540       369       1690
150k   224        599       275        978       319       1361       326       1479
175k   271        724       330       1155       358       1546       364       1708
200k   491       1771       569       2743       621       3296       629       3597

Table 2. Number of separated and obstructed neighborhoods

We also tested DBRS_O on a real dataset, consisting of the population of the province of Saskatchewan from the 1996 census of Canada. The population and dwelling counts are provided by individual postal code. The postal codes were transformed to longitude and latitude using GeoPinPoint [18], giving 11530 records for testing. We set Eps to 0.2 and MinPts to 5. DBRS took 3.52 seconds to find 15 clusters using 618 region queries with no obstacles. We then used highways as obstacles, with the same Eps and MinPts; there were 32 highways with 177 vertices in total. DBRS_O took 3.90 seconds to find 31 clusters using 673 region queries. Figure 16 shows the clustering results. Figure 16(a) shows the original data distribution. Figures 16(b) and (d) show the clustering results of DBRS for the cities of Regina and Saskatoon when highways are not considered obstacles; the population of each city is dense enough to form one big cluster. When the highways are treated as obstacles, each big cluster is divided into several smaller clusters, as shown in Figures 16(c) and (e).

5. Conclusion

In this paper, we proposed a new constrained spatial clustering method called DBRS_O, which clusters two-dimensional spatial data in the presence of obstacle constraints. Without doing any preprocessing, DBRS_O processes the constraints during the clustering process. These features make DBRS_O more suitable than other algorithms for obstacles with many short edges. Additionally, DBRS_O can find clusters with arbitrary shapes and varying densities. Our experiments also show that DBRS_O works well on both synthetic and real datasets.

The method proposed in this paper is based on DBRS. In future research, we will extend similar ideas to other density-based methods, such as DBSCAN, and to grid-based methods, such as STING [19]. Currently, some inefficiency remains in processing neighborhoods that contain multiple intersecting, noncrossing obstacles, due to the complexity of checking for intersections among the obstacles. Thus, DBRS_O might not be suitable for datasets with many tiny obstacles or with intersecting obstacles where the intersection occurs close to the obstacle vertices. This inefficiency might be reduced by using more efficient polygon operations.

Acknowledgements

We thank Vladimir Estivill-Castro and Ickjai Lee for lending us their implementation of AUTOCLUST+. We also thank Jörg Sander for providing us with the code for DBSCAN.


Fig. 16. Clustering results for Saskatchewan population data: (a) original data distribution; (b) Regina; (c) Regina with highways as obstacles; (d) Saskatoon; (e) Saskatoon with highways as obstacles


References

[1] A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng, Constraint-Based Clustering in Large Databases, Proc. 2001 Intl. Conf. on Database Theory (2001) 405-419.

[2] J. Han, L. V. S. Lakshmanan, and R. T. Ng, Constraint-Based Multidimensional Data Mining, Computer 32(8) (1999) 46-50.

[3] X. Wang and H. J. Hamilton, DBRS: A Density-Based Spatial Clustering Method with Random Sampling, Proc. of the 7th PAKDD (2003) 563-575.

[4] A.K.H. Tung, J. Hou, and J. Han, Spatial Clustering in the Presence of Obstacles, Proc. 2001 Intl. Conf. On Data Engineering (2001) 359-367.

[5] R. Ng, and J. Han, Efficient and Effective Clustering Method for Spatial Data Mining, Proc. of 1994 Intl. Conf. on Very Large Data Bases, (1994) 144-155.

[6] V. Estivill-Castro and I. J. Lee, AUTOCLUST+: Automatic Clustering of Point-Data Sets in the Presence of Obstacles, Proc. of Intl. Workshop on Temporal, Spatial and Spatio-Temporal Data Mining, (2000) 133-146.

[7] V. Estivill-Castro and I. J. Lee, AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets, Proc. of the 5th Intl. Conf. On Geocomputation, (2000) 23-25.

[8] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications, Springer-Verlag, (1997).

[9] O. R. Zaïane, and C. H. Lee, Clustering Spatial Data When Facing Physical Constraints, Proc. of the IEEE International Conf. on Data Mining, Maebashi City, Japan, (2002) 737-740.

[10] C. H. Lee, Density-Based Clustering of Spatial Data in the Presence of Physical Constraints, Master Thesis, University of Alberta, Canada (2002).

[11] M. Ester, H. Kriegel, J. Sander, and X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proc. of 2nd Intl. Conf. on Knowledge Discovery and Data Mining, (1996) 226-231.

[12] N. Beckmann, H-P. Kriegel, R. Schneider, and B. Seeger, The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles, SIGMOD Record, vol. 19, no. 2, (1990) 322-331.

[13] N. Katayama and S. Satoh, The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries, Proc. of ACM SIGMOD, (1997) 369-380.

[14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, The MIT Press, (2001).

[15] M. Egenhofer, A Formal Definition of Binary Topological Relationships, Third International Conference on Foundations of Data Organization and Algorithms (FODO), Paris, France, W. Litwin and H. Schek (eds.), Lecture Notes in Computer Science, Vol. 367, Springer-Verlag, (1989) 457-472.

[16] J. Arvo (ed.), Graphics Gems II, Academic Press, Boston (1991).

[17] http://www.cs.uregina.ca/~wangx

[18] GeoPinPoint. http://www.dmtispatial.com/geocoding_software.html

[19] W. Wang, J. Yang, and R. Muntz, STING: A Statistical Information Grid Approach to Spatial Data Mining, Proc. of 23rd VLDB, (1997) 186-195.