discovering regular groups of mobile objects using incremental clustering … · 2015-01-22 ·...

Abstract—As technology advances, detailed data on the position of moving objects, such as humans and vehicles, is available. In order to discover groups of mobile objects that usually move in similar ways, we propose an incremental clustering algorithm that clusters mobile objects according to the similarity of their movement patterns. The proposed clustering algorithm uses a new "data-amount-based" similarity measure between mobile trajectories. The clustering algorithm is evaluated on two spatio-temporal datasets using clustering validity measures.

Index Terms—Clustering, Mobile objects, Spatio-temporal data mining.

I. INTRODUCTION

With technological progress, detailed data is available on the location of moving objects (e.g., humans and vehicles) at different times, either via GPS technologies, mobile computer logs, or wireless communication devices. This creates an appropriate basis for developing efficient new methods for mining the movement patterns of moving objects.

Spatio-temporal data can be used for many different purposes. The discovery of patterns in spatio-temporal data, for example, can greatly enhance the knowledge in fields such as animal migration analysis, weather forecasting, and mobile marketing. Clustering spatio-temporal data can also help in social groups' discovery, which is used in tasks like shared data allocation, targeted advertising, and personalization of content and services.

Previous work on mining spatio-temporal data includes querying data using special indexes for efficient performance, recognizing trajectory patterns, and clustering trajectories of closely moving objects as 'moving micro-clusters'. Extensive research has been done on spatial data and temporal data separately. Periodicity has been studied only with time series databases. The field of spatio-temporal mining is relatively young, and requires much more research.

The goal of this work is to discover regular groups of moving objects. In order to achieve this goal we first use a compact representation of a spatio-temporal trajectory and define an algorithm for building it. Then we define a new data-amount-based similarity measure between trajectories according to the proximity of trajectories in time and space. This measure allows the discovery of groups that have similar spatio-temporal behavior on a regular basis. We then cluster objects into groups according to the similarity between their periodic movement patterns. Finally, we evaluate the proposed algorithms by conducting experiments on both synthetic and real data streams.

Manuscript received October 28, 2007. This work was supported in part by Deutsche Telekom AG Labs.

II. RELATED WORK

Clustering is a mature data mining field that we use in this work with some variations in order to fit it to the spatio-temporal environment. Jain et al. [1] present a literature review on the subject of clustering. According to their review, a clustering task involves the phases of pattern representation, definition of a pattern proximity measure, clustering or grouping, data abstraction, and assessment of the output. Grouping can be done in a hard way, where each object is assigned to only one cluster, or in a fuzzy way, where each object can have different membership grades in each cluster. The clustering can be implemented by hierarchical algorithms, which produce a nested series of merging or splitting clusters based on a similarity criterion, or by partitioning algorithms, which identify the partition that optimizes a clustering performance criterion.

Clustering validity assessments are usually objective and are performed to determine whether the output is meaningful, and cannot have occurred by chance. Statistical validation methods include external assessment of validity, comparing the structure to an a priori structure, an internal examination of validity, checking whether the structure is intrinsically appropriate for the data, and a relative test, which compares two structures and measures their relative merits.

There exist several recent publications on clustering moving objects and trajectories. The problem of clustering moving objects is studied by Li et al. [2], who use moving micro-clusters (MMC) for handling very large datasets of mobile objects. A micro-cluster denotes a group of objects that are not only close to each other at the current time, but are also likely to move together for a while. In principle, those moving micro-clusters reflect some closely moving objects, naturally leading to high quality clustering results. The authors of [2] propose incremental algorithms to keep the moving micro-clusters geometrically small by identifying split events when the bounding rectangles reach some boundary and by using knowledge about collisions between the MMCs (splitting or merging MMCs when those events occur). In experiments conducted on synthetic data with K-Means as the generic algorithm used in micro-clustering, MMCs showed improvement in running times compared to NC (normal clustering), though with a slight deterioration in performance. However, MMCs may only help in finding groups that move together for a certain continuous period of time. They are less useful for the task of discovering groups of objects with similar movement patterns, which move together occasionally rather than constantly.

Discovering Regular Groups of Mobile Objects Using Incremental Clustering

Sigal Elnekave, Mark Last, Oded Maimon, Yehuda Ben-Shimol, Hans Einsiedler, Menahem Friedman, Matthias Siebert

978-1-4244-1799-5/08/$25.00 ©2008 IEEE

PROCEEDINGS OF THE 5th WORKSHOP ON POSITIONING, NAVIGATION AND COMMUNICATION 2008 (WPNC’08)

The problem of trajectory clustering is also considered by Nanni et al. [3]. They propose clustering trajectory data using density-based clustering, based on the distance between trajectories. Their OPTICS system uses the reachability distance between points and presents a reachability plot showing objects ordered by visit times (X) against their reachability measure (Y), allowing the users to see the separation into clusters in order to decide on a separation threshold. The authors of [3] propose to cluster patterns by the temporal focusing approach to improve the quality of trajectories. Some changes to OPTICS are suggested, by focusing on the most interesting time intervals instead of examining all intervals, where the interesting intervals are those with the optimal quality of the obtained clusters. A comparison between K-Means, three versions of hierarchical agglomerative clustering, and the trajectory version of OPTICS shows that OPTICS improves purity, with a decrease in completeness.

The SCUBA algorithm of Nehme and Rundensteiner [4] is proposed for efficient cluster-based processing of large numbers of spatio-temporal queries on moving objects. The authors describe an incremental cluster formation technique that efficiently forms clusters at run-time. Their approach utilizes two key thresholds, distance and speed. SCUBA combines motion clustering with shared execution for query execution optimization. Given a set of moving objects and queries, SCUBA groups them into moving clusters based on common spatio-temporal attributes.

To optimize the join execution, SCUBA performs a two-step join execution process by first pre-filtering a set of moving clusters that can produce good results in the join-between moving clusters stage and then proceeding with the individual join-within execution on those selected moving clusters. Experiments show that the performance of SCUBA is better than the traditional grid-based approach where moving entities are processed individually.

Anagnostopoulos et al. [5] use Minimum Bounding Rectangles (MBRs) for approximating trajectories, while defining the distance between two trajectory segmentations s(Ti) and s(Tj) at time t as the distance d between the minimum bounding rectangles at time t, where P(s(T), t) is the projection of the segment of trajectory T at time t on the x axis. Formally:

$$
d(s(T_i), s(T_j), t) = \min_{x_i \in P(s(T_i), t),\; x_j \in P(s(T_j), t)} d(x_i, x_j) \tag{1}
$$

Finally, the distance between two segmentations is the sum of the distances between them at every time instant:

$$
d(s(T_i), s(T_j)) = \sum_{t=0}^{m-1} d(s(T_i), s(T_j), t) \tag{2}
$$

According to [5], the distance between the minimum bounding rectangles is a lower bound of the original distance between the raw data, which is an essential property for guaranteeing correctness of results for most mining tasks.
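As an illustration of Equations (1) and (2), the following minimal Python sketch computes the distance for one-dimensional projections. It assumes that each segmentation is given as a list of (x_min, x_max) intervals, one per time instant, and that d is the absolute difference; the names `interval_distance` and `segmentation_distance` are hypothetical and do not come from [5].

```python
from typing import List, Tuple

# Assumed layout: the x-projection of an MBR at one time instant.
Interval = Tuple[float, float]  # (x_min, x_max)


def interval_distance(a: Interval, b: Interval) -> float:
    """Eq. (1): smallest |x_i - x_j| over x_i in a and x_j in b.

    The result is the gap between the two intervals, or 0 if they overlap.
    """
    return max(0.0, max(a[0], b[0]) - min(a[1], b[1]))


def segmentation_distance(s_i: List[Interval], s_j: List[Interval]) -> float:
    """Eq. (2): sum of the per-instant distances for t = 0..m-1."""
    if len(s_i) != len(s_j):
        raise ValueError("both segmentations must cover the same m time instants")
    return sum(interval_distance(a, b) for a, b in zip(s_i, s_j))
```

Because overlapping projections contribute zero, the sum can never exceed the distance between the raw points inside the rectangles, which is the lower-bound property mentioned above.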

III. AN ALGORITHM FOR BUILDING AN MBB-BASED TRAJECTORY REPRESENTATION

In this paper, we define a periodic spatio-temporal trajectory as a series of data-points traversed by a given moving object during a specific period of time (e.g., one day). Since we assume that a moving object behaves according to some periodic spatio-temporal pattern, we have to determine the duration of each spatio-temporal sequence (trajectory). Thus, in the experimental part of this work, we assume that a moving object repeats its trajectories on a daily basis, meaning that each trajectory describes an object's movement during one day. In the general case, each object should be examined for its periodic behavior in order to determine the length of its period. The training data window is the period used to learn the object's periodic behavior based on its recorded trajectories (e.g., daily trajectories recorded during one month).

Similar to [5], we represent a trajectory as a list of minimal bounding boxes. A minimal bounding box (MBB) represents an interval bounded by limits of time and location. By using this structure we can summarize close data-points into one MBB, such that instead of recording the original data-points, we only need to record the following six elements:

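$$
\begin{aligned}
i.x_{\min} &= \min(\forall m \in i,\; m.x_{\min}), & i.x_{\max} &= \max(\forall m \in i,\; m.x_{\max}),\\
i.y_{\min} &= \min(\forall m \in i,\; m.y_{\min}), & i.y_{\max} &= \max(\forall m \in i,\; m.y_{\max}),\\
i.t_{\min} &= \min(\forall m \in i,\; m.t_{\min}), & i.t_{\max} &= \max(\forall m \in i,\; m.t_{\max})
\end{aligned}
\tag{3}
$$

where i represents an MBB, m represents a member in a box, x and y are spatial coordinates, and t is time. Summarizing a spatio-temporal dataset that records locations of multiple objects at high frequency (e.g., each second) can significantly reduce running times of data mining algorithms.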

We have developed a new algorithm for summarizing a trajectory and setting the MBB bounds. As opposed to earlier MBB-based representations, we add other properties to the standard MBB-based representation that improve our ability to perform operations on the summarized data, like measuring similarity between trajectories or discovering exceptional data points. The additional properties are calculated by aggregation methods:

$$
i.p = \mathrm{aggregation}(\forall m \in i,\; m.\mathrm{state}) \tag{4}
$$

where p stands for the value of a property variable in a minimal bounding box i, m represents a member in a box, and state is the data-point property that is being aggregated. In our algorithm, p represents the number of data points (data#) that are summarized by a given MBB:

$$
i.\mathrm{data\#} = \mathrm{count}(\forall m \in i,\; 1) \tag{5}
$$

Therefore, when segmenting the original trajectories into MBBs, we also count the number of data points summarized by each MBB.
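The bounds of Equation (3) and the data-amount property of Equation (5) can be sketched together in Python. This is a minimal illustration under assumptions: data points are taken to be (x, y, t) tuples, and the `MBB` class and `summarize` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Point = Tuple[float, float, float]  # assumed layout: (x, y, t)


@dataclass
class MBB:
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    t_min: float
    t_max: float
    data_count: int  # the "data#" property of Eq. (5)


def summarize(points: Iterable[Point]) -> MBB:
    """Summarize a group of data points into one MBB (Eq. 3) with its count (Eq. 5)."""
    pts = list(points)
    if not pts:
        raise ValueError("an MBB needs at least one data point")
    xs, ys, ts = zip(*pts)
    return MBB(min(xs), max(xs), min(ys), max(ys), min(ts), max(ts), len(pts))
```

For example, three points recorded between t = 10 and t = 12 collapse into a single record of six bounds plus the count 3.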

As can be seen from the time bounds in Equation (3), we use MBBs having irregular, rather than constant, time intervals. Constant time intervals may facilitate processing operations like measuring similarity between trajectories, but at the price of setting forced bounds. Blurring the real data bounds may cause the separation of close data points and the union of distant data points. Since we would like the preprocessing stage to maintain as much precision as possible, we chose to use irregular time bounds.

A periodic (daily) trajectory of an object is identified by an object ID O and a date D, and it can be stored as a list of n MBBs:

{O1, D1, [t1m, t1M, x1m, x1M, y1m, y1M, N1], [t2m, t2M, x2m, x2M, y2m, y2M, N2], ..., [tnm, tnM, xnm, xnM, ynm, ynM, Nn]} (6)

where t represents time, x and y represent coordinates, m is used for minimal and M for maximal, and N represents the number of data points belonging to each MBB. Figure 1 demonstrates an object's trajectory and its MBB-based representation for a given period.

Fig. 1. An object's trajectory

Incoming data-points update the MBB-based representations in the order of their arrival times. Therefore, the minimal time bound of the first MBB is the time of the earliest data-point in the trajectory, and the maximal time bound of a given MBB is extended until the time or the space distance between the subsequent data-point and the maximal bounds of that MBB reaches some pre-defined segmentation thresholds. When a threshold is exceeded in at least one of the dimensions, a new minimal bounding box is created with the time of the subsequent data-point as its minimal time bound. The larger the threshold is, the more summarized the trajectories are, meaning that we increase the efficiency of the next mining stages (shorter running times) while potentially decreasing their accuracy.

We chose to set time thresholds in order to limit the time range of a given MBB. Removing this threshold would allow the existence of MBBs that span a long period of time even though they may contain a very limited amount of location data on a given object.

Notice that there are two cases for summarizing (segmenting) trajectories:

1. Summarizing raw data collected during a single period (e.g. one day) into one segmentation.

2. Summarizing raw data collected during multiple periods (e.g. 30 days) into one segmentation.

Both cases are summarized in the same manner. The second option leads to coarser summarization since more data is summarized into a single segmentation.

We present an enhanced incremental algorithm for representing an object trajectory as a set of MBBs from a spatio-temporal dataset D covering object movements during a predefined period (e.g., 24 hours).

The algorithm processes each data point in the data stream and inserts it into an existing MBB as long as its distance from the MBB bounds is within the thresholds defined as the algorithm parameters; otherwise it creates a new MBB. In our previous work [6] we analyzed the selection of the threshold and its effect on the summarization resolution.

The "lastMBB" function returns the MBB with the maximal (latest) time bounds in the trajectory, the "addMBB" function initializes a new MBB in the trajectory with bounds and properties updated by the first incoming data-point (on the first arrival, the minimum and the maximum are equal to the data-point values), and the "addPoint" function updates the properties of an existing MBB (bounds and data amount) according to the inserted data-point.

Input: a spatio-temporal dataset (D) that describes object movements along a period of time; thresholds for the x distance, the y distance, and the time duration of an MBB.
Output: a new trajectory (T)
Building an object's trajectory:
  item ← D[1]
  T.addMBB(item)    --First item updates first MBB
  For each item in D    --Except for first item
    if (|item.X - T.lastMBB.maxX| < XdistThreshold
        and |item.Y - T.lastMBB.maxY| < YdistThreshold
        and |item.T - T.lastMBB.maxT| < durationThreshold)
      T.lastMBB.addPoint(item)    --Insert into current MBB
    else
      T.addMBB(item)    --Create MBB when out of thresholds
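A minimal Python sketch of this segmentation step is given below. It follows the pseudocode in comparing each incoming point with the maximal bounds of the latest MBB; the (x, y, t) tuple layout, the `MBB` class, and the function name `build_trajectory` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]  # assumed layout: (x, y, t), ordered by t


@dataclass
class MBB:
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    t_min: float
    t_max: float
    data_count: int

    def add_point(self, x: float, y: float, t: float) -> None:
        """Stretch the bounds to cover the point and count it."""
        self.x_min, self.x_max = min(self.x_min, x), max(self.x_max, x)
        self.y_min, self.y_max = min(self.y_min, y), max(self.y_max, y)
        self.t_min, self.t_max = min(self.t_min, t), max(self.t_max, t)
        self.data_count += 1


def build_trajectory(points: Iterable[Point], x_thr: float,
                     y_thr: float, t_thr: float) -> List[MBB]:
    """Summarize time-ordered points into a list of MBBs in one pass."""
    trajectory: List[MBB] = []
    for x, y, t in points:
        last = trajectory[-1] if trajectory else None
        if (last is not None
                and abs(x - last.x_max) < x_thr
                and abs(y - last.y_max) < y_thr
                and abs(t - last.t_max) < t_thr):
            last.add_point(x, y, t)  # still within all thresholds
        else:
            trajectory.append(MBB(x, x, y, y, t, t, 1))  # open a new MBB
    return trajectory
```

Each point is inspected once and compared only with the latest MBB, so the pass is linear in the number of data points.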

[Fig. 1 axes: X, Y, and T (hours)]

For example, when summarizing the cars dataset we read the data describing each car separately in the order of its sampling times. We summarize the location records into a single MBB as long as they are close enough to each other (i.e., they do not exceed some pre-determined distance threshold from the MBB's current bounds). When a record is too far from its earlier records, we start summarizing it into a newly created MBB.

The computational time complexity of processing n data-points is O(n). At the end of this preprocessing stage we obtain a set of non-overlapping MBBs that we refer to as a “trajectory segmentation”.

IV. SIMILARITY BETWEEN TRAJECTORIES

In the following section we describe algorithms for clustering spatio-temporal items, where our items are mobile objects. In order to cluster such items we have to define a similarity measure that will enable the discovery of similar objects.

We define the similarity between two trajectories as the sum of the similarities between the time overlapping MBBs, divided by the product of the amount of MBBs in each of the compared trajectories, where trajectories may differ in their MBB amounts. The two compared trajectories are described as shown in Figure 2.

Fig. 2. Measuring similarity between two trajectories. Arrows represent minimal distances between two overlapping MBBs.

We suggest a new similarity measure for measuring similarity between two overlapping MBBs, based on the similarity between segments as described in [5]. If we treat each MBB as a segment we can use the following formula, where tm is the time when the two MBBs start to overlap, tn is the time when their overlapping ends and rangeT is the possible range of times for all mobile objects in the training window:

$$
\mathrm{sim}(MBB(T_i), MBB(T_j)) = \mathrm{minDsim}(MBB(T_i), MBB(T_j)) \cdot \frac{|t_m - t_n|}{\mathrm{rangeT}} \tag{7}
$$

The minimal distance and the times tm and tn are described in Figure 3.

Fig. 3. A. Times of overlapping between two MBBs; B. Minimal distance between two MBBs

We define the minimal distance (minDsim) between two MBBs as:

$$
\mathrm{minDsim}(MBB(T_i), MBB(T_j)) = \mathrm{minDsim}_X(MBB(T_i), MBB(T_j)) + \mathrm{minDsim}_Y(MBB(T_i), MBB(T_j)) \tag{8}
$$

where the minimal distance between two MBBs in x and y dimensions, respectively is calculated as the minimal distance between two MBBs in that dimension, or as zero if the two MBBs overlap in that dimension:

$$
\mathrm{minDsim}_X(MBB(T_i), MBB(T_j)) = 1 - \frac{\max\bigl(0,\; \max(MBB(T_i).X_{\min}, MBB(T_j).X_{\min}) - \min(MBB(T_i).X_{\max}, MBB(T_j).X_{\max})\bigr)}{\mathrm{rangeX}} \tag{9}
$$

$$
\mathrm{minDsim}_Y(MBB(T_i), MBB(T_j)) = 1 - \frac{\max\bigl(0,\; \max(MBB(T_i).Y_{\min}, MBB(T_j).Y_{\min}) - \min(MBB(T_i).Y_{\max}, MBB(T_j).Y_{\max})\bigr)}{\mathrm{rangeY}} \tag{10}
$$

rangeX and rangeY are the possible ranges of the X and Y coordinates, respectively, for all mobile objects in the training window. Using the enhanced representation of trajectories we can improve the similarity measure between trajectories as follows. We multiply the minimal-distances measure (7) by the similarity between the amounts of data points of the two compared MBBs (data#sim). Since each MBB summarizes some data points, the more data points are included in both of the compared MBBs, the stronger support we have for their similarity. Our "data-amount-based" similarity is calculated as:

$$
\mathrm{sim}(MBB(T_i), MBB(T_j)) = \mathrm{minDsim}(MBB(T_i), MBB(T_j)) \cdot \frac{|t_m - t_n|}{\mathrm{rangeT}} \cdot \mathrm{data\#sim}(MBB(T_i), MBB(T_j)) \tag{11}
$$

where the similarity between the amounts of data points that are summarized within two MBBs is:

$$
\mathrm{data\#sim}(MBB(T_i), MBB(T_j)) = \min(MBB(T_i).\mathrm{data\#},\; MBB(T_j).\mathrm{data\#}) \tag{12}
$$
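The measure of Equations (7) to (12) can be sketched in Python as follows. This is an illustration under assumptions: the `MBB` class and function names are hypothetical, and the x and y terms are combined by averaging, (simX + simY) / 2, as in the MBB-similarity pseudocode.

```python
from dataclasses import dataclass


@dataclass
class MBB:
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    t_min: float
    t_max: float
    data_count: int  # "data#"


def axis_similarity(lo1: float, hi1: float, lo2: float, hi2: float,
                    axis_range: float) -> float:
    """Eqs. (9)/(10): 1 minus the normalized gap between two intervals."""
    gap = max(0.0, max(lo1, lo2) - min(hi1, hi2))
    return 1.0 - gap / axis_range


def mbb_similarity(a: MBB, b: MBB, x_range: float, y_range: float,
                   t_range: float) -> float:
    """Data-amount-based similarity between two MBBs (Eq. 11)."""
    sim_x = axis_similarity(a.x_min, a.x_max, b.x_min, b.x_max, x_range)
    sim_y = axis_similarity(a.y_min, a.y_max, b.y_min, b.y_max, y_range)
    # |t_m - t_n| / rangeT: normalized duration of the time overlap (0 if none).
    overlap = max(0.0, min(a.t_max, b.t_max) - max(a.t_min, b.t_min))
    sim_t = overlap / t_range
    data_sim = min(a.data_count, b.data_count)  # Eq. (12)
    return (sim_x + sim_y) / 2 * sim_t * data_sim
```

Two MBBs that never overlap in time get a similarity of zero regardless of their spatial proximity, and the result grows with the number of data points that both MBBs summarize.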


The "data-amount-based" similarity measure between two MBBs has time and space complexities of O(1); it is computed by the MBB-similarity pseudocode.

An algorithm for measuring the similarity between two trajectory segmentations has O(m) time complexity in the worst case (and O(1) space complexity), where m is the maximal number of MBBs in each of the two compared trajectories. In the Trajectories-similarity pseudocode, T1 and T2 are the two compared trajectories, MBB1 and MBB2 are the positions of the two currently compared MBBs in T1 and T2 respectively, the MBB# attribute returns the number of MBBs in a given trajectory, and the MBB-similarity function returns the similarity between the two given MBBs.
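The sweep over time-overlapping MBBs can be sketched in Python as below. The sketch assumes that each trajectory is a time-ordered sequence of objects exposing `t_min` and `t_max`, and that the MBB-level similarity is supplied as a callable; `Box`, `trajectory_similarity`, and `mbb_sim` are hypothetical names. A single loop with three branches is used so that neither index can run past the end of its trajectory.

```python
from typing import Any, Callable, NamedTuple, Sequence


class Box(NamedTuple):
    """Stand-in MBB carrying only the time bounds needed by the sweep."""
    t_min: float
    t_max: float


def trajectory_similarity(t1: Sequence[Any], t2: Sequence[Any],
                          mbb_sim: Callable[[Any, Any], float]) -> float:
    """Sum the similarities of time-overlapping MBBs, normalized by both lengths."""
    if not t1 or not t2:
        return 0.0
    i = j = 0
    total = 0.0
    while i < len(t1) and j < len(t2):
        a, b = t1[i], t2[j]
        if a.t_max < b.t_min:      # a ends before b starts: advance in t1
            i += 1
        elif b.t_max < a.t_min:    # b ends before a starts: advance in t2
            j += 1
        else:                      # the two MBBs overlap in time
            total += mbb_sim(a, b)
            if a.t_max < b.t_max:  # drop whichever MBB ends first
                i += 1
            else:
                j += 1
    return total / len(t1) / len(t2)
```

Every iteration advances one of the two indices, so the loop runs at most len(t1) + len(t2) times, matching the O(m) bound stated above.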

The similarity between two objects where each object has one periodic movement pattern can be calculated according to the previous algorithm (similarity between two trajectory segmentations).

In this paper we assume that each mobile object is represented by a single segmentation which is a summarization of its trajectories, or in other words, its movement pattern.

In the general case, where each object may have more than one movement pattern, the similarity between two objects is calculated as the sum of similarities between each two possible movement patterns (segmentations) divided by the product of the numbers of movement patterns in each of the objects (the number of comparisons). In the Objects-similarity pseudocode, O1 and O2 are the two compared objects, T1 and T2 are the two currently compared movement patterns (segmentations) of O1 and O2 respectively, the trajectories-similarity function returns the similarity of the two compared segmentations, and the trajectories# attribute returns the number of trajectories that belong to a given object.

The algorithm's worst case time complexity is O(t²m) (and O(1) space complexity), since all pairs of movement patterns are compared, where t is the maximal number of trajectories among the compared objects O1 and O2, and m is the maximal number of MBBs per trajectory.
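The pairwise averaging over movement patterns can be sketched in a few lines of Python. The sketch assumes that each object is a sequence of segmentations and that a segmentation-level similarity is passed in as a callable; `object_similarity` and `traj_sim` are hypothetical names.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # a segmentation (movement pattern)


def object_similarity(o1: Sequence[S], o2: Sequence[S],
                      traj_sim: Callable[[S, S], float]) -> float:
    """Average similarity over all pairs of movement patterns of two objects."""
    if not o1 or not o2:
        return 0.0
    total = sum(traj_sim(a, b) for a in o1 for b in o2)
    return total / (len(o1) * len(o2))
```

When one of the two objects has a single segmentation, as a cluster centroid does, the double loop degenerates to a single pass over the other object's segmentations.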

For the task of clustering multi-pattern objects according to the similarity between their trajectories, as presented in the following sections, we need to compare the similarity between an object and a cluster centroid. Since our centroid is represented as segmentation, as also defined in the following sections, we compare two objects: one may have several segmentations, and the other (the centroid) is represented by only one segmentation. This reduces the worst case time complexity to O(tm), where t is the amount of segmentations that belong to the object and m is the maximal amount of MBBs in a segmentation.

V. REPRESENTING A CLUSTER'S CENTROID

We represent a cluster centroid as a segmentation, or in other words as a vector of MBBs. Each MBB is an interval that holds information about the upper and lower bounds in each one of the d-dimensions (in our case 2-D) of the location, lower and upper time bounds, and the amount of data-points that are summarized within the MBB.

As opposed to a movement pattern, which is a set of non-overlapping MBBs, a centroid is a set of MBBs that can overlap when several locations (trajectories) are allowed during the same time interval.

During the clustering of spatio-temporal items (trajectories of mobile objects) several similar items are inserted into each cluster. First, the cluster's centroid is initialized to the first inserted item, and after all items are clustered, the cluster centroids need to be updated.

Input: two MBBs (M1, M2); the possible ranges of x, y, t and data-amount (for normalization)
Output: the MBBs similarity measure
MBB-similarity:
  simX = 1 - max(0, max(M1.xmin, M2.xmin) - min(M1.xmax, M2.xmax)) / xRange
  simY = 1 - max(0, max(M1.ymin, M2.ymin) - min(M1.ymax, M2.ymax)) / yRange
  simT = max(0, min(M1.tmax, M2.tmax) - max(M1.tmin, M2.tmin)) / tRange
  sim# = min(M1.data#, M2.data#)
  return ((simX + simY) / 2 * simT * sim#)

Input: two trajectories (T1, T2)
Output: the trajectories similarity measure
Trajectories-similarity:
  similarity = 0
  MBB1 = 0, MBB2 = 0    --Initialization of the currently compared MBBs
  while (MBB1 < T1.MBB# and MBB2 < T2.MBB#)
    if (T1(MBB1).maxT < T2(MBB2).minT)
      MBB1++    --MBB1 before MBB2: move to next MBB in T1
    else if (T2(MBB2).maxT < T1(MBB1).minT)
      MBB2++    --MBB2 before MBB1: move to next MBB in T2
    else
      similarity += MBB-similarity(T1(MBB1), T2(MBB2))
      if (T1(MBB1).maxT < T2(MBB2).maxT)
        MBB1++
      else
        MBB2++    --Proceeding to the next MBB after an overlap
  return similarity / T1.MBB# / T2.MBB#

Input: two objects (O1, O2), each object has several movement patterns
Output: the objects similarity measure
Objects-similarity:
  similarity = 0
  T1 = 0    --Initialization of the first compared trajectory
  while (T1 < O1.trajectories#)    --For each trajectory of O1
    T2 = 0    --Initialization of the second compared trajectory
    while (T2 < O2.trajectories#)    --For each trajectory of O2
      similarity += trajectories-similarity(O1(T1), O2(T2))
      T2++    --Proceeding to the next trajectory of O2
    T1++    --Proceeding to the next trajectory of O1
  return similarity / O1.trajectories# / O2.trajectories#

We defined a new representation for a centroid as a vector of intervals, instead of the common representation of a centroid as a vector of numeric values. The common updating method calculates a cluster centroid as a vector of averages, where the average is calculated according to the values of the vectors that belong to the cluster. We cannot use this averaging technique with our suggested representation of cluster centroids, since averaging interval bounds will lead to an invalid interval, with bounds that are not the real limits of the interval. Therefore we define a new algorithm for updating the cluster centroids using a bounding technique instead of averaging.

We first sort all the MBBs that belong to items in a cluster. MBBs with larger intervals will appear first (sorted by time dimension, then by X dimension and finally by Y dimension). Then, each MBB is added into the cluster's centroid (represented as segmentation) in the following manner:

If the inserted MBB is within the scope of an existing MBB, it only updates the number of summarized data points in the existing MBB.

If the inserted MBB exceeds an existing MBB scope within an allowed distance (less than the pre-determined thresholds) it updates the amount of summarized data points in the existing MBB, and it also updates the exceeded bounds of the existing MBB. Otherwise, the inserted MBB is added as a new MBB.
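The three insertion cases can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure: "exceeding within an allowed distance" is interpreted as the inserted MBB sticking out of an existing MBB by less than the threshold in every dimension, and the names `contains`, `overshoot`, and `update_centroid` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass
class MBB:
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    t_min: float
    t_max: float
    data_count: int


def contains(outer: MBB, inner: MBB) -> bool:
    """True if inner lies entirely within the bounds of outer."""
    return (outer.x_min <= inner.x_min and inner.x_max <= outer.x_max
            and outer.y_min <= inner.y_min and inner.y_max <= outer.y_max
            and outer.t_min <= inner.t_min and inner.t_max <= outer.t_max)


def overshoot(lo: float, hi: float, new_lo: float, new_hi: float) -> float:
    """How far [new_lo, new_hi] sticks out of [lo, hi] (0 if it does not)."""
    return max(lo - new_lo, new_hi - hi, 0.0)


def update_centroid(boxes: Iterable[MBB], x_thr: float, y_thr: float,
                    t_thr: float) -> List[MBB]:
    """Build a centroid by bounding (not averaging) the MBBs of a cluster."""
    # Larger intervals first: sort by time extent, then x extent, then y extent.
    ordered = sorted(boxes, reverse=True,
                     key=lambda b: (b.t_max - b.t_min, b.x_max - b.x_min,
                                    b.y_max - b.y_min))
    centroid: List[MBB] = []
    for box in ordered:
        host = next((c for c in centroid if contains(c, box)), None)
        if host is None:
            host = next(
                (c for c in centroid
                 if overshoot(c.x_min, c.x_max, box.x_min, box.x_max) < x_thr
                 and overshoot(c.y_min, c.y_max, box.y_min, box.y_max) < y_thr
                 and overshoot(c.t_min, c.t_max, box.t_min, box.t_max) < t_thr),
                None)
            if host is None:
                centroid.append(replace(box))  # third case: new centroid MBB
                continue
            # Second case: stretch the exceeded bounds of the existing MBB.
            host.x_min, host.x_max = min(host.x_min, box.x_min), max(host.x_max, box.x_max)
            host.y_min, host.y_max = min(host.y_min, box.y_min), max(host.y_max, box.y_max)
            host.t_min, host.t_max = min(host.t_min, box.t_min), max(host.t_max, box.t_max)
        host.data_count += box.data_count  # first and second cases
    return centroid
```

Because bounds are only ever stretched to cover real MBBs, every interval in the resulting centroid stays valid, which is the reason for preferring bounding over averaging.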

In the centroid updating algorithm (Update-centroids), c is the currently updated cluster, id returns the cluster number, the function empty returns true if the centroid of the cluster is empty, the function add adds an MBB to the updated centroid in new-centroids, and the function withinMBB returns true if a given MBB is inside the bounds of another given MBB. tempMBB holds the MBB that is currently being checked for containing another MBB, the data# attribute holds the number of summarized data-points within the given MBB, the function updateMBB updates the bounds and properties (data#) of a given MBB according to a given inserted MBB, and the function is-last returns true if there is no subsequent MBB in a given centroid.

VI. CHOOSING A CLUSTERING ALGORITHM

In order to perform clustering we had to choose one clustering technique among the wide variety of existing algorithms.

According to Jain et al. [1], the partitioning K-Means algorithm has been used to cluster large data sets. The reasons behind the popularity of the K-Means algorithm are mainly its relatively low time and space complexities. The K-Means time complexity is O(nkl), where n is the number of data-points, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically, k and l are fixed in advance and so the algorithm has linear time complexity in the size of the data set. K-Means is also order-independent; for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

However, the K-Means algorithm is sensitive to initial seed selection and even in the best case, it can produce only hyper-spherical clusters.

Hierarchical algorithms are more versatile, but they have some disadvantages: their time complexity is O(n² log n) and their space complexity is O(n²), because a similarity matrix of size n × n has to be stored.

We chose to use the K-Means algorithm mainly for its reduced time and space complexities. The decision was reinforced by the high cost of similarity measures in the spatio-temporal environment.

Segmentations are a summarized version of the original spatio-temporal data. In the task of clustering trajectories (or objects with a single trajectory), we can cluster t segmentations (or objects), each containing up to m MBBs, with a time complexity of O(mtkl), where k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. The algorithm's space complexity is O(k + mt), where each of the t trajectories requires space for up to m MBBs.

In the task of clustering objects where each object has several segmentations, we can cluster s objects (each with up to t segmentations of up to m MBBs) with a time complexity of O(tmskl) and a space complexity of O(k + mts).

VII. CLUSTERING OBJECTS IN ORDER TO RECOGNIZE REGULAR GROUPS OF OBJECTS

Object clusters contain objects that have similar trajectories or movement patterns. Objects in the same cluster are frequently (in most of the time intervals) close in space. An object cluster represents a group of moving objects that use similar trajectories but do not necessarily move together along the same trajectory all the time.

Input: previous centroids, current clusters
Output: new centroids
Update-centroids:
  For each c in current-clusters
    c.sortByArea()
    For each MBB in c
      if new-centroids[c.id].empty()    --If cluster is empty insert first MBB
        then new-centroids[c.id].add(MBB)
      else
        tempMBB ← new-centroids[c.id].firstMBB()    --Otherwise check
        while (not MBB.withinMBB(tempMBB))    --Can the current centroid's MBB contain the new MBB?
          tempMBB ← new-centroids[c.id].nextMBB()    --If no: check next MBB in the centroid
          if is-last(tempMBB)    --After checking all centroid MBBs
            then new-centroids[c.id].add(MBB)    --Insert a new MBB
                 break while
        if MBB.withinMBB(tempMBB)    --Found a bounding MBB
          then tempMBB.data# = tempMBB.data# + MBB.data#
          else tempMBB.updateMBB(MBB)    --Found partly bounding



In order to recognize groups of mobile objects that are often close in space to one another, we can proceed in one of the following two ways:

1. We can cluster mobile objects according to the similarity between their trajectories in order to find groups of objects with similar trajectories. We first summarize the raw data using our suggested algorithm as described in section III. Each object's data-points (within the training window) are represented as one segmentation, assuming one movement pattern per object. Then we cluster the objects according to their summarized trajectories (segmentations). As explained in section VI, this requires O(tmkl) time when clustering t objects, each containing one summarized trajectory with up to m MBBs, where k is the number of clusters, and l is the number of iterations taken by the algorithm to converge.

2. We can cluster the objects after abnormal behaviors have been removed from the movement patterns or the summarized trajectories. By removing exceptions we significantly decrease the size of the input for the object clustering algorithm, and therefore we improve its efficiency, but at the cost of removing data-points from the input and therefore risking a reduction in accuracy.

We can detect an exceptional MBB by its "data amount" property that records the amount of data points that are summarized within an MBB during the training window. If an object is frequently found in this location at this time, the "data amount" of the MBB will be a large number, but if the object rarely reaches this location at this time, the "data amount" of the MBB will be low. We can use this property for recognizing sparse or exceptional MBBs.
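As a minimal sketch of this idea, the following Python function separates sparse MBBs from regular ones by their data amount. The cut-off `min_count` is an assumption introduced for illustration (the text does not fix a threshold), as are the names `CountedMBB` and `split_exceptional`.

```python
from typing import List, NamedTuple, Sequence, Tuple


class CountedMBB(NamedTuple):
    """Stand-in MBB carrying only a label and its data amount ("data#")."""
    label: str
    data_count: int


def split_exceptional(boxes: Sequence[CountedMBB], min_count: int
                      ) -> Tuple[List[CountedMBB], List[CountedMBB]]:
    """Return (regular, exceptional) MBBs; exceptional ones hold few data points."""
    regular = [b for b in boxes if b.data_count >= min_count]
    exceptional = [b for b in boxes if b.data_count < min_count]
    return regular, exceptional
```

Clustering only the regular MBBs corresponds to the second option above: the input shrinks, at the risk of discarding behavior that is rare but genuine.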

VIII. EXPERIMENTS

A. Evaluation methods

1) Dunn index

The Dunn index [7] measures the overall worst-case compactness and separation of a clustering, with higher values being better:

$$
DI = \frac{D_{\min}}{D_{\max}} \tag{13}
$$

where Dmin is the minimum distance between any two objects in different clusters (separation) and Dmax is the maximum distance between any two items in the same cluster (homogeneity). The numerator captures the worst-case amount of separation between clusters, while the denominator captures the worst-case compactness of the clusters.
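Equation (13) can be computed directly from pairwise distances, as in the following Python sketch. The distance function is passed in as a parameter, so any distance between items can be plugged in; `dunn_index` is a hypothetical name.

```python
from itertools import combinations
from typing import Callable, Hashable, Sequence, TypeVar

T = TypeVar("T")


def dunn_index(items: Sequence[T], labels: Sequence[Hashable],
               dist: Callable[[T, T], float]) -> float:
    """Eq. (13): smallest inter-cluster distance over largest intra-cluster distance."""
    d_min = float("inf")  # worst-case separation
    d_max = 0.0           # worst-case compactness
    for (a, label_a), (b, label_b) in combinations(list(zip(items, labels)), 2):
        d = dist(a, b)
        if label_a == label_b:
            d_max = max(d_max, d)
        else:
            d_min = min(d_min, d)
    if d_min == float("inf") or d_max == 0.0:
        raise ValueError("need at least two clusters and one non-trivial cluster")
    return d_min / d_max
```

For the one-dimensional items 0, 1, 10, 12 split into clusters {0, 1} and {10, 12}, the closest inter-cluster pair is 9 apart and the widest cluster spans 2, giving an index of 4.5.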

2) Rand index

Since clustering is an unsupervised machine learning technique, there is no set of correct answers that can be compared to the clustering results. We can, though, generate data sets by some mechanism that helps build the correct partition into groups, and then test the clustering results using the Rand index, which measures clustering accuracy compared to the ground truth. This index performs a comparison of all pairs of objects in the data set after clustering. An "agreement" (A) is a pair that is together or not together in both the real and the measured clusters; a "disagreement" (D) is the opposite case.

The Rand index is computed as [8]:

$$RI = \frac{A}{A + D} \qquad (14)$$
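The pair-counting definition above can be sketched directly. This is an illustrative implementation of the standard Rand index, taking the ratio of agreeing pairs to all pairs; the function and variable names are our own.

```python
from itertools import combinations


def rand_index(truth, predicted):
    """Fraction of object pairs on which two partitions agree.

    A pair is an agreement if it is together in both partitions
    or separated in both; otherwise it is a disagreement.
    """
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have equal length")
    if len(truth) < 2:
        raise ValueError("need at least two objects to form a pair")
    agree = disagree = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_truth == same_pred:
            agree += 1
        else:
            disagree += 1
    return agree / (agree + disagree)
```

With ground truth `[0, 0, 1, 1]` and clustering `[0, 0, 1, 2]`, five of the six pairs agree, so the index is 5/6. Cluster label names do not matter: `[1, 1, 0, 0]` scores 1.0 against the same ground truth.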

B. Obtaining spatio-temporal data

1) The INFATI dataset

We found only one available real spatio-temporal dataset that fits our purposes, mainly due to privacy issues. The INFATI dataset we used [9] is a real dataset that contains information about a group of cars and their locations during a period of three weeks. The INFATI data contains GPS log-data from 11 cars. This data was collected during December 2000 and January 2001. All cars were driving in the municipality of Aalborg, which includes the city of Aalborg, suburbs, and some neighboring towns. The collected data encompasses a range of 2017 KM × 3263 KM (x × y).

For more than a month, the movement of each car was registered in the car’s database. When a car was moving, its GPS position was sampled every second. The GPS positions were stored in the Universal Transverse Mercator (UTM 32) format. No sampling was performed when a car was parked.

2) The TRAJECTORIES dataset

In order to better control the data and to allow the evaluation of clustering results by comparison with a predefined ground truth, we generated the synthetic TRAJECTORIES dataset. By defining the movement formulas ourselves, we could design the trajectories and make sure that objects belong to their intended group and that they use similar trajectories along several time periods. We used movement formulas of the form:

$$
\begin{aligned}
t_1 &= t_0 + rate + noise \\
x_1 &= x_0 + t_1 v_x + noise \\
y_1 &= y_0 + t_1 v_y + noise
\end{aligned}
\qquad (15)
$$

where $x_0, y_0$ are the previous coordinates, $x_1, y_1$ are the current coordinates, $t_0$ is the time when the previous data point was sampled, and $t_1$ is the time when the current point is sampled. $v_x, v_y$ are the velocities of the movement on the x and y axes (which change along the movement), and rate is the time between samplings. The data is asynchronous. noise is a number randomly chosen from a range that is defined as 15 percent of the data range in the corresponding dimension.
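A generator following equation (15) might look like the sketch below. It assumes the update rule reads t1 = t0 + rate + noise, x1 = x0 + t1·vx + noise, and y1 = y0 + t1·vy + noise, and that noise is drawn uniformly from an interval symmetric around zero. The symmetric interval, the half-width parameters `noise_t`, `noise_x`, `noise_y`, and the function name are assumptions made for illustration.

```python
import random


def generate_trajectory(x0, y0, t0, velocities, rate,
                        noise_x=0.0, noise_y=0.0, noise_t=0.0, rng=None):
    """Generate (x, y, t) samples, one per (vx, vy) pair in velocities.

    Each noise_* argument is the assumed half-width of a uniform noise
    interval centered at zero for the corresponding dimension.
    """
    if rng is None:
        rng = random.Random()
    x, y, t = x0, y0, t0
    points = [(x, y, t)]
    for vx, vy in velocities:
        # Time advances by the sampling rate plus noise (asynchronous data).
        t = t + rate + rng.uniform(-noise_t, noise_t)
        # Position is updated from the previous position and the current time.
        x = x + t * vx + rng.uniform(-noise_x, noise_x)
        y = y + t * vy + rng.uniform(-noise_y, noise_y)
        points.append((x, y, t))
    return points
```

With all noise widths set to zero the output is deterministic, which is convenient for checking the update rule before adding noise. Passing a seeded `random.Random` makes noisy trajectories reproducible.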

The first 13 objects were tracked along 45 days (between day 1 and day 45) and the next 10 objects were tracked along 25 days (between day 1 and day 25). Each object was sampled at least 35 times during each day, and between 950 and 1050 times in total.

The proposed algorithm is evaluated on these two datasets as described in the following sections.

The synthetic TRAJECTORIES dataset was used for testing the algorithms for discovering regular groups. In order to enable measuring the clusters' correctness, we created the trajectories of the 45 objects with a clear distribution into three groups with similar movement patterns: (5,10,15,20,25,30,35,40,45), (1,2,4,6,7,9,11,12,14,16,17,19,21,22,24,26,27,29,31,32,34,36,37,39,41,42,44), and (3,8,13,18,23,28,33,38,43), as can be seen in Figure 4.
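The three groups listed above follow a regular pattern in the object identifiers: the first group holds the multiples of 5, the third holds the identifiers with remainder 3 modulo 5, and the second holds the rest. The sketch below uses that observation to build ground-truth labels for use with the Rand index. The label values 0, 1, 2 and the function name are arbitrary choices for illustration.

```python
from collections import Counter


def ground_truth_group(object_id):
    """Map an object identifier (1..45) to its intended group label."""
    remainder = object_id % 5
    if remainder == 0:
        return 0  # (5, 10, ..., 45)
    if remainder == 3:
        return 2  # (3, 8, ..., 43)
    return 1      # all remaining identifiers


labels = {object_id: ground_truth_group(object_id) for object_id in range(1, 46)}
group_sizes = Counter(labels.values())
```

The resulting group sizes are 9, 27, and 9 objects, matching the three lists in the text.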

Fig. 4. The TRAJECTORIES dataset in a 3-D visualization

C. Evaluation experiments

To evaluate our algorithm for finding regular groups (using our enhanced spatio-temporal clustering algorithm), we clustered mobile objects according to their trajectories (assuming one movement pattern per object) using both the real INFATI dataset and the synthetic TRAJECTORIES dataset. We first summarized each mobile object's data-points along the training window into one segmentation, then removed exceptions from the summarized trajectories, and finally clustered the objects according to the similarity between their trajectories, using our "data-amount-based" similarity measure.

Simulations on the INFATI data were set as described in Table I. The dataset holds information on the locations of 11 cars along a training window of two months.

TABLE I
PARAMETERIZATION OF THE INFATI DATASET

Bounds (for trajectories and centroid segmentation): bound computed from stdev(D), max(D), min(D), and the constant 2.0

Iterations amount (i): 30

Clusters amount (k): 2, 3, 4, 5, 6, 7, 8

Exceptions bound: computed from the amount of datapoints, k, and the constants 10 and 2

TABLE II
PARAMETERIZATION OF THE TRAJECTORIES DATASET

B (for trajectories segmentation and for centroid segmentation): bound computed from stdev(D), max(D), min(D), and the constant 4.0

K-Means iterations amount (i): 20

K-Means clusters amount (k): 2, 3, 4, 5, 6, 7

Exceptions bound: computed from the amount of datapoints, k, and the constants 10 and 2

Note that tuning the parameter k is beyond the scope of this paper. Existing methods [10] estimate the appropriate number of clusters in order to optimize k.

Simulations on the TRAJECTORIES data were set as described in Table II. The dataset holds information on the locations of 45 mobile objects along a training window of one day. Each object belongs to one of three groups according to its movement pattern, and to one of five groups according to its abnormal behavior.

Using the groups recognized by visualization techniques as ground truth led to the following clustering results in the task of clustering the INFATI dataset's objects according to similarity between their movement patterns: 46.8% average Rand index, average Dunn index of 0.99, and average run time of 51 seconds.

The correctness obtained according to the Rand index is relatively low, mainly because our ground truth depends on decisions made by visual inspection, which can be misleading. Our clustering algorithm may have found more accurate groups than the "ground truth" groups, which were identified by visualization techniques and may themselves be inaccurate.

This problem is avoided when using the synthetic TRAJECTORIES dataset, which was created with a clear distribution of its 45 moving objects (each with one movement pattern) into three groups with similar movement patterns.

Simulations on the synthetic TRAJECTORIES dataset led to the following clustering results when clustering objects according to the similarity between their movement patterns: 87.2% average Rand index, average Dunn index of 0.94, and average run time of 9 seconds. As we expected, the correctness according to the Rand index is much higher than on the INFATI dataset.

IX. CONCLUSION

In this work we presented a new way of summarizing periodic spatio-temporal data, including a new similarity measure between summarized trajectories.

In our experiments, the proposed incremental algorithm for clustering spatio-temporal items achieved high cluster validity when clustering regular groups in the synthetic TRAJECTORIES dataset (87.2% average Rand index and an average Dunn index of 0.94).



REFERENCES

[1] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review", ACM Computing Surveys (CSUR), vol. 31(3), 1999, pp. 264-323.

[2] X. Li, J. Han, and S. Kim, "Motion-Alert: Automatic Anomaly Detection in Massive Moving Objects", ISI 2006, pp. 166-177.

[3] M. Nanni and D. Pedreschi, "Time-focused density-based clustering of trajectories of moving objects", JIIS, Special Issue on "Mining Spatio-Temporal Data", vol. 27(3), 2006, pp. 267-289.

[4] R. Nehme and E. A. Rundensteiner, "SCUBA: Scalable Cluster-Based Algorithm for Evaluating Continuous Spatio-Temporal Queries on Moving Objects", EDBT 2006, pp. 1001-1019.

[5] A. Anagnostopoulos, M. Vlachos, M. Hadjieleftheriou, E. Keogh, and P. S. Yu, "Global Distance-Based Segmentation of Trajectories", KDD 2006, pp. 34-43.

[6] S. Elnekave, M. Last, and O. Maimon, "A Compact Representation of Spatio-Temporal Data", to appear in ICDM Workshop on Spatial and Spatio-Temporal Data Mining 2007 (SSTDM07), IEEE, 2007.

[7] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters", J. Cybern., vol. 3, 1973, pp. 32-57.

[8] W. M. Rand, "Objective criteria for the evaluation of clustering methods", Journal of the American Statistical Association, vol. 66, 1971, pp. 846-850.

[9] C. S. Jensen, H. Lahrmann, S. Pakalnis, and J. Runge, "The INFATI Data", Aalborg Univ., TimeCenter Technical Report, 2004, pp. 1-10. Available: http://www.cits.aau.dk/download/INFATI.pdf

[10] D. Pelleg and A. Moore, "X-means: Extending K-means with efficient estimation of the number of clusters", Proc. 17th International Conference on Machine Learning, 2000, pp. 727-734.

