road-network aware trajectory clustering: integrating …lingliu/papers/2014/neat-tmc.pdf1...

1

Road-Network Aware Trajectory Clustering:Integrating Locality, Flow and Density

Binh Han, Ling Liu, Edward OmiecinskiCollege of Computing, Georgia Institute of Technology

{binhhan,lingliu,edwardo}@gatech.eduF

Abstract—Mining trajectory data has been gaining significant in-terest in recent years. However, existing approaches to trajectoryclustering are mainly based on density and Euclidean distancemeasures. We argue that when the utility of spatial clustering ofmobile object trajectories is targeted at road-network aware location-based applications, density and Euclidean distance are no longer theeffective measures. This is because traffic flows in a road networkand the flow-based density characterization become important fac-tors for finding interesting trajectory clusters. We propose NEAT�aroad-network aware approach for fast and effective clustering oftrajectories of mobile objects travelling in road networks. Our ap-proach carefully considers the traffic locality characterized by thephysical constraints of the road network, the traffic flow amongconsecutive road segments, and the flow-based density to organizetrajectories into spatial clusters in a comprehensive three-phaseclustering framework. NEAT discovers spatial clusters as groups ofsub-trajectories which describe both dense and highly continuousflows of mobile objects. We perform extensive experiments with mo-bility traces generated using different scales of real road networks.Experimental results demonstrate the flexibility of the NEAT systemand show that NEAT is highly accurate and runs orders of magnitudefaster than existing density-based trajectory clustering approaches.

Index Terms—trajectory clustering; road network trajectory;location-base applications

1 INTRODUCTIONRecent years have witnessed a rapidly growth in thefield of location-based services (LBSs) and applica-tions due to the pervasive use of GPS receivers andWiFi or location sensing technology embedded in mo-bile devices (e.g. cellular phones, automobiles). Withthe number of smartphones in use world wide reaches1.038 billion units in 2012 and is predicted to reach 2billion units by 20151, LBS revenue is forecasted toreach an annual global total of $13.5 billion by 20152,up from $4 billion in 20123. Ubiquitous GPS/WiFi-enabled mobile devices generate a huge amount oftrajectory data, which are sequences of time-orderedlocations of mobile objects. There has been a lot ofwork on collecting, storing, indexing and queryingtrajectories of mobile objects [1] [2] [3] [4] [5]. Werefer to the trajectories of mobile objects in a roadnetwork as MO trajectories. Clustering trajectories ofthese objects provides the most value and has a widerange of LBS applications. For example, the resultingclusters would help provide knowledge about trafficflows as well as dense areas in a road network. Suchknowledge is very useful for applications in vehicularad hoc network (VANET) [6] [7], traffic monitoring [8],

1. http://www.strategyanalytics.com/2. http://www.gartner.com/3. http://www.abiresearch.com/

transportation planning [9] and location-based adver-tising [10]. We briefly present below two interestingapplication scenarios which show the usefulness oftrajectory clustering and motivate us to study theproblem of clustering trajectories of mobile objectsmoving in a spatially constrained road network.

Public transit planning. The establishment of pub-lic transportation system always target the road net-work routes which can maximize the utilization ofpublic transportation vehicles. Since a bus runs alongconsecutive road segments, in order to make the bestuse of the bus service regarding the number of pas-sengers it can serve and to improve public transportconvenience, e.g. reduce the number of transit stopsfor passengers, knowing which routes with not onlyhigh density but also high continuity helps optimizerail/bus line and terminal arrangement.

Location based advertising on mobile devices.Mobile advertisers are trying to improve the matchingof user locations and marketing information. It wouldbe beneficial for local stores to send advertisements,e.g. special offers or discounts, to mobile devicestaking path in major traffic flows passing by theirstores. For example, consider when a store wants tosend out text messages with a discount coupon. Ifthe text message is sent to a group of people in anearby dense road segment which is also part of aroute with significant traffic flow leading to the store,it will better increase the chance that people receivingthe coupon will actually come to the store during theirtrips, compared to when the message is sent to a densearea but does not belong to the traffic flow passing bytheir store.

A straight forward solution to address this clus-tering problem is to adapt the traditional density-based clustering algorithms (e.g. DBSCAN [11] orOPTICS [12] - a variant of DBSCAN) to group sim-ilar MO trajectories. However, clustering trajectoriesas a whole does not take into account similar sub-trajectories since trajectories have various lengths.This is addressed by partial trajectory clustering. TheTraClus algorithm [13] is the representative methodfor clustering portions of a trajectory instead of thewhole trajectory. Specifically, TraClus is a two phaseclustering algorithm. In the partitioning phase, eachtrajectory is first examined sample by sample to iden-tify a sequence of characteristic points at which themoving object makes a sharp turn, called a rapidchange in direction, and then the trajectory is par-

http://www.strategyanalytics.com/

http://www.gartner.com/

http://www.abiresearch.com/

2

titioned into line segments by these characteristicpoints. The grouping phase performs a DBSCAN styleclustering on those line segments to find similar sub-trajectories. A drawback to this and most similar par-tial trajectory clustering approaches is that they onlyconsider distances in Euclidean space, while showreasonable performance for clustering trajectories ofobjects moving freely (e.g. the movement of animalsthrough a forest or the movement of hurricanes acrossan ocean), are inappropriate for clustering MO trajec-tories. We argue that clustering MO trajectories shouldtake into account the traffic locality characterized bythe spatial constraints of the underlying network,e.g. road segments and intersections, the movementbehavior of mobile objects as well as the traffic flowson consecutive road segments, in addition to mobileobject density and the road network proximity. Weillustrate our positional statements with the followingset of examples.

Example 1. We have a set of four objects movingalong the same road segment and produce four MOtrajectories as shown in Figure 1(a). In the context ofa road network, those objects have similar movementbehavior with respect to the road segment. Therefore,their trajectories should be grouped together regard-less of the difference in their specific movement onthe road segment. Hence, it is unnecessary to furtherpartition these trajectories, even though some rapidchanges in direction are found in those trajectories.

Example 2. Consider two trajectories TR1 and TR2

in Figure 1(b). TR1 describes an object moving onroad segment RS1, changing lanes, and then making aleft turn to road segment RS2. TR2 describes anotherobject moving straight from RS1 to RS3. If we usethe TraClus algorithm, to cluster the MO trajectories,TR1 will be partitioned into four line segments: A1A2,A2A3, A3A4, and A4A5, while TR2 consists of onlyone line segment which is too long to be groupedwith the segments of TR1 on RS1 (A1A2, A2A3, andA3A4). We argue that although there is no significantturning point found in TR2, we should still partitionit into two line segments fragments corresponding tothe two distinct road segments RS1 and RS3.

Example 3. In Figure 1(c), road segments RS1 andRS3 share a larger number of mobile objects than thatof road segments RS1 and RS2. Thus, when considertraffic continuity, traffic on road segment RS1 shouldbe merged with traffic on RS3 rather than RS2.

Example 4. We have three trajectories: TR1 and TR2

are on the same road segment, and TR3 is on anotherroad segment as shown in Figure 1(d). Using theEuclidean distance, TR3 is closer to TR2 than TR1,even though using the road network distance (eithersegment length based or travel time based), TR1 iscloser to TR2 than TR3.

In this paper, we present NEAT�a road NEtworkAware approach to Trajectory clustering. To the best ofour knowledge, NEAT is the first technique to addressthe MO trajectory clustering problem by taking intoaccount the continuity of movements restricted by

the underlying road network, the network proximityand the traffic flows among consecutive road seg-ments to organize MO trajectories into spatial clusters.The clusters discovered by NEAT are groups of sub-trajectories, which describe both dense and highlycontinuous traffic flows of mobile objects. A uniquefeature of the NEAT framework is the formulation ofthe three phase trajectory partitioning, merging andclustering process. Specifically, based on the obser-vations from the above examples, we identify threeimportant design guidelines. First, for a given set ofMO trajectories, the road intersections can be viewedas the initial partitioning points where trajectoriescan be split into atomic sub-trajectories called tra-jectory fragments. Second, the trajectory fragmentscorresponding to a road segment can be viewed as alocally dense cluster of objects involved in the givenset of trajectories. Third, trajectory fragments shouldbe clustered based on their continuity with regard tothe traffic flows on the consecutive road segments.In addition, the proximity measure in a road net-work space uses shortest path distance instead of Eu-clidean distance. We perform extensive experimentswith road network mobility traces generated usingdifferent scales of road network maps. Our experi-mental results demonstrate that the NEAT approachis highly efficient and accurate. It can run more thanthree orders of magnitude faster than existing density-based trajectory clustering approaches.

2 NEAT MODEL AND FRAMEWORK

We first describe a reference model for road networksand present the basic concepts and operations ofthe NEAT model. We end this section with a briefoverview of the NEAT three phase framework andan illustrative example.

2.1 Road-Network ModelA road network is represented by a single directedgraph G = (V, E), composed of the junction nodesV = {n0, n1, . . . , nN

} and directed edges E ={(sid, n

i

nj

)|ni

, nj

2 V}.An edge e = (sid, n

i

nj

) 2 E representing a roadsegment connecting two junctions n

i

and nj

in thereal road network. The listed order n

i

nj

indicatesthe direction from n

i

to nj

of the road segment.For road segments which have bidirectional lanes,we use edge e = (sid, n

i

nj

) and e0 = (sid, nj

ni

) todenote the fact that the road segment is bi-directionaland we label each edge with the corresponding roadsegment identifier sid. The length of a road segmente = (sid, n

i

nj

) is denoted by length(ni

nj

) .Let L(e) denote the set of adjacent edges of e =

(sid, ni

nj

) and Lni(e) denote the set of adjacent edges

of e, which connect to e at junction ni

. Hence, wehave L(e) = L

ni(e) [ Lnj (e). If n

i

is a dead-end nodeconnected by edge e, then L

ni(e) = �. If two edgesei

and ej

are adjacent, function I(ei

, ej

) will returnthe junction node (intersection) of these two edges. Aroute in the road network G is a network path e0e1...eksuch that e

i+1 2 L(ei

) (0 i < k).

3

(a) An example of themovement of four vehicleson a road segment

A1 A2

A3 A4

A5

TR1TR2

RS1

RS2

RS3

RS1

RS2

RS3

TR1

TR2

TR3

(b) Parts of two trajecto-ries on the same road seg-ment that are overlooked inTraClus

A1 A2

A3 A4

A5

TR1TR2

RS1

RS2

RS3

RS1

RS2

RS3

TR1

TR2

TR3

(c) Traffic on road segment RS1

should be merged with trafficon road segment RS3 ratherthan RS2

A1 A2

A3 A4

A5

TR1TR2

RS1

RS2

RS3

RS1

RS2

RS3

TR1

TR2

TR3

(d) An example showing thedifference between the Eu-clidean proximity and theroad network proximity

Fig. 1: Illustrative examples

We define a road network location of a mobile objectas a tuple of three elements: sid � the identifierof the road segment on which this object resides,the geometric coordinates (x, y) of the position ofthe object on the road segment sid, and the times-tamp t when the position is recorded, denoted byl = (sid, x, y, t). A road network location can alsobe represented by a tuple (sid, p, t) where p is theoffset of the location from the start junction of theroad segment identified by sid. We use the (x, y)coordinates to represent location in this paper due tothe popularity of geometric coordinates. We measurethe network proximity between two network locationsusing the network-based distance metric [14]. Let l

i

and lj

denote the network locations that belong to theedge e

i

= (sidi

, na

nb

) and the edge ej

= (sidj

, nc

nd

)respectively and e

i

6= ej

. The network distance be-tween l

i

and lj

is the length of the shortest pathbetween l

i

and lj

, denoted by distN

(li

, lj

). We usethe terms point and location interchangeably to referto a road network location and the terms junction,intersection and endpoint interchangeably to refer toa road junction. We use (sid, n

i

nj

) and ni

nj

inter-changeably to denote a segment connecting two endnodes n

i

and nj

when there is no confusion.

2.2 NEAT model

In this section, we define the basic concepts and op-erations of our road network aware trajectory modelwith illustrative examples.

For each mobile object, each of his/her tripswith a beginning location and a destination locationforms a trajectory. A trajectory, denoted by TR =(trid, l0l1...ln), is a time-ordered sequence of locationsl0, l1, ..., ln of an MO in the road network over timeand uniquely identified by a trajectory identifier trid.A subsequence of points in a trajectory forms a sub-trajectory. Note that in our model, the temporal infor-mation in a trajectory, i.e. the recorded timestamps,determines the order of locations in the trajectory.Therefore, the direction of movement of an object isalways preserved. For presentation convenience, wedo not explicitly mention the directions of movementin the definitions and figures used in the subsequentsections of the paper when no confusion occurs.

Definition 1. (t-fragment) Let TR = {trid, l0l1...ln}denote a trajectory consisting of n+1 points and triddenote the trajectory identifier. A t-fragment of TR,

denoted by tf = {trid, sid, lk

lk+m

}, represents a sub-trajectory l

k

lk+1...lk+m

consisting of m+1 consecutivepoints extracted from TR which lie on the same roadsegment sid, i.e. l

i

.sid = lj

.sid (8i, j : k i, j k +m, i 6= j).

There can be more than one t-fragment associatedwith a road segment for a given trajectory (in case themobile object travels on the road segment multipletimes during her trip).

Definition 2. (base cluster) Let T denote a set oftrajectories and e denote a road segment. A base clusterS with respect to e is a group of distinct t-fragments,each of these t-fragments belongs to a trajectory in Tand is associated with e. The base cluster S is formallydefined as follows:S = {tf

i

|TR(tfi

) 2 T , tfi

.sid = e.sid} whereTR(tf

i

) denotes the trajectory from which the t-fragment tf

i

2 S is extracted. The road segment eis called the representative of the base cluster S, andis denoted by eS . The base cluster S is said to beassociated with the road segment eS .

We call a trajectory which has t-fragments in a basecluster the participating trajectory of the base cluster.

Definition 3. (trajectory cardinality) The set of par-ticipating trajectories of a base cluster S is definedas: PTr(S) = {TR(tf

i

)|8tfi

2 S}. The cardinality ofPTr(S), denoted by |PTr(S)|, is called the trajectorycardinality of S.

Definition 4. (cluster density) The density of a base clus-ter S, denoted by d(S), is the number of t-fragments inS. Given B = {S0, S1, ..., SN

} is a set of base clusters,we call the base cluster with the highest density in Bthe dense-core of B, denoted by densecore(B).

Definition 5. (netflow) The netflow between two baseclusters S

i

and Sj

, denoted by f(Si

, Sj

), is thenumber of trajectories participating in both clusters:f(S

i

, Sj

) = |PTr(Si

) \ PTr(Sj

)|The function netflow between two base clusters

computes the number of common objects traveled onboth representative road segments eSi and eSj .

Definition 6. (f -neighborhood) Let B denote a set ofbase clusters, S

i

denote a base cluster and nu

denoteone endpoint of eSi . The f -neighborhood of S

i

wrt.nu

, denoted by Nf

(Si

, nu

), is the set of base clustersthat have at least one common participating trajectory,and is formally defined as: N

f

(Si

, nu

) = {Sj

| eSj 2

4

n1 n2 n4

n3

n5

S1 S2

S3

S4

(a) A trajectory hasthree t-fragments

n1 n2 n4

n3

n5

S1 S2

S3

S4

(b) An example of base clusters andflow cluster

Fig. 2: Examples of the elements in the NEAT model

Lnu(e

Si) & f(Si

, Sj

) > 0}.Let n

v

be the other endpoint of eSi . We define the f -neighborhood of S

i

wrt. eSi as: Nf

(Si

) = Nf

(Si

, nu

)[N

f

(Si

, nv

). Each Sj

2 Nf

(Si

) is called the f -neighborof S

i

. Note that the f -neighbor is a symmetric relation.

Definition 7. (maxFlow-neighbor) Let Si

denote a basecluster and n

u

denote one endpoint of eSi . We callSk

the maxFlow-neighbor of Si

at nu

, denoted bymaxFlow(S

i

, nu

), if f(Si

, Sk

) = max{f(Si

, Sj

)| Sj

2N

f

(Si

, nu

)}. f(Si

, Sk

) is called a maxFlow of Si

.

Definition 8. (flow cluster) A flow cluster (or a flow forshort) is an ordered list of base clusters, denoted byF = {S0, S1, ..., SN

}, where Si+1 2 N

f

(Si

)(0 i < N)and eS0eS1 ...eSN forms a route in the road network.We call eS0eS1 ...eSN the representative route of F ,denoted by r

F

.

Since a base cluster is comprised of t-fragmentsand a flow cluster is comprised of base clusters, wesay that a flow cluster is comprised of t-fragments.Therefore the definition of trajectory cardinality alsoapplies to a flow cluster. We define the netflow be-tween a flow cluster F and a base cluster S asf(F, S) = |PTr(F ) \ PTr(S)|.

Figure 2(a) shows a trajectory which can be rep-resented by a sequence of three t-fragments A0A1,A1A4 and A4A6. In Figure 2(b), we have five tra-jectories located on four road segments n1n2, n2n3,n2n4 and n2n5. There are three t-fragments whichlie on n1n2. Those three t-fragments are groupedtogether in base cluster S1 whose representative roadsegment is n1n2. In total, we have a set of base clustersB = {S1, S2, S3, S4}. The density of S1 is d(S1) = 4.Similarly, we have d(S2) = 3, d(S3) = 1 and d(S4) = 2.S1 is the dense-core of B since d(S1) = 4 is the highestdensity. The netflows among these base clusters are:f(S1, S2) = 2, f(S1, S3) = 1, f(S1, S4) = 1, f(S2, S3) =0 and f(S2, S4) = 1. The f -neighborhood of S1 wrt.n2 is N

f

(S1, n2) = {S2, S3, S4}, in which S2 is themaxFlow-neighbor of S1. The possible flow clustersinclude {S1, S2}, {S1, S3}, {S1, S4} and {S2, S4}.

2.3 NEAT Framework OverviewGiven a road network G = (V, E) and a set of Ntrajectories collected from mobile objects travellingon G, denoted by T = {TR1, TR2, ..., TRN

}, NEAT

performs road network aware trajectory clustering inthree phases:

Phase 1 - Base cluster formation: We transform thegiven set of MO trajectories into a set of trajectoryfragments (t-fragments). Then we organize these t-fragments into base clusters by grouping those t-fragments that correspond to the same road segmentinto one base cluster.

Phase 2 - Flow cluster formation: We selectively mergebase clusters into flow clusters based on the majormobility flows and the flow continuity inherent in thegiven set of trajectories.

Phase 3 - Flow cluster refinement: We optimize theclustering result using a density-based refinementmethod. Our density-based optimization modifies thewidely-used Hausdorff distance with the shortestpath measurement and adapts the DBSCAN cluster-ing algorithm [11].

The final result produced by NEAT is a parti-tioning of the given MO trajectories into a set oftrajectory clusters O = {C1, C2, ..., CK

} where eachcluster C

i

(0 i K) contains a set of trajectoryfragments satisfying two criteria: (1) High density: thetrajectory fragments in the same cluster are within thenetwork proximity of each other; (2) High continuity:the trajectory fragments in the same cluster show amajor traffic flow in the given trajectory data.

The NEAT system uses 3-tier client/server archi-tecture. Each client node acts as a mobile devicewhich records its locations, sends its trajectories toa NEAT server and makes requests to the serverto get trajectory clustering results for a particularroad network. NEAT server also distributes trajectorydatasets across multiple nodes in a cluster. These datanodes can perform some data preprocessing tasks.In this paper, we focus on the trajectory clusteringapplication running on the NEAT server.

n1 n2n3

n4

n5

n6

n9

n10

n12

n11

n13

n8

n1 n2 n3

n4

n6

n7

n9n10

n12

n11

n13

n8

n5

n1 n2 n3

n4

n5

n6

n7

n9n10

n12

n11

n13

n8

n1 n2 n3

n4

n6

n7

n9n10

n12

n11

n13

n8

n5

(1) Base Cluster Formation

(2)Flow Cluster Formation

(3)Flow Cluster Refinement

12 base clusters

F1 F3

F2

3 flow clusters:F1, F2, F3

2 clusters:C1 = {F2}

C2 = {F1, F3}

n7

Fig. 3: An example of three phase clustering in theNEAT framework.

Figure 3 illustrates the three phase NEAT frame-work. Consider a set of trajectories located on 12road segments in an example road network as shown

5

in the upper-left of Figure 3. In Phase 1, a set ofbase clusters are constructed from the given set oftrajectories (the upper-right of Figure 3). In Phase2, we utilize road network information and mobileobject movement characteristics to group base clustersinto flow clusters. Important operations include com-puting the netflows, finding the f -neighborhoods andcompute the maxFlow-neighbors. For our example,the result of this phase is a set of three flow clusters{F1, F2, F3} (the bottom-right of Figure 3). In Phase 3,we refine the resulting flow clusters by merging thoseflow clusters that are close in terms of a network baseddistance measure and a distance threshold. In Figure3, F1 and F3 are merged in Phase 3 to form a largertrajectory cluster. The two clusters C1 and C2 shownin the bottom-left of Figure 3 are the final result ofclustering for this specific example.

3 THREE PHASE TRAJECTORY CLUSTERINGIn this section, we present in details the algorithmsused in the NEAT clustering framework to producebase clusters in Phase 1, flow clusters in Phase 2 andthe final trajectory clusters in Phase 3.

3.1 Phase 1 - Base Cluster FormationWe perform the base cluster formation in two steps.First, we examine the input set of MO trajectoriesand partition each MO trajectory into a sequence oft-fragment. Second, we group those t-fragments thatbelong to the same road segments into one base cluster.

3.1.1 Partitioning Trajectories into t-fragmentsSince a mobile object moves along contiguous roadsegments, two consecutive locations recorded in a MOtrajectory are either on the same road segment or ontwo different road segments. In the latter case, thetwo different road segments are either contiguous orlie on the route (travel path) of the mobile object suchthat they are connected through a sequence of roadjunction nodes on the same path. For each trajectoryTR

k

= {tridk

, l0l1...ln} in the given trajectory dataset,we start the t-fragment extraction by examining TR

k

from the first location l0 to the last location ln

in thesequence of location samples {l0l1...ln} of the trajec-tory. Based on the fact that a mobile object has to gothrough the road intersection when moving betweentwo contiguous road segments, we take every twoconsecutive points in the trajectory, say l

i

and li+1,

and check if their road segment identifiers, denotedby sid(l

i

) and sid(li+1) respectively, are different. If

sid(li

) 6= sid(li+1), we know that l

i

and li+1 are on

different road segments. If they are contiguous, wecan obtain the road junction node that intersects thesetwo road segments. If the two road segments happento be not contiguous, we can obtain the sequence ofroad junction nodes connecting them on the travelpath of the object using the map-matching approach[15]. Next, we insert the obtained junction node(s) asnew points in between l

i

and li+1 in the trajectory

being examined. The junction nodes added to a tra-jectory in this phase are marked as different points

than the original location samples. After examiningevery point in a given trajectory TR

k

, the sequenceof junction nodes added to TR

k

will serve as the tra-jectory splitting points used to partition the trajectoryinto t-fragments. When a given set of trajectories aregiven as time series of geometric coordinates, NEATwill first preprocess the set of trajectories using map-matching (MM) algorithms such that each point ina trajectory is mapped to a road network locationas defined in Section 2.1. We use the SLAMM map-matching algorithm [15] in this data prepocessingstep. MM algorithms for bulk location data are moreeffective as noted in [15] because look-ahead andlook-around algorithms can catch many known er-rors of earlier MM algorithms, such as map-matchinglocation samples between two nearby parallel roadsegments.

By transforming a trajectory into a set of t-fragments, only the first point and the last point in theoriginal trajectory are kept, together with the newlyinserted road junction points. These points play crit-ical roles in extracting t-fragments and constructingbase clusters in the next step of Phase 1 as well asin subsequent phases of NEAT. Furthermore, the se-quence of t-fragments extracted from a trajectory stillmaintains the traveling route, the movement directionas well as the identifier of the original trajectory.

3.1.2 Grouping t-fragments into Base ClustersWe examine the t-fragments extracted from the MOtrajectories and group them by their road segmentidentifiers. Each group of t-fragments correspondingto a road segment forms one base cluster with theroad segment as its representative (Definition 2). Asdiscussed in Section 1, the t-fragments on the sameroad segment are considered close in terms of net-work proximity and they display similarity in themovement of their mobile objects. We compute thedensity of the resulting base clusters (Definition 4),then sort them by their densities in descending order.The output of Phase 1 is a sorted list of base clusterswith the first base cluster as the dense-core of theset of base clusters. The base clusters are used as thebuilding blocks of our flow-based clustering in thenext phase. We will use flow and density controlledmerging algorithms to construct the final trajectoryclusters of a given trajectory dataset T .

3.2 Phase 2 - Flow Cluster FormationThe flow-based clustering algorithm takes as an inputthe list of base clusters B produced from Phase 1. Itstarts by selecting one base cluster in B as the firstinitial flow cluster. It then expands this initial clusterby adding other base clusters one at a time such thatthe representative road segments of the base clustersselected for merging are concatenated to make a route.This expanding process will stop when every basecluster in B has been examined for its potential to bemerged with existing flow clusters. We consider flow,density and road speed limit as three characteristicsof a traffic stream to define a set of merging criteria.We construct flow clusters by grouping base clusters

6

according to these criteria. We discuss below how tochoose the initial base cluster and determine whichbase clusters are the best candidates for merging withthe flow cluster under consideration.

3.2.1 Density-based Flow Cluster InitializationIn the first prototype of NEAT, we take the dense-core of the density-ordered list of base clusters B tobegin the flow-based clustering process. There are atleast two reasons for choosing densecore(B) as theinitial flow cluster to start Phase 2. Given that a majortraffic stream usually covers the road segment(s) withthe highest traffic density in the road network, if werandomly pick a base cluster in B to initialize a flowcluster, it might lead to a flow cluster that describesa negligible traffic stream and will eventually befiltered out. In addition, choosing densecore(B) alsoguarantees that the set of base clusters are mergedin a deterministic order. Hence, the resulting flowclusters are always the same for the same input setof trajectories.

3.2.2 f -neighbor Merging for Flow ClustersIn this section, we describe how to merge a basecluster into an existing flow cluster. Suppose we arein the process of expanding a flow cluster F , which isin the form of an ordered list of base clusters, denotedby {S

a

Sa+1 . . . Sb�1Sb

}. We extend the list by insertinga base cluster either at the front or at the end ofthe list. Inserting a base cluster at the front of thelist is performed similarly to inserting a base clusterat the end of the list. Let n

u

denote the other endpoint of eSb other than the intersection I(eSb�1 , eSb) ofthe two representative road segments of base clustersSb�1 and S

b

. The base cluster Sb+1 to be added to

the end of the list has to be an f -neighbor of Sb

wrt.nu

, i.e. Sb+1 2 N

f

(Sb

, nu

). As it is well known thatflow, density and velocity are the three variables oftraditional theory for uninterrupted traffic flow [16],we choose S

b+1 among the base clusters in Nf

(Sb

, nu

)by considering the flow factor q, density factor k andspeed limit factor v.

Definition 9. Given a base cluster S and nu

as oneendpoint of road segment eS , the flow factor q, densityfactor k and speed limit factor v of a base cluster S

j

2N

f

(S, nu

) wrt. S are defined respectively as follows:

q = f (S, Sj

) /|PTr(S)| (1)

k = d (Sj

) /(d(S) +X

Si2Nf (S,nu)

d (Si

)) (2)

v = speed (Sj

) /X

Si2Nf (S,nu)

speed (Si

) (3)

where speed(Sj

) is the speed limit of eSj .

Definition 10. Given a base cluster S and nu

as oneendpoint of eS , the merging selectivity of a base clusterSj

2 Nf

(S, nu

) is defined as:

SF (S, Sj

) = wq

⇥ q + wk

⇥ k + wv

⇥ v (4)

where the coefficients wq

, wk

and wv

determine theweights of q, k, v respectively. The weights w

q

� 0,w

k

� 0 and wv

� 0 satisfy wq

+ wk

+ wv

= 1.

Here we assume that S is currently either the firstelement or the last element of a flow cluster F . Thus,selecting a base cluster to merge with F implies select-ing a base cluster to merge with S. The three factors q,k, v are computed for each candidate base cluster S

j

and normalized between [0,1] to better represent theirrelative capability of extending the flow cluster underexamination to form a route with high continuity anddensity as our clustering objectives. Specifically, theflow factor q measures the relative netflow betweenS and S

j

, thus, high q means that Sj

maintains highcontinuity of the traffic flow when merging with S.The density factor k measures the relative densityof S

j

among neighboring base clusters at the roadjunction shared with S, namely n

u

, which contributesto the density characteristic of the extended flow. Thespeed limit factor v measures how fast mobile objectscan move within S

j

compared to its neighbors at nu

,which is also a capable metric to find popular routesas mobile objects tend to travel following routes witheither shortest distances or shortest travel times. InDefinition 10, we combine all three factors q, k, vunder a weighing scheme to select the most relevantbase cluster to merge wrt. our clustering objectives.According to the above definitions, the base cluster inN

f

(S, nu

) which has the highest merging selectivity,denoted by SF

max

, will be chosen to merge with S.Each base cluster which has been merged into

a flow cluster is removed from B. We also usea threshold minCard for the trajectory cardinalityof a flow cluster to filter out those flow clusterswhose trajectory cardinalities are smaller than min-Card. Finally, the condition to stop expanding thelist {S

a

Sa+1 . . . Sb�1Sb

} at the end of the list iswhen S

b

has no f -neighbor wrt. its endpoint nu

, i.e.N

f

(Sb

, nu

) = ;. A similar condition is applied to stopexpanding the list at the front. When both conditionsare reached, we add the resulting flow cluster to W ,the output set of flow clusters in Phase 2. Then, webegin the next iteration of flow-based clustering withthe remaining base clusters in the list B with the sameprocess as described above, until all the base clustersare assigned to flow clusters, i.e. B becomes empty.

3.2.3 Decisions on f -neighbor MergingSuppose we are at a merging step and the highestmerging selectivity at this step is SF

max

. We dis-cuss the case when there are more than one basecluster which have the merging selectivity values ofSF

max

at a merging step. Suppose a base cluster Sis at one end of an intermediate flow cluster F andS has m equally qualified neighbors S1, S2, . . . , Sm

to merge wrt. their merging selectivity values, i.e.SF (S, S1) = . . . SF (S, S

m

) = SFmax

. In our flow-basedclustering algorithm described above, we randomlypick one base cluster in {S1, S2, . . . , Sm

}. We call it“random SF

max

” merging scheme. Here we proposetwo schemes in which instead of randomly picking,we carefully consider the potential merging oppor-tunity of each equally qualified base cluster, called

7

“look-back” and “look-ahead” schemes respectively.Both schemes focus on the flow property of the trafficstream.

Fig. 4: An example of a merging iteration in which abase cluster S at the end of an intermediate flow

cluster F has two neighbors withSF (S, S

i

) = SF (S, Sj

) = SFmax

Look-back Merging. We compare the netflowsbetween the flow cluster under examination andm base clusters f(F, S

i

), f(F, S2), ..., f(F, Sm

) andchoose one S

j

(1 j m) such that f(F, Sj

) =max{f(F, S1), f(F, S2), ..., f(F, Sm

)}. This means wewant to maintain the movement of as many objectswhich have already on the representative route of theflow cluster F as possible. So that we “look-back” tothe flow cluster F which has been extended so far andpick among those m candidates the one that share thelargest number of participating trajectories with F .

Look-ahead Merging. Let nu

be the intersection ofeS and eS1 , eS2 ,..., eSm . We consider the maxFlows ofthose m candidates at the other endpoints, which arenot n

u

, of their representative road segments. If themaxFlow of S

j

has the biggest value among them,we will select S

j

to merge with S. In this scheme, we“look-ahead” to the maxFlow at the other endpointof each candidate’s representative road segment andselect the largest one. By using “look-ahead” scheme,we want to make sure that we capture all the signifi-cant netflows while expanding a flow cluster.

An example of this case is illustrated in Figure 4where a base cluster S at the end of an interme-diate flow cluster F with SF (S, S

i

) = SF (S, Sj

) =SF

max

. In the “random SFmax

” scheme, Si

and Sj

have equal merging opportunity, thus, we randomlypick either S

i

or Sj

to merge with S. In the “look-back” scheme, we pick the base cluster which givesmax(f(F, S

i

), f(F, Si

)). In the “look-ahead” scheme,let S

i+1 be a maxFlow � neighbor of Si

and Sj+1

be a maxFlow � neighbor of Sj

, we pick the basecluster either S

i

or Sj

based on which one givesmax(f(S

i

, Si+1), f(Sj

, Sj+1)).

3.2.4 Weight Assignment CriteriaThe setting of the weights w

q

, wk

and wv

is usually de-termined by the specific location-based applications. Ifan application favors the three factors equally whenconsidering a traffic stream, we can set w

q

= wk

=w

v

= 1/3. If we set (wq

, wk

, wv

) = (0, 1, 0), we willmerge a base cluster with its densest f -neighbor.The resulting flows in this case will describe a routewhere traffic is highly concentrated. A combinationof (w

q

, wk

, wv

) = (0, 0, 1) will produce flow clustersthat describe the routes where objects can travel the

fastest. For traffic monitoring applications, the flowfactor and the density factor can be considered themost important factors so that we can set (w

q

, wk

, wv

)to (1/2, 1/2, 0). If the application emphasizes the flowproperty of a traffic stream and considers the flowfactor as the most important one, it can set w

q

= 1and w

k

= wv

= 0. Therefore, the maxFlow-neighborof S (Definition 7) will be selected to merge with S.

Beside user supplied weights, NEAT also supportssystem generated weights to make it run in an au-tomated manner. For system supplied weights, weintroduce an adaptive weight assignment scheme whichselects the most critical characteristics of the trafficstream based on the values of the flow, density andspeed limit factors at each merging step. Suppose weare in the process of forming a flow cluster. At eachmerging step, we compute the maximum values q

max

,kmax

, vmax

of the flow factor q, density factor k andspeed limit factor v respectively. We the assign theweights w

q

, wk

and wv

to be used at the correspondingmerging step as follows:

wq

= qmax

/(qmax

+ kmax

+ vmax

) (5)

wk

= kmax

/(qmax

+ kmax

+ vmax

) (6)

wv

= vmax

/(qmax

+ kmax

+ vmax

) (7)

Note that all three factors q, k, v are normalizedbetween 0 and 1. Thus, their maximum values ateach merging step q

max

, kmax

, vmax

can show therelative importance of each property of the trafficstream passing the road intersection corresponding tothe base cluster under consideration in the mergingstep. By adaptively assigning weights based on thosemaximum values as in Formula 5, 6, 7, NEAT putsmore weight on the critical factor at each merging stepand therefore, automatically discovers flow clusterswhich capture important traffic flow inherent from thegiven trajectory data.

The pseudo code of flow-based clustering is pre-sented in Algorithm 1. It performs on a set of baseclusters B until all the base clusters are assigned toflow clusters, i.e. B is empty (lines 3-11). Each roundof flow-based clustering starts with the dense core ofB (line 4). The procedure expandF low() (lines 19-32)performs f -neighbor merging to extend a flow fromits both sides (lines 8-9) given the factor weights w

q

,w

k

, wv

. If the input weights (wq

, wk

, wv

) are set to(0, 0, 0), the adaptive weight assignment scheme isused to adaptively compute the weights at each merg-ing step (line 21). It computes the f -neighborhoodof the base cluster at one end of the flow (line 19)then select the f -neighbor with maximum mergingselectivity to assign to the current flow. The flowcontinues to be expanded (line 31) as long as thereare candidates to be merged (line 23), otherwise itstops. We also filter out the flows with insufficientnumber of participating trajectories. We set minCardto equal to the average trajectory cardinality of theflow clusters in the set W . Those flow clusters whosetrajectory cardinalities are smaller than minCard willbe discarded (lines 13-17).

8

Algorithm 1 Flow-based clustering

Input: (1) A set of base clusters B = {S0, S1, ..., SM

}(2) A road network G(3) Factor weights w

q

, wk

, wv

Output: A set of flow clusters W = {F1, F2, ..., FQ

}1: flowId = 0;2: W = ;;3: while B 6= ; do4: S

c

= densecore(B);5: add S

c

to FflowId

;6: remove S

c

from B;7: {n

c1, nc2} =getEndPoints(eSc);8: expandFlow(S

c

, nc1, f lowId, wq

, wk

, wv

);9: expandFlow(S

c

, nc2, f lowId, wq

, wk

, wv

);10: flowId++;11: end while12: minCard =averagePtr(W );13: for each F

i

2 W do14: if Ptr(F

i

) < minCard then15: remove F

i

from W16: end if17: end for18: Procedure expandFlow(S, n

u

, f lowId, wq

, wk

, wv

){

19: Nf

=getfNeigborhood(S, nu

);20: if (w

q

, wk

, wv

) = (0, 0, 0) then21: (w

q

, wk

, wv

) = getAdaptiveWeights(S,Nf

);22: end if23: if N

f

6= ; then24: for each S

i

2Nf

do25: SF (S

i

) = mergeSelectivity(Si

, wq

, wk

, wv

);26: end for27: SF (S

j

) = maxSi2NfSF (S

i

) ;28: add S

j

to FflowId

;29: remove S

j

from B;30: n

v

=getEndPoints(eSj ) \ {nu

};31: expandFlow(S

j

, nv

, f lowId, wq

, wk

, wv

);32: end if33: }

3.3 Phase 3 - Flow Cluster RefinementThe third phase of NEAT is designed to exploit oppor-tunities to further merge some flow clusters producedfrom Phase 2. We describe in this section a density-based approach to group flow clusters. We modifythe Hausdorff metric to measure the distance betweentwo flow clusters and adapt DBSCAN algorithm [11]to merge flow cluster merging process. This optimiza-tion is especially effective for real time trajectory clus-tering where online clustering can be executed in anincremental and distributed manner. In particular, thefirst two phases of NEAT can be performed on eachnewly arrived set of trajectories. The new flow clustersare then merged with the available flow clusters toproduce compact clustering results.

3.3.1 Distance Function for Flow ClustersWe modify the popular Hausdorff distance functionto measure the distance between two flow clusters F

i

and Fj

in terms of network proximity. The distance

between two flow clusters Fi

and Fj

can be deter-mined by the distance between their representativeroutes r

Fi and rFj . In the first prototype of NEAT,

we measure the distance between rFi and r

Fj by thenetwork proximity of their ending locations. Whenthe ends of two flow clusters are within a predefinednetwork distance, we merge them into a larger clustersuch that the resulting cluster will be able to show agroup of frequent routes between two hotspot areasas illustrated in Figure 3.

Definition 11. Given two flow clusters Fi

2 W andFj

2 W , the distance between Fi

and Fj

is defined as:

distN

(Fi

, Fj

) = distN

�rFi , rFj

�

= max{maxa2{a1,a2}min

b2{b1,b2}dN (a, b),

maxb2{b1,b2}min

a2{a1,a2}dN (b, a)} (8)

where dN

(a, b) is the shortest path from a to b, and{a1, a2}, {b1, b2} are the two endpoints of r

Fi , rFj

respectively.

3.3.2 Density-based OptimizationWith the modified Hausdorff distance measure forflow clusters, we need an algorithm to merge theflow clusters when their distance is within someuser-defined or system-supplied default threshold. Weadapt the DBSCAN algorithm to group the set offlow clusters when the density opportunity exists. TheDBSCAN algorithm was originally used to cluster aset of data points and requires two parameters: a dis-tance threshold " between two points and a minimumnumber of points minPts in a cluster. An object is amember of a cluster if it has at least MinPts neigh-boring objects within a given radius ". All the objectsin its "-neighborhood are also members of the samecluster. Otherwise the object is classified as noise. Wemake the following modifications: (1) The data unit tobe clustered is a flow cluster; (2) The distance functionis our modified Hausdorff distance between two flowclusters; (3) No minimum cardinality is set for theresulting cluster; (4) The density-based clustering formerging flow clusters always starts each round withthe flow cluster whose representative route is thelongest. It ensures that the final clustering results arealways the same with the same distance threshold.This is different from DBSCAN in which data pointsare not processed in a deterministic order.

3.3.3 Flow-Distance Computation OptimizationAs shown in Formula 8, a distance computation foreach pair of flow clusters consists of four shortestpath computations. Note that d

N

(a, b) and dN

(b, a) arethe same since we consider undirected graphs. LetW = {F1, F2, . . . , FC

} (C > 1) denote the set of flowclusters generated from Phase 2 of NEAT. For each F

i

,we need to compare it with the rest C�1 flow clusters,one at a time, to determine whether there is a mergingopportunity. When the number of flow clusters in Wis large, the cost of computing the network proximityof a pair of flow clusters can be expensive. If we usethe popular Dijkstra’s network expansion algorithmto compute these shortest paths, the computation

9

cost can be high compared to the standard Euclideandistance, especially when the representative routes ofthe flow clusters are long in terms of segment counts.For a graph with n nodes and m edges, traditionalnetwork expansion based algorithms (e.g. Dijkstra,Floyd-Warshall) compute shortest paths for each nodepair in O(nlogn + m), while the Euclidean distancecomputation for a node pair only takes O(1).

In phase 3 of NEAT, we use the Euclidean lowerbound (ELB) property to reduce the number ofshortest path computations while retrieving the "-neighborhood of a flow cluster F

i

. By ELB, theEuclidean distance d

E

(li

, lj

) between two road net-work locations l

i

and lj

is always the lower boundof the network distance d

N

(li

, lj

), i.e. the conditionof d

E

(li

, lj

) dN

(li

, lj

) is always hold. Hence, ifdE

(li

, lj

) > ", we also have dN

(li

, lj

) > ". In case ofdN

(Fi

, Fj

), instead of computing four shortest pathsdN

(a1, b1), dN

(a1, b2), dN

(a2, b1) and dN

(a2, b2), wecompute four Euclidean between those locations first.If the minimum Euclidean distance between themexceeds ", we can filter F

j

from the search spacefor "-neighborhood of F

i

. Only when the minimumEuclidean distance between those points does not ex-ceed ", we calculate dist

N

(Fi

, Fj

) using our modifiedHausdorff distance to determine whether F

j

is in the"-neighborhood of F

i

or not.

3.4 ExtensionsThere is a number of interesting extensions of NEAT.In the first prototype of NEAT, t-fragment is thedefault unit for clustering. We can also integrate auser-defined fragment into our framework to let usersdefine the granularity of fragments. For examples, auser can define the grid cells covering a road networkas the building blocks in the base cluster formationphase. Then the same merging and refinement processcan be applied on the resulting cell-based or any user-defined fragment clusters. In addition, in Phase 3,beside our modified Hausdorff distance, other dis-tance functions which introduce interesting optimiza-tion results can also be included in this phase. Theendpoints of the flow clusters can be considered asplaces of interest. Thus, we can extend our modifiedDBSCAN-like optimization phase to group the set ofendpoints of the resulting flow clusters to discoverregions of interest inherent in the given trajectorydata. Furthermore, consider that each road networklocation is recorded with a specific timestamp, anotherinteresting extension to NEAT is to consider the tem-poral information contained in the MO trajectories.We can apply NEAT to an extracted sub-trajectoryset from the given trajectories, e.g. recorded mobilitytraces in rush hours or during weekends, to discoverthe traffic patterns during different time windows (interms of hour, day, week, month or year). Finally,we also plan to parallelize the NEAT algorithmsin future development. Specifically, in Phase 1, thesame method of splitting trajectories and putting t-fragments into their associated base clusters can runfor each trajectory in parallel. In Phase 2, we would di-vide the map into partitions and perform MapReduce-like jobs to retrieve the flow clusters. The last phase

TABLE 2: Datasets used in our experiments

Datasets Number of pointsATL SJ MIA

ATL/SJ/MIA500 114878 131982 276711ATL/SJ/MIA1000 233793 255162 452224ATL/SJ/MIA2000 468738 542598 893412ATL/SJ/MIA3000 669924 794638 1302145ATL/SJ/MIA5000 1277521 1296739 2262313

is essentially traditional clustering which can adaptmany available parallelized versions of traditionalclustering algorithms such as ones in the ApacheMahout project [17].

4 EXPERIMENTAL EVALUATIONWe perform five sets of experiments to evaluate theefficiency and effectiveness of our NEAT framework.Real road networks of different sizes are used in ourexperiments. In order to analyze the performance ofeach phase, we refer to the trajectory clustering usingPhase 1 of NEAT as the base-NEAT, the trajectoryclustering using the first two phases as the flow-NEAT, and the trajectory clustering using all threephases as opt-NEAT. NEAT allows users to performtrajectory clustering using any of these three versionsof NEAT. Base clusters, flow clusters and final tra-jectory clusters are the outputs of base-NEAT, flow-NEAT and opt-NEAT respectively, and each may haveits own attraction in terms of delivering interestingtrajectory clustering results to location-based applica-tions.

4.1 Experimental SetupWe use three real road networks in our experiments(Table 1). The road networks of North West Atlanta(ATL) and West San Jose (SJ) are obtained from [18].The Miami-Dade (MIA) road network is obtainedfrom [19]. We adapt the public event-based simulatorGTMobiSIM [20] to generate thousands of mobilitytraces on those road networks for a large-scale eval-uation. We use five trajectory datasets for each of theroad networks ATL, SJ and MIA. Table 2 gives the in-formation of our synthetic datasets. To create a trajec-tory dataset, for example SJ1000, we place 1000 mobileobjects on West San Jose road network to travel underspeed limit constrained on road segments, followingshortest paths to a final destination chosen randomlyfrom a predefined set of locations. We implement ouralgorithms using Java and visualize the results usingGTMobiSIM GUI. All the experiments are conductedon the NEAT server machine with Intel Core2 DuoCPU of 2.00GHz and 1GB of main memory allocatedfor the Java heap size.

4.2 Visualization of NEAT clustering resultsWe visualize the clustering results obtained afterprocessing trajectory datasets with the two phaseNEAT approach (flow-NEAT) and with the three-phase NEAT approach (opt-NEAT). Figure 5 showsthe clustering results for ATL500 dataset. Figure 5(a)plots 500 trajectories (in green color) on North West

10

TABLE 1: Road networks used in our experiments

Regions Total length # Segments # Junctions Avg. segment length Junction degreeNorth West Atlanta, GA 1384.4km 9187 6979 150.7m avg: 2.6, max: 6

West San Jose, CA 1821.2km 14600 10929 124.7m avg: 2.7, max: 6Miami-Dade, FL 26148.3km 154681 103377 169.0m avg: 3.0, max: 9

(a) Input data: ATL500 (b) 31 flow clusters (c) 2 clusters after optimizing phase (" =6500)

Fig. 5: Results for North West Atlanta road network

7

(a) 172 flow clusters for SJ5000

8

(b) 13 clusters after optimizing phase(SJ5000, " = 1200)

(c) 33 clusters after optimizing phase(MIA3000, " = 2000)

Fig. 6: Results for West San Jose and Miami-Dade road networks

Atlanta map. After the first two phases, 31 flow clus-ters are discovered (Figure 5(b)). These flow clusterscapture all the major traffic flows from the ATL500dataset. Some traces that we see in the original datasetdisappear in Figure 5(b) since there are too few objectsmoving in them. The threshold to filter those flowclusters in this experiment is minCard=5, which isthe average number of participating trajectories ineach of the flow clusters. There are two dense regionsthat concentrates the short flows. They are the twohotspots where we place the 500 mobile objects atthe beginning of their trips. After travelling on thoseshort flows, they start merging into the long flowsto reach one of the three destinations marked with

the red X-signs on the map. These 31 flow clustersare grouped into 2 clusters as shown in Figure 5(c)after performing the density-based flow cluster refine-ment. We start the density-based optimization withthe longest representative route (the dark red polylinenumber 30 in Figure 5(b)). This route connects to oneof the two hotspots and its endpoints are close to theendpoints of the other long flows (the polylines 29,28 and 27). As defined in the DBSCAN algorithm,they are in a density-connected set. Therefore, whenwe perform density-based clustering (with the dis-tance threshold " = 6500m) we have them groupedtogether in one cluster (contains the red polylinesin Figure 5(c)). The rest of the flows are grouped

11

into another cluster (contains the gray polylines). Theresults for West San Jose road network and Miami-Dade datatsets are shown in Figure 6. For SJ5000,flow-NEAT produces 172 flows (Figure 6(a)) and opt-NEAT produces 13 clusters with " = 1200m (Figure6(b)). For MIA3000, flow-NEAT produces 300 flowsand opt-NEAT produces 33 clusters with " = 2000m.The visualization for MIA3000 (Figure 6(c)) is not asclear as the other two datasets because Miami-Daderoad network, which is an urban area, is more denseand complicated than the ATL and SJ road networks(see Table 1 and 2).

4.3 Efficiency and Effectiveness of NEATIn this section we evaluate the efficiency and effective-ness of NEAT by comparing it with the conventionaldensity-based approach, represented by TraClus. Fig-ure 7(a) shows 81 resulting clusters when applyingthe TraClus algorithm with " = 10m and MinLns= 30 to cluster ATL500 dataset. Each cluster has itsrepresentative trajectory plotted and numbered on themap. Another result of 460 clusters for ATL500 isshown in Figure 7(b) when applying Traclus with "= 1m and MinLns = 1. As we see, these clustersare discrete on the road network. Their representativetrajectories are short in lengths. These clusters onlyshow short routes in the road network where there isdense traffic. They do not provide information aboutthe traffic continuity implied in the original trajectorydataset. Recall the NEAT clustering results shown inFigure 5, we can see that most of the important routesare missed when using TraClus. An interesting pointto note is that our framework with base-NEAT canalso provide this knowledge if we filter out those baseclusters where the density is below a specific thresh-old. The remaining base clusters will represent theroad segments where traffic is highly concentrated.

Figure 8(a) and Figure 8(b) shows the comparisonsof the average and maximum lengths of the rep-resentative routes discovered using flow-NEAT andTraClus. Compared to TraClus, flow-NEAT producesclusters with longer representative routes, which arefavorable for location based applications such as busline organizing or ride sharing [9]. This results in asmaller number of clusters produced by flow-NEATas shown in Fig 8(c). In comparison, NEAT producesmore compact and meaningful results through roadnetwork aware trajectory clustering.

By utilizing the road network information, NEATnot only produces meaningful trajectory clusters butalso runs very fast. Both traffic flow and traffic densitycan be discovered using flow-NEAT without the needto compute any distance function. We only computeshortest path distances in Phase 3 for clustering refine-ment and optimize this costly operation by using ELBfilter to eliminate the unnecessary shortest path com-putation. In constrast, TraClus depends heavily on thedistance measurements among every pairs of samplesin the trajectory dataset. This makes TraClus overallvery slow as the number of samples in each trajec-tory and the number of trajectories in each datasetare high. Figure 8(d) shows the efficiency compari-son between the three-phase NEAT framework and

(a) 81 clusters for ATL500("=10m, MinLns=30)

(b) 460 clusters for ATL500("=1m, MinLns=1)

Fig. 7: TraClusTABLE 3: Number of flow clusters produced by

opt-NEATDatasets SJ500 SJ1000 SJ2000 SJ3000 SJ5000# flows 73 156 55 52 180

TraClus framework by varying the number of pointsin ATL datasets. TraClus is very time-consuming andthe time complexity grows as the number of pointsgets larger. TraClus runs in 2573.5 seconds to clus-ter ATL500 (114878 points) and 334735.1 seconds tocluster ATL5000 (1277521 points). While opt-NEATonly takes 1.29 seconds to cluster ATL500 and 59.7seconds to cluster ATL5000. Compared to TraClus, theNEAT framework is faster by more than three ordersof magnitude.

One may ask what if TraClus is given the benefitof our map-matching preprocessing step to partition atrajectory into trajectory fragments and uses a networkdistance measure such as our modified Hausdorff functionin its grouping phase? To address this concern, we haverun a variant of TraClus on our test datasets (thegraphing result is omitted due to space limit). In thisvariant of TraClus, we even provide TraClus with thepartitioning of trajectories into base clusters insteadof t-fragments, then the grouping phase merges thebase clusters using our modified Hausdorff distance.Note that the number of base clusters is usuallymuch smaller than that of t-fragments. However,TraClus remains slow compared to NEAT due totheir grouping algorithm which heavily depends ondistance computations and the resulting clusters onlyshow discrete traffic density in the road network. Forinstance, with the SJ2000 dataset (226151 t-fragments,901 base clusters), this variant of TraClus took 6396.79seconds to finish with 117 resulting clusters. WhileNEAT produced a more compact results of 42 flowclusters and 14 final clusters in only 11.68 seconds.

4.4 Performance of NEAT AlgorithmsWe analyze the efficiency of NEAT by analyzing theperformance of different versions of NEAT during thethree phases, focusing on the impact of the flow prop-erty of traffic streams in NEAT design. The scaling ofbase-NEAT, flow-NEAT and opt-NEAT for differentMIA datasets are shown in Figure 9(a). The curvesare almost linear with the growth of dataset size. In

1231 37 42

7949

81 84

167

291

212

0

50

100

150

200

250

300

350

ATL500 ATL1000 ATL2000 ATL3000 ATL5000

Num

ber o

f clu

ster

s

NEAT traclus

0

500

1000

1500

2000

2500

ATL500 ATL2000 ATL5000

met

ers

NEAT TRACLUS

(a) Average length

0

5000

10000

15000

20000

25000

ATL500 ATL2000 ATL5000meters

NEAT TRACLUS

(b) Maximum length

0

5000

10000

15000

20000

25000

ATL500 ATL2000 ATL5000

met

ers

NEAT TRACLUS

31 37 4279

4981 84

167

291

212

0

50

100

150

200

250

300

350

Num

ber o

f clu

ster

s

NEAT traclus

(c) Numbers of resultingclusters

1

10

100

1000

10000

100000

114878 233793 468738 669924 1.27752e+006

Seco

nds

(log-

scal

e)

Number of Points

NEATTraClus

(d) Running times

Fig. 8: Comparison of flow-NEAT and TraClus (ATL datasets)

0

10

20

30

40

50

60

70

0 1000 2000 3000 4000 5000

Seco

nds

Number of Trajectories

base-NEAT

flow-NEAT

opt-NEAT

0

5

10

15

20

25

30

35

40

0 1000 2000 3000 4000 5000

Seco

nds


base-NEAT

flow-NEAT

opt-NEAT

0

50

100

150

200

250

0 1000 2000 3000 4000 5000

Seco

nds


base-NEATflow-NEATopt-NEAT

(a) MIA datasets

0

20

40

60

80

100

120

Second

s

Base Cluster Formation

Flow Cluster Formation

(b) Relative performance ofbase cluster formation andflow cluster formation

Fig. 9: Relative performance of our approaches

0

10

20

30

40

50

60

70

80

90

0 1000 2000 3000 4000 5000

Seco

nds


opt-NEAT-Dijkstra

opt-NEAT using ELB

0

50

100

150

200

250

300

350

400

450

0 1000 2000 3000 4000 5000

Seco

nds

Number of Trajectories )

opt-NEAT-Dijskstra

opt-NEAT using ELB

(a) ATL datasets

0

10

20

30

40

50

60

70

80

90

0 1000 2000 3000 4000 5000

Seco

nds


opt-NEAT-Dijkstra

opt-NEAT using ELB

0

50

100

150

200

250

300

350

400

450

0 1000 2000 3000 4000 5000

Seco

nds


opt-NEAT-Dijskstra

opt-NEAT using ELB

(b) SJ datasets

Fig. 10: Effectiveness of using Euclidean lower bound

all cases, the flow cluster refinement phase contributesvery little to the total running time, as shown in thegraphs, due to the ELB based optimization. The opt-NEAT curves nearly overlap the flow-NEAT curves.

We further investigate the relative performance ofPhase 1 (base cluster formation) and Phase 2 (flowcluster formation). Phase 1 algorithm takes the roadnetwork locations as its data units because it scans thesequence of points in all the trajectories to extract t-fragments. Phase 2 algorithm takes the base cluster asits data unit. Intuitively, the number of road networklocations in the trajectory dataset is much larger thanthe number of base clusters produced from Phase 1.Thus, it takes longer to complete Phase 1. This isconfirmed by our experiments shown in Fig 9(b).

Figure 10 unveils the effectiveness of using ELBto reduce the number of shortest path computations(opt-NEAT-ELB) versus using Dijkstra’s network ex-pansion algorithm to compute all the shortest paths(opt-NEAT-Dijskstra) when performing density-basedoptimization in the NEAT framework. In Figure 10(a),

the opt-NEAT-Dijskstra curve goes up faster as thedataset size grows. However, the curve of opt-NEAT-Dijkstra in Figure 10(b) shows that the cost at SJ1000are much higher than at SJ2000 and SJ3000. This is dueto the cost of Phase 3, which computes shortest pathdistances, actually depends on the number of flowsproduced by Phase 2 and not the data size. Table 3shows the number of resulting flows output from thesecond phase where numbers of flows in SJ1000 andSJ5000 are much higher than that in other datasetsfor SJ road network. Using ELB significantly speedsup the performance of the density-based optimizationalgorithm (see some big gaps between the two curvesopt-NEAT-ELB and opt-NEAT-Dijkstra).

4.5 Effects of f -neighbor Merging Decisions onFlow-based ClusteringWe study how using different merging schemes in-cluding “random SF

max

” (“random” for short), “look-back” and “look-ahead” schemes, as described inSection 3.2.3, affects the results of flow-NEAT. Wemeasure the cluster quality by computing the totalnetflows of the resulting flow clusters. The resultsare reported for ATL and SJ datasets in Figure 11including the running time (Figures 11(a) and 11(b))and total netflows (Figures 11(c) and 11(d)) of eachscheme. It is shown that in most cases, look-headmerging runs faster than look-back scheme and ran-dom merging is the fastest. Although the cost in-curred by running look-ahead or look-back schemescompared to random merging is not significant, thetotal netflows of the resulting flow clusters in all threeschemes are approximately the same. This resultsfrom the low junction degree of road networks wherethe average junction degree is usually less than 3.We conclude that in the context of a road network,“random SF

max

” merging is good enough to getgood clustering result and thus, NEAT uses “randomSF

max

” merging in Phase 2.

4.6 Effectiveness of Adaptive Weight Assignmenton Flow-based ClusteringIn this section, we evaluate the effectiveness of ouradaptive weight assignment scheme (Section 3.2.4) com-pared to some predefined combinations of the weights(w

q

, wk

, wv

) for flow factor, density factor and speedlimit factor respectively. The different assignments of

13

0

2

4

6

8

10

12

14

16

0 1000 2000 3000 4000 5000

Seco

nds

Number of Trajectories (ATL datasets)

random

look-back

look-ahead

0

100000

200000

300000

400000

500000

600000


Tota

l net

flow

s

random

look-back

look-ahead

(a)

0

2

4

6

8

10

12

14

0 1000 2000 3000 4000 5000

Seco

nds

Number of Trajectories (SJ datasets)

random

look-back

look-ahead

0

100000

200000

300000

400000

500000

600000

SJ500 SJ1000 SJ2000 SJ3000 SJ5000

Tota

l net

flow

s

random

look-back

look-ahead

(b)

0

2

4

6

8

10

12

14

16

0 1000 2000 3000 4000 5000

Seco

nds

Number of Trajectories (ATL datasets)

random

look-back

look-ahead

0

100000

200000

300000

400000

500000

600000


Tota

l net

flow

s

random

look-back

look-ahead

(c)

0

2

4

6

8

10

12

14

0 1000 2000 3000 4000 5000

Seco

nds

Number of Trajectories (SJ datasets)

random

look-back

look-ahead

0

100000

200000

300000

400000

500000

600000

SJ500 SJ1000 SJ2000 SJ3000 SJ5000

Tota

l net

flow

s

random

look-back

look-ahead

(d)

Fig. 11: Effects of using random, look-back and look-ahead f -neighbor merging schemes

TABLE 4: Total netflows

(1/2,1/2,0) adaptive (1,0,0)ATL3000 225713 271782 271986

SJ3000 275665 326723 326870MIA3000 812505 921476 921910

TABLE 5: Total cluster density

(1/2,1/2,0) adaptive (0,1,0)ATL3000 225713 374234 374268

SJ3000 451783 475508 475536MIA3000 1241101 1276109 1277511

(wq

, wk

, wv

) represent different merging criteria whenwe perform f -neighbor merging. In the previous ex-periments, we want to focus on the flow property of atraffic stream so we set the highest weight for the flowfactor (w

q

= 1 and wk

= wv

= 0), i.e. a base clusterwill always be merged with its maxFlow-neighbor.Since the speed limit is fixed for each type of road,the speed limit factor is the least favorable amongthree factors. In our experiment with the adaptiveweight assignment, we skip the speed limit factor bysetting v

max

value at each merging step to 0. We runflow-NEAT with this “adaptive weights” scheme andthree predefined weight settings of (1, 0, 0), (0, 1, 0)and (1/2, 1/2, 0). The (1/2, 1/2, 0) weight assignmentis used as a base setting where both flow and densityare weighed equally. We measure the total netflows ofthe discovered flow clusters by running flow-NEATusing the “adaptive weights” scheme, compared tousing the “netflows only” (1, 0, 0) weight setting. Theresults are reported in Table 4. For ATL3000 dataset,our adaptive weights scheme achieves an increase oftotal netflows of 18.52% compared to the base settings(1/2, 1/2, 0), while that of the (1, 0, 0) setting is 20.41%.Similarly for SJ3000 and MIA3000, the percentageof total netflows increased using adaptive weightsscheme is also so close to that of (1, 0, 0) setting.With the same three datasets, Table 5 reports thetotal cluster density of the discovered flow clustersby running flow-NEAT using the “adaptive weights”scheme, compared to using the “density only” (0, 1, 0)weight setting and (1/2, 1/2, 0) setting. In all cases, thepercentage of total cluster density increased using ouradaptive weights scheme compared to the base settingof (1/2, 1/2, 0) is very close to that of the (1, 0, 0)setting (the difference is below 0.1%). Therefore, ouradaptive weight assignment scheme can automati-cally captures highly dense and continuous trafficflow from trajectories in the road network.

5 RELATED WORK

The clustering problem has been extensively re-searched in mobile ad hoc networks (MONET) anddata streaming systems. In MONET, data unit to beclustered is the mobility node where the characteris-tics of a node are taken into account for cluster headchoices in order to conserve energy and connectivity[21] [22] [23]. In data streams, k-means and density-based clustering algorithms have been extended tocluster large volume of multi-dimensional data pointsgenerated by sensor networks [24] [25] [26].

Most existing work on MO trajectory clustering [13][27] [28] [29] [30] [31] [32] [33] has derived proximitymeasures for trajectories and adapted traditional k-means, hierarchical, or density-based clustering al-gorithms to group similar trajectories. Trajectory-OPTICS [27], which extends OPTICS algorithm [12],is a good example for grouping similar trajectoriesas a whole. The distance between two trajectories isthe average Euclidean distance between two objectsfor every timestamp. Traclus [13] aims to find similarsub-trajectories rather than the whole trajectories. Itpartitions each trajectory into line segments usingthe Minimum Distance Length (MDL) principle andthen performs a DBSCAN-like [11] clustering on linesegments. The similarity measure is composed ofthree Euclidean based distance components betweenline segments. As a result, discovered clusters aredense regions of line segments. [31] adapts TraClusfor online trajectory clustering. [32] extends TraClusfor trajectory classification. However, these density-based methods cluster free space trajectories, i.e. with-out considering the constrained road network. Otherworks [28] [29] [30] consider the network constraint toderive similarity measures but they only focus on thedensity aspect of the given trajectories such as objectdensity [29], common road segments [28], shortestpath distance [30]. Our approach can produce par-tial clustering but carefully considers the constrainedroad network focusing on both flow and densitycharacteristics and avoids the expensive shortest pathcomputation in its first two phases. NEAT innovatesTraClus framework in a creative manner with higherefficiency (due to reduction of network distance com-putation) and higher accuracy (due to incorporationof flow semantics). This paper provides a full-fledgeddevelopment of the initial NEAT approach [34] with athorough study on different merging decisions and anadaptive parameter assignment scheme capturing themost critical characteristics of a traffic stream in the

14

core of NEAT, which allows it to discover importantflow clusters in an automated manner.

6 CONCLUSIONWe have presented NEAT - a novel road-networkaware approach to MO trajectory clustering whichclusters MO trajectories in a comprehensive threephase framework. We introduce the concept of f -neighborhood to identify the most critical and mostinteresting parts of the given MO trajectories wrt.clustering. Instead of taking points or line segmentsas the clustering unit as in traditional approaches, weintroduce t-fragment, base cluster and flow cluster asthe basic building blocks for road-network aware tra-jectory clustering. NEAT carefully combines the trafficlocality, flow and density metrics in its three-phasetrajectory clustering framework which significantlyreduces the data space in each subsequent phase andensures trajectory clustering quality. By carrying outextensive experiments, we show that NEAT discoversclusters of MO trajectories which represent majortraffic stream in a road network and outperformsconventional density-based trajectory clustering algo-rithms in terms of both time complexity and accuracy.

ACKNOWLEDGMENTThis work is partially sponsored by grants from NSFCISE NetSE and CrossCutting program, an IBM SURgrant, an IBM faculty award and Intel ISTC.

REFERENCES[1] B. Hull et al., “CarTel: A Distributed Mobile Sensor Computing

System,” in Proc. SenSys’06.[2] V. P. Chakka et al., “Indexing large trajectory data sets with

SETI,” in Proc. CIDR’03.[3] S. Brakatsoulas et al., “Practical data management techniques

for vehicle tracking data,” in Proc. ICDE’05.[4] V. Almeida et al., “Indexing the trajectories of moving objects

in networks,” GeoInformatica, vol. 9, 2005.[5] M. Haridasan et al., “Startrack next generation: A scalable

infrastructure for track-based applications,” in Proc. OSDI’10.[6] J. Jeong et al., “TSF: Trajectory-based statistical forwarding for

infrastructure-to-vehicle data delivery in vehicular networks.”in Proc. ICDCS’10.

[7] H. Qin et al., “An integrated network of roadside sensors andvehicles for driving safety: Concept, design and experiments.”in Proc. PerCom’10.

[8] P. Mohan et al., “Nericell: Rich monitoring of road and trafficconditions using mobile smartphones,” in Proc. SenSys’08.

[9] E. Kamar and E. Horvitz, “Collaboration and shared plans inthe open world: studies of ridesharing,” in Proc. IJCAI’09.

[10] The ELBA project, http://www.e-lba.com.[11] M. Ester et al., “A density-based algorithm for discover-

ing clusters in large spatial databases with noise,” in Proc.SIGKDD’96.

[12] M. Ankerst et al., “Optics: ordering points to identify theclustering structure,” in Proc. SIGMOD’99.

[13] J. gil Lee et al., “Trajectory clustering: a partition-and-groupframework,” in Proc. SIGMOD’07.

[14] M. L. Yiu and N. Mamoulis, “Clustering objects on a spatialnetwork,” in Proc. SIGMOD’04.

[15] M. Weber et al., “On map matching of wireless positioningdata: a selective look-ahead approach,” in Proc. GIS’10.

[16] M. Treiber and A. Kesting, Traffic Flow Dynamics: Data, Modelsand Simulation. Springer, 2012.

[17] http://www.mahout.apache.org/.[18] USGS data (.svg), http://edc2.usgs.gov/geodata/index.php.[19] TIGER/Line, http://www.census.gov/geo/www/tiger.[20] GTMobiSIM, http://code.google.com/p/gt-mobisim/.[21] M. Chatterjee et al., “WCA: A weighted clustering algorithm

for mobile ad hoc networks,” Cluster Computing, vol. 5, 2002.

[22] Y.Wang et al., “An entropy-based weighted clustering algo-rithm and its optimization for ad hoc networks.” in Proc.WiMob’07.

[23] D. Wei et al., “Clustering ad hoc networks: Schemes andclassifications,” SECON’06, vol. 3, 2006.

[24] C. Aggarwal et al., “A framework for clustering evolving datastreams,” in Proc. VLDB’03.

[25] Y. Chen et al., “Density-based clustering for real-time streamdata,” in Proc. KDD’07.

[26] P. Rodrigues et al., “Clustering distributed sensor datastreams,” in Proc. ECML PKDD’08.

[27] M. Nanni et al., “Time-focused clustering of trajectories ofmoving objects,” J. Intell. Inf. Syst., vol. 27, 2006.

[28] Jung-Im Won et al., “Trajectory clustering in road networkenvironment.” in Proc. CIDM’09.

[29] A. Kharrat et al., “Clustering algorithm for network constrainttrajectories.” in Proc. SDH’08.

[30] G. Roh et al., “NNCluster: An efficient clustering algorithmfor road network trajectories.” in Proc. DASFAA’10.

[31] Z. Li et al., “Incremental clustering for trajectories,” in Proc.DASFAA’10.

[32] J. Lee et al., “Traclass: Trajectory classification using hierar-chical region-based and trajectory-based clustering,” in Proc.VLDB’08.

[33] G. Ananthanarayanan et al., “Startrack: A framework forenabling track-based applications,” in Proc. MobiSys’09.

[34] B. Han et al., “Neat: Road network aware trajectory cluster-ing,” in Proc. ICDCS ’12.

Binh Han received the B.Sc degree in Com-puter Science from Hanoi University of Sci-ence and Technology, Vietnam. She is cur-rently a Ph.D. candidate in the School ofComputer Science at Georgia Tech. Sheis working in the Distributed Data IntensiveSystems Lab. Her research interests spanmobile data management, parallel databasesystems and cloud computing.

Ling Liu is a full professor in the School ofComputer Science at Georgia Tech. There,she directs the research programs in the Dis-tributed Data Intensive Systems Lab (DiSL),examining various aspects of data inten-sive systems with a focus on performance,availability, security, privacy, and energy effi-ciency. She was a recipient of the best paperaward at ICDCS 2003, WWW 2004, and the2008 International Conference on SoftwareEngineering and Data Engineering, and also

won the 2005 Pat Goldberg Memorial Best Paper Award. She hasserved as the general chair or PC chair for numerous IEEE and ACMconferences in the data engineering, distributed computing, servicecomputing, and cloud computing fields and is a coeditor-in-chief ofthe five-volume Encyclopedia of Database Systems (Springer). Cur-rently, she is on the editorial boards of several international journals,such as Distributed and Parallel Databases (DAPD, Springer), theJournal of Parallel and Distributed Computing (JPDC), ACM Trans-actions on Web, IEEE Transactions on Service Computing (TSC),and Wireless Network (WINET, Springer). Her current research isprimarily sponsored by the US National Science Foundation, IBM,and Intel. She is a senior member of the IEEE.

Edward Omiecinski received the PhD de-gree from Northwestern University in 1984.He is currently an Associate Professor atGeorgia Tech in the College of Computing.He has published more than 60 papers ininternational journals and conferences deal-ing with database systems. His research hasbeen funded by NSF, DARPA and NLM. Hiscurrently funded work deals with the dis-covery of knowledge in cardiac imagebases,which is a collaborative effort between Geor-

gia Tech and Emory University researchers. He is a member of theACM and IEEE Computer Society.

http://www.e-lba.com

http://www.mahout.apache.org/

http://edc2.usgs.gov/geodata/index.php

http://www.census.gov/geo/www/tiger

http://code.google.com/p/gt-mobisim/

road-network aware trajectory clustering: integrating …lingliu/papers/2014/neat-tmc.pdf1...

Documents