mining signiﬁcant semantic locations from gps datavldb2010/proceedings/files/papers/r… ·...

Mining Significant Semantic Locations From GPS Data

Xin Cao† Gao Cong† Christian S. Jensen‡

†School of Computer Engineering, Nanyang Technological University, [email protected], [email protected]

‡Department of Computer Science, Aarhus University, [email protected]

ABSTRACTWith the increasing deployment and use of GPS-enabled devices,massive amounts of GPS data are becoming available. We pro-pose a general framework for the mining of semantically meaning-ful, significant locations, e.g., shopping malls and restaurants, fromsuch data.

We present techniques capable of extracting semantic locationsfrom GPS data. We capture the relationships between locationsand between locations and users with a graph. Significance is thenassigned to locations using random walks over the graph that prop-agates significance among the locations. In doing so, mutual re-inforcement between location significance and user authority is ex-ploited for determining significance, as are aspects such as the num-ber of visits to a location, the durations of the visits, and the dis-tances users travel to reach locations. Studies using up to 100 mil-lion GPS records from a confined spatio-temporal region demon-strate that the proposal is effective and is capable of outperformingbaseline methods and an extension of an existing proposal.

1. INTRODUCTIONMobile devices, e.g., phones and navigation systems, with built-

in GPS (Global Positioning System) receivers are capable of gen-erating massive amounts of GPS records that capture geo-location,time, and a number of other attributes such as heading and speed.

The GPS records from a moving object approximate the object’strajectory, which is a continuous function from time to space. Tra-jectories capture how moving objects use geographical space. Forexample, they capture the locations that the objects visit along withthe durations of the visits. When projecting a trajectory onto space,a route results.

This paper develops techniques that exploit massive amounts ofGPS records collected from multiple users for identifying top-k sig-nificant semantic locations. In particular, the aim is to identify lo-cations that are semantically meaningful to users, e.g., shoppingmalls, restaurants, or tourist attractions, rather than simply identi-fying raw geographical coordinates.

Automatic extraction of meaningful stay locations and assign-ing significance to these are fundamentally important and useful.

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee. Articles from this volume were presented at The36th International Conference on Very Large Data Bases, September 13-17,2010, Singapore.Proceedings of the VLDB Endowment, Vol. 3, No. 1Copyright 2010 VLDB Endowment 2150-8097/10/09... $ 10.00.

For example, the outcome can serve as travel recommendationsthat take into account location significance and the distance fromthe user to a location. This is somewhat analogous to web querieswhere web pages are ranked in terms of their inherent importanceas well as their relevance to a query: location significance playsthe role of web page importance, and the distance from a user to alocation plays the role of the relevance of a web page to a query.Another example is that the outcome can be combined with the so-called location-aware keyword query [6].

Top-k hot semantic locations can be computed with regard toa specific context such as a certain user group (e.g., teens, seniorcitizens, tourists) or a certain time period (e.g., a vacation period ora time of day). This may be done by exploiting different GPS datafor different contexts.

GPS data is already attracting attention. Scores of communities-based web sites, e.g., targeting hikers, bikers, and cyclists, enableusers to share routes and trajectories. However, these sites simplyfocus on the sharing of trajectories. Some recent studies [2, 10, 24]consider the extraction of locations from GPS records, but do notconsider location significance.

Notably, Zheng et al. [23] pioneer the extraction and ranking oflocations from GPS data. We build on this work by consideringsemantic rather than raw locations. Semantic locations must beconsidered throughout to obtain the best possible results. For ex-ample, this avoids situations where different raw locations that turnout to represent the same semantic location are ranked as separatelocations; and it avoids cases where a single raw location representsmore than one semantic location. A post-processing step cannot fixsuch problems.

It is challenging to propose a model for location ranking that iscapable of exploiting multiple factors. The mutual reinforcementof user authority and location significance is considered by Zhenget al. However, as their approach uses the HITS algorithm [11],improper weights may be assigned to links, as discussed by Bharatand Henzinger [5] (to be explained in Section 3.3.2). Our model isable to exploit this mutual reinforcement relationship better. Moresignificantly, our framework is capable of systematically exploit-ing other factors such as the relationships between locations, thedistances between locations, and the durations of visits.

Our framework includes techniques for extracting semantic lo-cations from GPS records. This problem is nontrivial because GPSrecords with different coordinates may represent the same semanticlocation. Thus, we group stay points extracted from GPS recordssuch that each group represents a unique real-world location. Weleverage existing clustering algorithms for the initial grouping, andwe enhance the clustering results by taking into account the pat-terns of visits to the clusters and the similarity of the semantics ofthe clusters obtained from a reverse geocoder and yellow pages.

1009

Next, we propose a new model for location ranking that is capa-ble of propagating significance between locations, supports mutualreinforcement between location significance and user authority, andis capable of exploiting factors such as the number of visits to a lo-cation and the durations of the visits.

To achieve this, we model the connections between locationsand also the connections between locations and users using a two-layered graph with location-location and user-location components.Intuitively, if a location, e.g., a hotel, is connected via trajectories tomany other significant locations, e.g., restaurants, the hotel is alsoconsidered as significant. Further, locations visited by authoritativeusers are considered as more significant; and users gain authorityby visiting significant locations [23].

A PageRank-like model [18] can capture the location-locationinteractions, and a HITS-like model [11] can capture the reinforce-ment between users and locations. However, they cannot captureboth the location-location and the user-location components. Wepropose a new model that is able to accommodate both.

Intuitively, a location visited by a user from a place far away ismore likely to be more important than a place visited from a nearbylocation; and the longer the durations of the stays at a location,the more important the locations is considered to be. Our modelaccommodates these aspects.

In summary, the paper’s contributions are threefold. First, wepropose new techniques for extracting semantic locations from GPSdata. Second, we propose a new model for assigning significance tothe extracted locations. Third, we report on empirical studies withlarge quantities of GPS data that suggest that the proposed frame-work is (i) capable of improving the abilities of the OPTICS [1] andK-means clustering algorithms to extract semantic locations and is(ii) capable of significantly outperforming several baseline meth-ods, including rank-by-visits, rank-by-durations, and an extensionof an existing approach.

Section 2 states the problem addressed. Section 3 details theproposed location mining framework, including the extraction ofstays and their clustering into semantic locations, the constructionof the link structures connecting users and locations, and the rank-ing of locations using random walk models. Section 4 reports onthe experimental study. Section 5 concludes and discusses researchdirections. An appendix contains supplementary matter.

2. PROBLEM STATEMENTWe proceed to cover the data model used in this paper and the

problem addressed. Appendix A delves into application scenarios.

2.1 Data ModelWhile the proposed framework applies to GPS data in general,

we assume that the GPS data represents vehicle movements. Forspecificity, we assume a sampling frequency of one record per sec-ond when a vehicle moves.

DEFINITION 1. GPS record: A GPS record G is a five-tuple:〈u, t, x, y, s〉, where u is the ID of the user for which G is recorded,t is the timestamp, x and y are the Euclidean coordinates, and s isthe vehicle’s instant speed as reported by the GPS device.

An example is 〈1715, 2007-07-13 17:20:36, 544456, 6335497,0〉 where the coordinates are given in the UTM (Universal Trans-verse Mercator) coordinate system.

By ordering the records from a user by the timestamp t, we ob-tain a representation of the user’s trajectory.

DEFINITION 2. Trajectory: A trajectory T R of a user is a se-quence of GPS records G for the user that are ordered by the times-tamp t of the records, T R = G1 → · · · → Gi → · · · → Gn.

A trajectory describes the movement of a user. The original GPSrecords can be seen as the set of the trajectories of all the users,denoted as ST R. From a trajectory, we can compute the sequenceof locations visited by the corresponding user as well as how longthe user stays in the locations.

We are interested in locations where users have stayed for longerthan some predefined duration, as such locations are more likely tobe interesting. We extract all such stays from ST R.

DEFINITION 3. Stay point: A stay point P is a pair 〈G, ∆t〉,where G is the ID of a GPS record that represents the stay and ∆tis the duration of the stay.

Each time a user stays at a location, we can obtain a correspond-ing stay point. Thus, a location visited multiple times by one ormultiple users obtains multiple stay points. Stay points for the samereal-world location almost certainly contain different coordinates,although they are close to each other. The challenge is to deriveconsolidated locations from a collection of stay points.

Ideally, each consolidated location corresponds to a unique real-world location, e.g., a shopping mall or a tourist attraction.

DEFINITION 4. Semantic Location: A semantic locationL is acluster of stay points and is represented by a four-tuple 〈x, y, sem,sl〉, where x and y are the centroid of the cluster, sem is the seman-tics of the location, and sl is a set of stay points associated with thesemantic location. We denote the set of all extracted semantic lo-cations as SL

For example, semantic location 〈555122, 6321850, “AalborgZoo”, {864, 1354, 57720, . . .} 〉 has centroid (555122, 6321850),“Aalborg Zoo” as the semantics, and {864, 1354, 57720, . . .} asGPS record identifiers that represent stay points.

We are now ready to define the location history of a user.DEFINITION 5. Location History: The location history H of a

user is defined as a sequence H = L1 → · · · → Li → · · · → Ln

of semantic locations L.

Given a set of GPS trajectories ST R we can build the set of allusers’ location histories, SH.

2.2 Problem CharacterizationWe aim to identify and assign significance to semantic locations

based on large amounts of GPS records.Just as web queries issued to a search engine do not have ground

truth results, neither do the results for our problem. Rather, wecombine a range of indicators of “interestingness” in order to ob-tain significant locations. Specifically, we consider the followingindicators of the significance of a location: (1) the number of vis-its, (2) the durations of the visits, and (3) the distances users travelto visit locations.

Such indicators may be combined in various ways using loca-tion histories. Specifically, various kinds of reinforcement may beapplied. Thus, locations that are visited together with significantlocations may be assigned increased significance. And users whotend to visit significant locations may be assigned increased author-ity, which may then be used for assigning increased significance tolocations visited by these users.

While a framework that is capable of taking into account manysuch indicators is likely to have advantages over less capable frame-works, the interpretation of a result by a user is subjective. Differ-ent users will have different interests and preferences. We thus donot distinguish among “significant,” “hot,” “interesting,” “attrac-tive,” and “popular” as predicates of locations.

By considering location histories selectively, results with differ-ent meanings can be obtained. For example, by only considering

1010

GPS data from tourists, or international tourists, it is possible to tar-get interesting locations at specific user groups. This filtering maybe done based on demographic data about the users who contributethe data, which may be available to, e.g., rental car and insurancecompanies. While important, this is orthogonal to the paper’s con-tribution.

We also note that while some trips may carry little meaning,the aggregate results obtained from very large numbers of trips arelikely to be robust to such noise and thus yield meaningful results.This is analogous to the use of clicks or links for the assignment ofimportance to web pages.

Given a set of GPS records, we first pre-process the data to obtainthe set of trajectories, ST R. We address two sub-problems:

1. Development of techniques that are capable of identifying se-mantically meaningful locations SL from GPS data. This involvesextracting the set of semantic locations SL from the GPS trajec-tories ST R, and building the location history H for each user toget the location history set SH = {H1, . . . ,Hm}, where m is thenumber of users.

2. Development of techniques capable of assigning significanceto locations. We use the location histories SH, to compute the top-ksemantic locations ranked according to significance.

3. PROPOSED SOLUTIONFollowing a solution overview in Section 3.1, we cover the ex-

traction of semantic locations in Section 3.2 and the ranking of lo-cations in Section 3.3.

3.1 Solution OverviewWe first extract stay points from the GPS records. We then group

stay points such that each group represents a real-world location.We utilize street addresses, semantics, etc. to improve two existingclustering algorithms, OPTICS [1] and K-means [21].

We then form location histories and use these for constructing atwo-layered graph that models connections between users and lo-cations, and connections between locations. An example of such agraph is given in Figure 1. The graph has a user layer, where nodesrepresent users, and a location layer, where nodes represent loca-tions. An edge exists between two locations if a user has traveledbetween them, and an edge exists between a user and a location ifthe user has visited the location.

Figure 1: Example two-layered graph of users and locations

As discussed in Section 2.2, we use novel indicators of loca-tion significance. Specifically, we propose a new model that en-ables the propagation of significance among locations and the re-inforcement between user authority and location significance usingthe two-layered graph.

Our setting differs from the standard web setting in importantrespects. Specifically, 1) our graph has two levels while the webgraph has one, making it resemble the location-location graph, which

however, is undirected while the web graph is directed; 2) the webgraph is unweighted while we use weights on the location-locationedges (to capture the number of trips between locations; 3) ourgraph is spatial, meaning, e.g., that the willingness of a user totravel a long distance to reach a location indicates high signifi-cance; no such distance notion exists on the Internet; and 4) theuser-location graph is weighted (to capture the durations of visits).

3.2 Extracting Location HistoriesWe cover the extraction of stay points in Appendix B and the

extraction of semantic locations from stay points next.We aim at extracting locations from stay points such that the lo-

cations are meaningful to users. One idea is to apply a reversegeocoder to the stay points. A reverse geocoder attempts to find theclosest addressable location within a certain distance. The resultreturned for a stay point includes the street address and the coordi-nates of the street address. When two stay points are mapped to thesame addressable location, it is likely that they represent the samesemantic location.

However, the reverse geocoding of a stay point involves a callto an external API (in or case, Google Maps), which is very timeconsuming when there are many stay points. Additionally, evenif two stay points are reverse geocoded differently, they may stillrepresent the same semantic location. This suggests that we shouldexploit other information when extracting semantic locations.

We thus extract the semantic locations in two steps. We first clus-ter the stay points. We then enhance the clustering by taking ontoaccount a variety of additional information, to be detailed shortly.A final cluster is expected to represent a single, unique semanticlocation.

For the first step, we use the existing clustering algorithms OP-TICS and K-means. For the second step, we repeatedly enhancethe results obtained by utilizing the street addresses, the seman-tics, and the user visit patterns. We denote the resulting method asSEM-CLS (semantics-enhanced clustering). The method involvestwo steps: split and merge.

In the split step, we sample points from each cluster, reversegeocode them to obtain street addresses, and then obtain their se-mantics using a yellow pages directory; if the sample points in acluster have different semantics, we split the cluster since it maycontain multiple semantic locations.

In the merge step, we merge the clusters that may represent thesame semantic location. In order to determine when to merge twoclusters, we represent each cluster by a four-tuple (~lu, tas, tae, ~ls)

of features: the user list vector ~lu, which contains the IDs of userswho have visited the cluster together with the number of visits byeach user; the average stay duration tas, which is the average du-ration of all the visits; the average entry time (time of a day) tae,which is the average starting time of all the visits to the cluster; andthe semantic list vector ~ls, which contains all the semantics thatmay apply to the cluster.

The visitors, the stay duration, and the entry time reveal patternsof the users who have visited the location represented by a cluster.If two nearby locations exhibit similar visiting patterns, the twolocations are possibly the same. We thus compute the similarityof two clusters based on the four features. First, the vector spacemodel is used to compute the similarity between two visitor listsand also the similarity between two semantic lists. The similarity ofthe average stay duration is computed as the smaller one divided bythe larger one, and the same is done for the average entry time. Thefinal similarity score of two clusters is computed by a simple linearcombination of the four scores. Formally, the similarity betweentwo clusters c1 and c2 is computed as follows:

1011

Sim(c1, c2) =~lu1 · ~lu2

‖ ~lu1‖‖ ~lu2‖+

~ls1 · ~ls2

‖ ~ls1‖‖ ~ls2‖+

min(tas1, tas2)

max(tas1, tas2)+

min(tae1, tae2)

max(tae1, tae2)

(1)

3.3 Ranking of LocationsWe first cover the construction of the two-layered graph. We then

present several baseline location ranking methods in Section 3.3.2,and we present the new location ranking models.

3.3.1 Two-Layered GraphThe two-layered graph consists of two inter-connected sub graphs,

a user-location graph and a location-location graph.DEFINITION 6. User-location graph: The user-location graph

is a weighted undirected bipartite graphGUL = (U,V,EUL,WUL),where U is a set of nodes that represent users, V is a set of nodesthat represent locations, EUL is a set of edges that represent vis-its, and the edge weights WUL describe the numbers of visits to alocation by a user.

Given m users and n locations, we build an m × n adjacencymatrix M forGUL. Formally, M = [vij ], 0 ≤ i < m, 0 ≤ j < n,where vij represents how many times the ith user has visited thejth location.

DEFINITION 7. Location-location graph: The location-locationgraph is a weighted undirected graph GLL = (V,ELL,WLL),where V is as in the previous definition, ELL is the set of edgesthat represent trips, and the weights WLL of edges describe thenumbers of transitions between locations.

Given n locations, we define an n × n adjacency matrix C forGLL. Formally, C = [cij ], 0 ≤ i, j < n, where cij represents thetimes that a user has driven between the ith and the jth location.

Let the location histories of the three users in Figure 1 beHU1 ={L2 → L1 → L4,L2 → L4}, HU2 = {L5 → L2,L5 →L1,L2 → L1, }, HU3 = {L3 → L5}, and HU4 = {L7 →L6,L3 → L6,L2 → L7 → L6}. Then:

M =

1 2 0 2 0 0 02 2 0 0 2 0 00 0 1 0 1 0 00 1 1 0 0 3 2

C =

0 2 0 1 1 0 02 0 0 1 1 0 10 0 0 0 1 1 01 1 0 0 0 0 01 1 1 0 0 0 00 0 1 0 0 0 20 1 0 0 0 2 0

We notice that for each user, the number of visits to the user’shome may be very large. Because we aim at mining hot public lo-cations and to preserve the users’ privacy, locations such as a user’shome are removed. We consider a location as a home location if itis visited frequently by the same user at night.

3.3.2 Baseline MethodsWe consider three baseline methods, two of which are presented

in Appendix C, namely rank-by-visits (which is also used as base-lines in the work by Zheng et al. [23]), and rank-by-durations.

The idea behind the third approach [23], which we cover as abaseline, is to take into account the authority of users instead oftreating all users equally. It is thus assumed that hot locationsare visited by more authoritative users, and that authoritative usersvisit more interesting locations. The approach applies the HITSmodel [11] on GUL to find authoritative users and interesting lo-cations. Using HITS, the approach computes a hub score and anauthority score for each node. The hub scores of users show their

travel experience, and the authority scores of locations show thesignificance of the locations. Additional descriptions of this base-line are given in Appendix C.

Although the approach aims to exploit user authority for locationranking, the model used in the approach limits the performance ofthe approach. If a location is only visited by one user, but manytimes, the HITS algorithm used assigns a very high authority scoreto the location and also a very high hub score to the user. How-ever, a location visited only by a single user or very few users isintuitively not particularly significant. Section 4 offers empiricalinsight into this problem. Recall also that approach considers rawlocations rather than semantic locations.

In the example graph, the final authority vector is {0.1651, 0.2676,0.0661, 0.0935, 0.1286, 0.1675, 0.1117}, and the ranked list is{2, 6, 1, 5, 7, 4, 3}. The effect of weights can be seen moreclearly in the following example: if the user U4 makes the tripL7 → L6 twice, the result becomes very different. The final au-thority vector becomes {0.0506, 0.1557, 0.0946, 0.0304, 0,0404,0.3589, 0.2692}, and the ranked list becomes {6, 7, 2, 3, 1, 5, 4}.Notice that L6 and L7 have larger authority scores, and the user U4

also gains a large hub score improperly, simply because this uservisits her/his preferred locations more times.

3.3.3 Random Walks on GUL

This model exploits reinforcement between users’ travel experi-ences and locations’ significance while avoiding the problem of thelast baseline approach. It uses the randomized HITS model, whichdefines a random walk on GUL.

This model is insensitive to small perturbations and thus is morestable than what is obtained by using the original HITS model. Therandomized HITS model uses a normalized version of matrix M.Bharat and Henzinger [5] show that mutually reinforcing relation-ships between nodes may assign improper weights to links; the nor-malization solves this problem.

We define the column vector wuser as the hub vector and the col-umn vector wloc as the authority vector. Formally, the randomizedHITS model applied to GUL can be described as follows:

wk+1loc = (εMT

row + (1− ε)E1) · wkuser

wk+1user = (εMcol + (1− ε)E2) · wk+1

loc ,(2)

where k is the number of iterations, Mcol is the column stochasticmatrix of M (Mcol is computed by normalizing each column inM), Mrow is the row stochastic matrix of M (Mrow is computedby normalizing each row in M), E1 is a matrix with all elementsequal to 1

m, E2 is a matrix with all elements equal to 1

n, and ε is the

“teleport probability,” which represents the probability of a randomsurfer teleporting from a location node to a user node (resp. from auser node to a location node) instead of following the links inGUL.

We initialize wuser as ( 1m

, ..., 1m

) and wloc as ( 1n, ..., 1

n), and

we then use the power iteration algorithm to calculate the vectorswuser and wloc. The vector wloc is used to rank all the locations.

Assume ε = 0.85 and consider the example graph. If user U4

makes the trip L7 → L6 twice, the final stable authority vector be-comes {0.1374, 0.2148, 0.1003, 0.1003, 0.1390, 0.1730, 0.1351},and the ranked list is {2, 6, 5, 1, 7, 3, 4}. Compared with the resultsof the original HITS [23], L6 and L7 have lower rankings becausethe weights are normalized.

3.3.4 Random Walks on GLL

Unlike the previous method, this method exploits the location-location links: the significance of one location is affected positivelyby the significance of the locations connecting to it.

Inspired by the PageRank algorithm [18] that is developed for

1012

web link graphs, we perform random walks on GLL to rank lo-cations. Our model differs from the model used in the PageRankalgorithm in two respects: 1) weights are associated with our links,while the web link graph has no link weights and 2) the location-location graph is undirected while the web link graph is directed.

We are aware that weights are also explored in other applicationsof the random walk model (e.g., [4]) and that random walks havebeen applied to undirected graphs (e.g., for document summariza-tion [17]). However, no previous work performs a random walk ona spatial graph like the location-location graph.

Using a column vector wloc, the random walk model applied toGLL can be described as follows:

wk+1loc = (αCT

row + (1− α)E) · wkloc, (3)

where Crow is the row stochastic matrix of C defined in Sec-tion 3.3.1 (Crow is computed by normalizing each row in C), E isa matrix with all elements set to 1

n, and α is the well known “damp-

ing factor” (set to 0.85, following the PageRank algorithm [18]).We initialize wloc as a uniformly distributed vector ( 1

n, ..., 1

n)

and then apply the power iteration algorithm until the vector wloc

converges to a stable state.This model takes into account the relations between locations,

but ignores the effects of different users. It thus cannot distinguishbetween visits by more versus less authoritative users.

3.3.5 Unified Link Analysis FrameworkWe propose a new unified probabilistic model that takes into ac-

count both the links between users and locations and the links be-tween locations, as captured by the two-layered graph. No existingapproach is able to accommodate both aspects: the approach basedon randomized HITS model (Section 3.3.3) cannot model the re-lations between locations, while the approach based on PageRank-like model (Section 3.3.4) cannot model the mutual reinforcementbetween users and locations. Additionally, the proposed unifiedmodel is able to incorporate the durations of visits and the distancesbetween locations.

The unified model uses the structure of the two-layered graphto build a Markov chain. The proposed unified model thus can becharacterized as random walks on the Markov chain built on bothGUL and GLL in which the states are nodes. We first present theunified model and then explain the process of random walks.

We define three transition probabilities for the Markov chain:p(Uk|Li), the transition probability to a user node Uk from a lo-cation node Li; p(Li|Uk), the transition probability to a locationnode Li from node Uk; and p(Li|Lj ,Uk), the transition probabil-ity to location node Li from Lj for user Uk.

Given m users and n locations, we capture the transition prob-abilities (i.e., p(Li|Uk)) from user nodes to location nodes in an(m × n) matrix NUL, and, similarly, the transition probabilities(i.e., p(Uk|Li)) from location nodes to user nodes in an (n × m)matrix NLU. We capture the stationary distributions of the ran-dom walks for users and locations by two column vectors wuser

and wloc, respectively, to rank the significance of locations. Theunified model, denoted as Unified, can then be described as:

wk+1loc = NLU · wk

user, wk+1user = NUL · wk+1

loc

p(Uk|Li) = εNum(Uk,Li)

Num(Li)+ (1− ε)

1

m

p(Li|Uk) =

n∑j=1

p(Lj |Uk)p(Li|Lj ,Uk)

p(Li|Lj ,Uk) = αNum(Li,Lj ,Uk)

Num(Lj ,Uk)+ (1− α)

1

n,

(4)

where Num(Uk,Li) is the number of visits toLi byUk and Num(Li)is the total number of visits toLi; p(Li|Lj ,Uk) is the probability ofLi being visited by Uk from Lj ; and Num(Li,Lj ,Uk) is the num-ber of trips between locations Lj and Li by Uk, and Num(Lj ,Uk)is the total number of visits by Uk to Lj . Parameters ε and α arethe “teleport probability” in the random walks, and they controlthe effect of the constant probability 1/m and 1/n, respectively.In turn, these two are used to smooth p(Uk|Li) and p(Li|Lj ,Uk),respectively.

As mentioned, the unified model can be interpreted as a randomwalk model on both GUL and GLL. Thus, at location node Li,the random walk proceeds to user node Uk with the probabilityp(Uk|Li). Intuitively, the transition probability to a user node Uk

reflects the significance of the user to the location based on theuser’s previous behavior, i.e., the frequency of the user visiting Li.At Li the random walk can either follow a link in the graph GUL,or it can teleport to a random user node with probability 1 − ε.Similarly, at user node Uk the random walk proceeds to locationnode Li with probability p(Li|Uk).

To compute p(Li|Uk), we perform random walks on GLL foreach user node Uk. The transition probability of the random walkson GLL from location Lj to location Li for user Uk reflects thetrips by the user between the two locations, and the probability iscomputed by p(Li|Lj ,Uk) in Equation 4. From the stationary dis-tributions of the random walks, we get p(Li|Uk) for each locationLi.

The unified model exploits both user-location and location-locationreinforcement. In fact, the models in Sections 3.3.3 and 3.3.4 arespecial cases of the unified model.

Reduction to the Model in Section 3.3.3. If we disregard thelocation-location reinforcement, i.e., the locations are independentso that p(Li|Lj ,Uk) = p(Li|Uk), we compute the conditionalprobabilities in Equation 4 as follows:


Num(Li)+ (1− ε)

1

m

p(Li|Uk) = εNum(Uk,Li)

Num(Uk)+ (1− ε)

1

n,

where and Num(Uk) is the total number of visits by Uk. In Ap-pendix F, we show that with these definitions, Equation 4 can bereduced to Equation 2.

Reduction to the Model in Section 3.3.4. If we disregard theuser-location reinforcement, i.e., we treat all the users equally, wecan rewrite the probability p(Li|Lj ,Uk) as follows:

p(Lj |Li) = αNum(Lj ,Li)

Num(Li)+ (1− α)

1

n,

where Num(Lj ,Li) is the number of trips between Li and Lj . InAppendix F, we show that in this case, Equation 4 can be reducedto Equation 3.

We proceed to present an algorithm that implements the unifiedmodel. This algorithm uses the power iteration method to computethe stationary distribution of vector wloc that ranks locations. Al-though the ranking is done offline, we show that it is possible toreduce the unified model to a simplified model that enables a moreefficient algorithm.

THEOREM 1. Let P be the n×n ( n is the number of locations)transition matrix of the Markov chain onGUL andGLL. An elementpij of the matrix represents the transition probability between twolocations, and it is computed as:

pij = p(Lj |Li) =

m∑

k=1

p(Uk|Li)p(Lj |Li,Uk) (5)

1013

The unified model in Equation 4 can be reduced to the following:

wk+1loc = PT · wk

loc (6)

PROOF. See Appendix D.An algorithm that utilizes Equation 6 is more efficient than one

that uses Equation 4. The pseudocode of the proposed algorithm isin Appendix E.

Based on the unified model, we proceed to present an extendedmodel, denoted as ST-Unified, that is able to incorporate stay du-rations and distances between locations.

First, each stay has a duration ∆t, and the longer the stay at alocation, the more significant the location is assumed to be. Toaccount for durations in the unified model in Equation 5, we extendthe conditional probabilities p(Uk|Li) that so far only consideredthe numbers of visits:

p′(Uk|Li) =

ε(εNum(Uk,Li)

Num(Li)+ (1− ε)

Dur(Uk,Li)

Dur(Li)) +

1− ε

m

where Dur(Uk,Li) is the duration at Li of Uk, Dur(Li) is the to-tal duration of stays at Li, and ε controls the effect of the durations.

Second, given several equally attractive locations, e.g., conve-nience stores, a user is expected to prefer to travel to the nearestone. Thus, the longer the distance a user is willing to travel toreach a location, the more significant the location is considered tobe. To account for this, we extend the definition of p(Lj |Li,Uk),the probability of user Uk traveling from locationLi to locationLj .If there are trips between Lj and Li in the location history of Uk,the definition becomes:

p′(Lj |Li,Uk) =

α(ηNum(Lj ,Li,Uk)

Num(Li,Uk)+ (1− η)

Dist(Lj ,Li)∑h Dist(Li,Lh)

) +1− α

n,

where Dist(Lj ,Li) is the distance between Lj and Li. The largerthis distance is, the larger the conditional probability becomes. Pa-rameter η controls the effect of distances.

In ST-Unified, the random walks on GUL consider two factors:the stay duration and the reinforcement between visitors and loca-tions. The random walks on GLL consider another two factors: thedistances and the connection between locations.

For implementation, the transition matrix of ST-Unified is de-fined as P′, and its element p′ij is computed as:

p′ij =

m∑

k=1

p′(Uk|Lk)p′(Lj |Li,Uk)

Column vector wloc can then be computed as:

wk+1loc = P′T · wk

loc (7)

THEOREM 2. Both Unified and ST-Unified converge utilizingpower iteration.

PROOF. See Appendix D.

4. EXPERIMENTAL EVALUATIONWe cover the experimental settings and then the results semantic

location extraction and ranking in turn.

4.1 Settings and Evaluation MethodsWe use GPS data collected from 119 cars driven by young drivers

during the period 01/01/2007–31/03/2008. The data set contains inexcess of 0.1 billion GPS records.

Following the approach in Appendix B, we obtain 159,062 staysfrom the data set. After data cleaning, we obtain 76,139 stay points.Details on stay extraction are given in Appendix G.1, and Figure 2shows distribution of a sample of the stay points.

We also use three subsets of the whole dataset for evaluatingthe proposed methods. DATA1: Data restricted to the town ofNørresundby; this dataset contains 352 locations. DATA2: 1,508locations that were visited by at least 5 users. DATA3: Using thedata from DATA2, if a location is visited by a user more than 5times, the number of visits to the location by the user is counted as5. We create these subsets in order to obtain datasets with differentproperties.

All experiments are conducted on a computer with 2.3 GHz CPUand 2 GB main memory; our proposals are implemented using C#.

4.1.1 Extraction of Semantic LocationsAlgorithms: We evaluate the performance of the enhanced methodSEM-CLS (Section 3.2), comparing with the OPTICS and K-meansalgorithms. For OPTICS [1], we use software provided by its au-thors, and we use WEKA1 for the K-means method. We elaborateon how to obtain the ground truth for semantic locations in Ap-pendix G.3.Metrics: We use three metrics, namely entropy, purity [21], andnormalized mutual information (NMI) [15] that are used widely toevaluate the performance of clustering algorithms when a groundtruth exists. The smaller the entropy, the better a clustering methodperforms. For the other two measures, the larger, the better.

4.1.2 Location RankingRanking models: We evaluate the approaches presented in Sec-tion 3.3: the randomized HITS on GUL (U-L), the random walkson GLL ( L-L), the unified model on the two-layered graph (Uni-fied), and the unified model taking into account the durations anddistances (ST-Unified). We compare with the three baseline ap-proaches [23], i.e., rank-by-visits, rank-by-durations, and the HITS-based approach.Ground truth for ranking: We compute the top-50 locations foreach ranking method, combine them, and subject them to expertannotators in order to assess their significance. The locations areshown on a map, and their associated semantics are given to theannotators, who do not know which methods produced which loca-tions.

Four individuals familiar with Nordjylland perform the annota-tion, applying a label from Table 1 to each location. For each loca-tion, we use the average score from all the annotators.

Score Specification2 Very interesting to most people in general and recommended1 An OK location to most people in general0 Neutral to most people in general-1 Not interesting to most people in general and not recommended-2 I have no idea of what it is

Table 1: Annotation specification

Metrics: To capture the performance of our location significanceranking approaches, we apply three popular ranking performancemetrics, namely Mean Average Precision (MAP), Precision@n, andnDCG (normalized Discounted Cumulative Gain). Precision@nis the fraction of the top-n locations retrieved that are significant.When computing MAP and Precision@n, only the locations withscore above 1.5 are considered as significant. nDCG computes therelative-to-the-ideal performance [9].

1http://www.cs.waikato.ac.nz/ml/weka/

1014

Parameter selection: We use the parameters ε in the random walkmodel on GUL and α in the random walk model on GLL. Both areset to 0.85, thus following the PageRank [18] and the randomizedHITS [16] algorithms. In ST-Unified, ε and η are set to 0.3.

4.2 Experimental Results

4.2.1 Extracting Semantic LocationsTo evaluate the grouping of stay points into locations by the dif-

ferent clustering methods, we adjust the parameters of each methodto obtain a number of clusters (and thus locations) that is similar tothe number of ground truth locations.

We set k to 7,100 for K-means, and the initial k mean points areselected randomly. For OPTICS, we set ε to 17 meters and minPtsto 2 (a cluster must have at least two stay points). In the SEM-CLSmethod, threshold th is set to 3. As a result, K-means returns 7,056clusters (a bit below k due to empty clusters), and OPTICS returns7,088 clusters.

We apply SEM-CLS to enhance the results of K-means and OP-TICS. We vary the number of sampling points for each cluster from2 to 5 for SEM-CLS. The results are shown in Tables 2 and 3. The“# Samples” column gives the number of sampling points in eachcluster to be reverse geocoded, and the number of generated clus-ters (locations) is given in the last column.

# Samples Purity Entropy NMI # ClustersK-means 0.8630 0.4770 0.5909 7056

2 0.8743 0.4438 0.6096 60133 0.8852 0.4195 0.6236 63044 0.8990 0.3867 0.6311 65105 0.9096 0.3598 0.6375 6631

Table 2: K-means versus SEM-CLS on K-means with differentnumbers of sampling points

# Samples Purity Entropy NMI # ClustersOPTICS 0.8699 0.4567 0.6526 7088

2 0.8828 0.4292 0.6682 60553 0.8979 0.3847 0.6703 63584 0.9129 0.3573 0.6724 65555 0.9259 0.3139 0.6747 6680

Table 3: OPTICS versus SEM-CLS on OPTICS using differentnumbers of sampling points

We can see that OPTICS performs better than K-means, and wesee that SEM-CLS, when applied to both clustering algorithms, isable to improve on them in terms of all three metrics. This suggeststhat the use of the semantic and visit patterns for the splitting andmerging of clusters in SEM-CLS is effective.

We also observe that SEM-CLS improves as more points aresampled and reverse geocoded. However, calling an external APIto perform reverse geocoding consumes substantial time. Hence,there is a tradeoff between the effectiveness and the efficiency.

Efficiency. K-means takes 12 minutes to finish, and OPTICStakes 4 minutes. The enhancement method SEM-CLS takes lessthan 1 second. Additional details are available in Appendix G.4.

4.2.2 Ranking the Significance of LocationsTable 4 depicts the MAP, Precision@n, and nDCG@n results of

the different ranking models on the whole dataset. Each columncorresponds to an approach. ST-Unified performs the best in termsof all metrics, and Unified performs better than U-L and L-L, andthese all perform better than the three baseline methods.

U-L exploits the mutual reinforcement between the authority ofusers and the significance of locations. L-L uses information on the

number of visits, which is also used by the rank-by-visits method.It outperforms rank-by-visits because it propagates the significancebetween locations.

Unified improves on U-L and L-L because it combines the twographs GUL and GLL while the latter two only consider one sub-graph. ST-Unified performs better than Unified because it exploitsthe combined graph and also the distance between locations andstay durations.

L-L performs worse than the other proposed models. The reasonmay be that it does not capture the authority of users. The modelis also significantly affected by the locations visited many times byvery few users—such locations have large in-degree weights.

The HITS-based approach [23] on GUL performs worse than theother two baselines. Another study [23] shows that it performs bet-ter than rank-by-visits. This may be because the mutual reinforce-ment between users and locations can assign unsuitable weights tolinks according to the analysis of the HITS algorithm by Bharat andHenzinger [5], as discussed in Sections 3.3.2–3.3.3. For example,we find a location visited by only one user, but more than 200 times.As a result, the user has a very high hub score of 0.968, and the top-10 locations returned by the HITS-based approach are all locationsvisited by this user. However, none of them are significant loca-tions. The method rank-by-durations outperforms rank-by-visitsslightly by incorporating visit durations.

The top-10 results of the four proposed ranking models are givenin Appendix G.5.Efficiency. The last row of Table 4 gives the runtimes of the dif-ferent methods. Unified and ST-Unified use power iteration. Whenusing Equation 6, the runtime is about 4 seconds, while it is 130 sec-onds wuen using Equation 4. Recall that location ranking is doneoffline and not at query time. Pre-computed location rankings areutilized to answer top-k queries as described in Appendix A.Results on other datasets. Tables 5–7 shows the results of theHITS-based approach [23] and the proposed methods. We see thatST-Unified consistently performs the best and that Unified per-forms better than U-L and L-L.

HITS U-L L-L Unified ST-UnifiedMAP 0.5424 0.7231 0.6908 0.7664 0.8090P@20 0.5 0.7 0.6 0.75 0.8

nDCG@20 0.8588 0.9269 0.8861 0.9302 0.9545

Table 5: Ranking results using different ranking models on DATA1


nDCG@20 0.8092 0.9290 0.9166 0.9456 0.9501



nDCG@20 0.9202 0.9534 0.9132 0.9552 0.9658


The results in Table 5 on DATA1 that concerns a small regionare similar to the results on the whole data, meaning that the regionsize does not affect the results.

As discussed in Section 3.3.2 and in the coverage of the resultsreported in Table 4, a possible problem of the HITS-based approachis that locations visited many times by very few users are ranked toohighly. To avoid this potential problem, DATA2 does not containlocations visited by very few users. Table 6 shows that DATA2 does

1015

Rank-by-visits Rank-by-durations HITS [23] U-L L-L Unified ST-UnifiedMAP 0.2020 0.2126 0.062 0.3748 0.3020 0.4060 0.4274P@20 0.45 0.45 0.1 0.75 0.6 0.9 0.95P@50 0.36 0.38 0.12 0.68 0.52 0.74 0.76

nDCG@20 0.8261 0.8324 0.4555 0.9411 0.9031 0.9678 0.9897nDCG@50 0.7678 0.7747 0.4380 0.9226 0.8827 0.9402 0.9717

Runtime(ms) 103 107 1536 2209 3540 4234 4318

Table 4: Ranking results using different ranking models on the whole dataset

help the HITS-based approach, although the other models remainbetter.

In DATA2, every location has at least 5 visitors, but a user maystill visit a location many times. For example, we find that inDATA2, one location is visited by one user 101 times while be-ing visited very few times by other users. Thus both the user andthe locations visited by the user obtain very high ranking scores inthe HITS-based approach.

To eliminate such effects, we set for each user who visits a lo-cation more than 5 times the number of visits to 5, thus obtainingDATA3. Table 7 shows that the performance of the HITS-basedapproach is then close to that of the U-L model that is based on therandomized HITS algorithm, although ST-Unified still performsthe best.Parameter Study. ST-Unified uses four parameters: ε, α, ε, andη. The best performance is obtained when ε and α are in the range0.7–0.9 and ε and η are around 0.3. Additional details are availablein Appendix G.6.

5. CONCLUSIONS AND FUTURE WORKMotivated by the proliferation of GPS data, we propose a frame-

work that encompasses new techniques for extracting semanticallymeaningful geographical locations from such data and for the rank-ing of these locations according to their significance.

We model the relationships between locations and also the re-lationships between locations and users with a two-layered graph.The ranking model we propose takes into account significance prop-agation among locations; mutual reinforcement between locationsignificance and user authority; and aspects such as the number ofvisits to a location, the durations of the visits, and the distancesbetween locations.

An empirical study demonstrates that our proposals are capableof extracting semantic locations and of performing better rankingsthan several baseline methods and previous work.

Several promising directions for future work exist. First, it isof interest to study the processing of geo-context aware queries(discussed in Appendix A) based on the hot semantic locations.Second, it is of interest to mine “hot” semantic patterns from GPStrajectories.

AcknowledgmentsThe authors thank the anonymous reviewers for their insightfulcomments. The research was conducted when the authors were em-ployed at Aalborg University, Denmark. C. S. Jensen is an AdjunctProfessor at University of Agder, Norway.

6. REFERENCES[1] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. Optics:

Ordering points to identify the clustering structure. In Proc.SIGMOD, pp. 49–60, 1999.

[2] D. Ashbrook and T. Starner. Learning significant locations andpredicting user movement with GPS. In Proc. ISWC, pp. 101–108,2002.

[3] D. Ashbrook and T. Starner. Using GPS to learn significant locationsand predict movement across multiple users. Personal andUbiquitous Computing, 7(5):275–286, 2003.

[4] A. Balmin, V. Hristidis, and Y. Papakonstantinou. Objectrank:Authority-based keyword search in databases. In Proc. VLDB,pp. 564–575, 2004.

[5] K. Bharat and M. R. Henzinger. Improved algorithms for topicdistillation in a hyperlinked environment. In Proc. SIGIR,pp. 104–111, 1998.

[6] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-kmost relevant spatial web objects. PVLDB, 2(1):337–348, 2009.

[7] C. H. Q. Ding, X. He, P. Husbands, H. Zha, and H. D. Simon.Pagerank: Hits and a unified framework for link analysis. In Proc.SDM, pp. 249–253, 2003.

[8] R. Hariharan and K. Toyama. Project lachesis: parsing and modelinglocation histories. In Proc. Geographic Information Science,pp. 106–124. 2004.

[9] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation ofIR techniques. ACM TOIS, 20(4):422–446, 2002.

[10] J. H. Kang, W. Welbourne, B. Stewart, and G. Borriello. Extractingplaces from traces of locations. Mobile Computing andCommunications Review, 9(3):58–68, 2005.

[11] J. M. Kleinberg. Authoritative sources in a hyperlinked environment.JACM, 46(5):604–632, 1999.

[12] A. Langville and C. Meyer. Deeper inside PageRank. InternetMathematics, 1(3):335–380, 2004.

[13] L. Liao, D. J. Patterson, D. Fox, and H. Kautz. Building personalmaps from GPS data. Annals of the New York Academy of Sciences,1093:249–265, 2006.

[14] J. Liu, O. Wolfson, and H. Yin. Extracting semantic location fromoutdoor positioning systems. In Proc. MDM, p. 73, 2006.

[15] C. D. Manning, P. Raghavan, and H. Schtze. Introduction toInformation Retrieval. Cambridge University Press, 2008.

[16] A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for linkanalysis. In Proc. SIGIR, pp. 258–266, 2001.

[17] J. Otterbacher, G. Erkan, and D. R. Radev. Using random walks forquestion-focused sentence retrieval. In Proc. HLT/EMNLP,pp. 915–922, 2005.

[18] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRankcitation ranking: Bringing order to the web. TR 1999-66, StanfordInfoLab, 1999.

[19] M.-H. Park, J.-H. Hong, and S.-B. Cho. Location-basedrecommendation system using Bayesian user’s preference model inmobile devices. In Proc. UIC, pp. 1130–1139, 2007.

[20] F. Schmid and K.-F. Richter. Extracting places from location datastreams. In Proc. UbiGIS 2006, 2006.

[21] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to DataMining. Addison-Wesley Longman Publishing Co., 2005.

[22] K. Yatani, K. Tamura, K. Hiroki, M. Sugimoto, and H. Hashizume.Toss-it: Intuitive information transfer techniques for mobile devicesusing toss and swing actions. IEICE Transactions, 89-D(1):150–157,2006.

[23] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma. Mining interestinglocations and travel sequences from GPS trajectories. In Proc.WWW, pp. 791–800, 2009.

[24] C. Zhou, N. Bhatnagar, S. Shekhar, and L. G. Terveen. Miningpersonally important places from GPS tracks. In Proc. ICDEWorkshops, pp. 517–526, 2007.

1016

APPENDIXA. APPLICATION SCENARIOS

The extracted top-k hot semantic locations can be used by a rec-ommendation system to fulfill the following two types of recom-mendation queries.

The first kind of query is the location-aware top-k query. It re-turns a ranked list of places according to a user’s location. Forexample, a user staying in a hotel would like to find the top-10significant locations that are close to the hotel.

Formally, a location-aware top-k query is represented by Q ={k, ι}, where k is the number of locations requested, and ι = (x, y)is the query location. The query retrieves the top-k hot locationswith regard to the query location ι.

Clearly, the popularity of locations should be taken into accountto answer such queries. In addition, the interestingness of a locationto a user is affected by the distance from the user’s location (thequery location) to the semantic location. A user is probably willingto visit a less interesting nearby location instead of slightly moreinteresting, but further away location.

Hence, to answer this kind of recommendation query, we takeinto consideration both the inherent significance of a location andits distance to the query location.

A location-aware top-k query is similar to a web query. For aweb query, web pages are ranked by a combination of their inher-ent, relative importance (e.g., as computed by PageRank), and theirrelevance to the query. Here the inherent importance correspondsto the inherent significance of locations, and the relevance corre-sponds to the distance between locations and the query location.

In addition, the extracted hot locations can also be combinedwith the location-aware top-k keyword query [6], which takes intoaccount both the distances of locations to the query point and therelevance of the textual descriptions of locations to the query key-words.

The second kind of query is the context-aware recommendationquery. Recall that a GPS record G contains user information (theuser ID u), temporal information (the timestamp t), and spatialinformation (the coordinates x and y). The concept of a multi-dimensional view of the data from multidimensional databases anddata warehousing can be incorporated into hot-location recommen-dation. In other words, we can take into account a user dimension,a time dimension, and a spatial dimension when providing context-aware recommendations. Example queries include “find the top-30locations visited by persons between the age 20 and 30,” “find thetop-10 locations in May, 2007,” and “find the top-20 locations inthe city center.”

The context-aware recommendation query takes parameters 〈k,us, ts, rs〉. It retrieves the top-k locations from the GPS data thatsatisfy the user-dimension predicate us, the time-dimension pred-icate ts, and the spatial-dimension predicate rs. Obviously, if wedo not impose any context constraints, the context-aware query re-duces to the problem of ranking semantic locations as defined inSection 2.2.

On the user dimension different groups of users have various in-terests, and the hot locations from different user groups reveal thegroup’s preferences. On the time dimension the significance of lo-cations may vary with the time of day since locations have differentopening hours (e.g. bars are usually open at night). Another exam-ple is that the significance of locations may vary across the fourseasons (e.g. outdoor tourist sites may not be visited in winter asfrequently as in summer). On the spatial dimension, the signifi-cance of a location may vary with respect to different regions. If alocation is mostly visited sequentially from other locations within

the same region, this indicates that the location will have high localimportance, but it may not be that interesting in the global view.For example, a community kiosk may display the property.

Note that the two kinds of queries are not orthogonal and can becombined. For example, a query could be “find the top-10 locationsin summer near my hotel.” Similar to a location-aware query, toprocess this query we still need to compute the popularity of loca-tions and the influence of locations to the query location. However,the popularity of locations will be computed by a context-awarerecommend query, i.e., we use only the GPS trajectories generatedin summer to extract and rank hot locations.

B. EXTRACTING STAY POINTSWhen a car is turned off, the GPS-enabled device attached to

the car also stops recording. We utilize this feature to extract staypoints. If the duration ∆t between two consecutive records from auser is larger than a threshold tth, the two records represent a stay.The record that is the end of the previous trip is denoted Gend andthe record that is the start of the next is denoted Gstart.

When a car is started, the car’s GPS device needs sometime tostart working. Hence the ending record best captures the locationof the stay, and we define a stay as P = (Gend, ∆t).

The value of the threshold tth affects the number of extractedstay points. We studied the effect of tth on number of stay pointsusing different for on our data, and we also reverse geocode thegenerated stay points and measured the number of distinct streetaddresses. We found it useful to use a threshold of 10 minutes,which yields some 76,000 stay points and more than 7,000 streetaddresses for our data. This value is also used in previous work [3].

Substantial data cleaning of the raw GPS data is also needed toobtain a robust approach. A number of checks were carried out, andthe approach was generally to repair or discard data when problemswere identified. The specifics are left out for brevity.

C. DETAILS ON BASELINESWe consider three baseline methods:

Rank-by-visits. This method assumes that the more a location isvisited, the more significant it is. This method ranks the locationsaccording to the number of visits by all the users. Ties are brokenby the sum of durations.Rank-by-durations. This method is based on the rank-by-visitsmethod. It takes into account the durations of visits: It ranks the lo-cations according to the sum of durations of all visits to a location.Ties are broken by the number of visits.Zheng et al. [23]. Formally, the approach utilizes the matrix Mfrom Section 3.3.1 to compute an authority score column vector afor the locations, and a hub score column vector h for the users.

ak+1 = MT · hk, hk+1 = M · ak+1 (8)

The locations are ranked by the authority score vector a.

D. PROOFS

D.1 Proof of Theorem 1PROOF. Let p(Uk) denote the kth element in wuser , and p(Li)

denote the ith element in wloc. From Equation 4, we know that:

p(Li) =

m∑

k=1

p(Uk)p(Li|Uk)

=

m∑

k=1

p(Uk)

n∑j=1

p(Lj |Uk)p(Li|Lj ,Uk)

1017

According to the bayesian theorem, we can get:

p(Li) =

m∑

k=1

n∑j=1

p(Uk)p(Lj |Uk)p(Li|Lj ,Uk)

=

m∑

k=1

n∑j=1

p(Lj)p(Uk|Lj)p(Li|Lj ,Uk)

=

n∑j=1

p(Lj)

m∑

k=1

p(Uk|Lj)p(Li|Lj ,Uk)

Since p(Li) =∑n

j=1 p(Lj)p(Li|Lj), we can obtain:

p(Li|Lj) =

m∑

k=1

p(Uk|Lj)p(Li|Lj ,Uk)

After changing the subscript, we have:

p(Lj |Li) =

m∑

k=1

p(Uk|Li)p(Lj |Li,Uk)

This completes the proof.

D.2 Proof of Theorem 2PROOF. Because we add a constant probability to both p(Uk|Li)

and p(Lj |Li,Uk), it is assured that all the elements in P and P′

is larger than zero. Hence both matrices are irreducible. All ele-ments in the two matrices are transition probabilities and are thuspositive. In matrix P, it can be shown that:

n∑j=1

pij =

n∑j=1

m∑

k=1

p(Uk|Li)p(Lj |Li,Uk)

=

m∑

k=1

p(Uk|Li)

n∑j=1

p(Lj |Li,Uk)

=

m∑

k=1

p(Uk|Li) · 1 = 1

Therefore P is row stochastic. Similarly we can also prove that P′

is row stochastic.In summary, P and P

′are irreducible, positive and row stochas-

tic matrices. According to the Markov chain theorem, both the Uni-fied and ST-Unified model will converge using the power iterationalgorithm.

E. ALGORITHM OF THE UNIFIED MODELThe algorithm of the unified model is shown in Algorithm 1.

F. REDUCTION OF THE UNIFIED MODELThe unified model can be reduced to each of the two models

introduced in Sections 3.3.3 and 3.3.4.Reduction to random walk onGUL: If we disregard the location-

location reinforcement, we have p(Li|Lj ,Uk) = p(Li|Uk). Wecan estimate the element p(Uk|Li) in NUL and the element p(Li|Uk)in NLU as:


Num(Li)+ (1− ε)

1

m

p(Li|Uk) = εNum(Uk,Li)

Num(Uk)+ (1− ε)

1

n

Under this estimation, we can see that:

Algorithm 1 UnifiedModel (SH, m, n)Input:SH, location histories of all users, m, the number of users, n, thenumber of locationsOutput:wloc, the column vector containing ranking scores of all locations1: M[m][n] ← NewMatrix()2: N[m][n][n] ← NewMatrix()3: P[n][n] ← NewMatrix()4: initialize all the elements in wloc as 1

n5: initialize all the elements in M[m][n] and N[m][n][n] as zero6: for each user’s location history Hk in SH do7: if the ith location is visited by the kth user then8: M[k][i] ← M[k][i] + 19: if the trip is from the jth location then

10: N[k][j][i] ← N[k][j][i] + 111: BuildMatrix(P, M, N)12: do power iteration on P until wloc reaches the stationary state.13: return wloc

NLU = εMTrow + (1− ε)E1

NUL = εMcol + (1− ε)E2

Therefore, Equation 4 is exactly the same as Equation 2.Reduction to random walk on GLL: If we disregard the user-

location reinforcement, we treat all the users’ trajectories as onepseudo “user.” Now we have:

p(Li|Lj ,Uk) = p(Li|Lj) = αNum(Lj ,Li)

Num(Li)+ (1− α)

1

n

Notice that p(Li|Lj) is the element of P. Hence

P = αCrow + (1− α)E

We can see that Equation 6 is equal to Equation 3. Since we haveproved that Equation 6 is simplified from Equation 4, Equation 4can be reduced to Equation 3.

G. ADDITIONAL EXPERIMENTALSETTINGS AND RESULTS

G.1 Extracting staysAs explained in Appendix B, we mark two consecutive GPS

records as the starts and end of a stay if the time duration betweenthem exceeds ∆t=10 minutes. This yields 159,062 stays in ourdata set. After data cleaning, we obtain 76,139 stay points. Thesepoints are located in the region 56◦ ∼ 58◦ North, 9◦ ∼ 11◦ East,which is the region of Nordjylland in Denmark. Figure 2 depictsthe distribution of a sample of the stay points.

G.2 Reverse Geocoding and ObtainingSemantics

The Google Maps API is invoked for reverse geocoding. Given alatitude and a longitude, Google Maps returns an addressable loca-tion that is nearest to the query position. The returned informationcontains the street address and the coordinates. An online yellowand white pages directory in Denmark (http://www.degulesider.dk/)is used for finding the semantic of a given street address. The yel-low pages service returns a list of semantic locations that are nearthe query street address and one of which may match exactly thegiven query street address.

The coordinates used in the raw GPS data are in the UTM (Uni-versal Transverse Mercator coordinate system) format. To performreverse geocoding, we convert the data from the UTM format to

1018

Figure 2: Distribution of stays in our GPS data

latitude and longitude. The method introduced by Salkosuo 2. isused to perform the conversion. The conversion uses WGS (WorldGeodetic System) 84 specification.

G.3 Ground Truth of Semantic LocationsWe build a ground truth of semantic locations to evaluate the

clustering methods. We first reverse geocode each stay point us-ing the Google Maps API. Given a stay point P with coordinate(x, y), Google Maps API will return the street address of the pointtogether with the coordinate (x′, y′) of this street address. We callthe returned coordinate the reverse-coordinate to distinguish it fromthe coordinate of the stay point. Note that a street address maycorrespond to multiple reverse-coordinates, particularly when thereturned street address does not contain a street number. We as-sume that a semantic location corresponds to one or several distinctreverse-coordinates.

We group stay points such that each group has a distinct reverse-coordinate. We then check whether some groups should be mergedto represent a semantic location: We compute the distance of allpairs of groups using their reverse-coordinates; If the distance be-tween two groups is less than 100 meters, we ask annotators tocheck whether they are the same semantic location and should bemerged. Finally, we obtain 7,082 semantic locations.

G.4 Efficiency of Semantic LocationExtraction

K-means takes 12 minutes to finish, and OPTICS takes 4 min-utes. The enhancement method SEM-CLS takes 390 ms when thenumber of sampling points is 2, and 420 ms when the number ofsampling points is 5 if we assume that all the points have been re-verse geocoded by calling external API beforehand. The runtimeof reverse geocoding depends on the stability of network, and theavailability and workload of external API service. Sometimes wehave to call the external API service several times to get the result.Hence, it does not really make sense to report the runtime of theexternal API service.

G.5 Top-10 Results of Four Ranking ModelsFigure 3 shows the top-10 results of the four ranking models

proposed in Section 3.3. The harbor location is detected by the L-Lmodel (the top right corner). It does not have a large number ofvisitors, but has a relative large number of visits. The hospital loca-

2http://www.ibm.com/developerworks/java/library/j-coordconvert/index.html

tion (in the middle) is detected by the U-L model. This is becausemany users have been there, but not too many times. The two uni-fied models detect both locations. The top 10 semantic locationsreturned by the four methods are listed in Table 8.

(a) U-L (b) L-L

(c) Unified (d) ST-Unified

Figure 3: Top-10 results of the four ranking models

U-L L-L Unified ST-Unified1 Bilka Bilka Bilka Bilka2 Føtex Stadium Church Church3 Church Harbor Stadium Stadium4 Cinema Føtex Føtex Føtex5 Kommune unknown Hospital Zoo6 Hospital Church Cinema Hospital7 Stadium Cinema Zoo Cinema8 Train station Doense Dybfrost Harbor Kommune9 unknown Gas station Bauhaus Harbor

10 Bauhaus unknown unknown Bauhaus

Table 8: Top-10 results of the four models

Bilka and Føtex are big supermarkets in Aalborg. Bauhaus andDoense Dybfrost are two companies, and their annotation scoresare smaller than 1.5.

G.6 Parameter Study for the Ranking ModelsThere are four parameters in the ST-Unified model, i.e., ε, α, ε

and η. ST-Unified performs the best when ε and α are in the range0.7–0.9 and ε and η are around the value 0.3.

Parameters ε and α are used to control the effects of the constantprobability (to teleport to other nodes). Parameter ε is to controlthe importance of the stay duration at a location, and parameterη controls the importance of the distance between two locations.They affect the effectiveness of the ST-Unified model.

Figure 4 shows the effect of varying ε. As we can see, the bestranking performance occurs when the value of ε is in the range0.7–0.9. By setting ε to zero, we treat all the users equally, andthus ST-Unified works like the L-L model; by setting ε to 1, wedisregard the teleport probability in the random walk on GUL. Atboth extremes, the performance becomes worse.

1019

Figure 5 shows the effect of varying α, and we can see that theperformance is the best when α is in the range 0.7–0.9. When weset α to zero, all the transition probabilities become the same, andST-Unified works like the U-L model; and when setting α to 1, wedisregard the teleport probability in the random walk on GLL.

0.2

0.3

0.4

0.5

0.1 0.3 0.5 0.7 0.9

MA

P

ε

Figure 4: MAP by varying ε

0.2

0.3

0.4

0.5

0.1 0.3 0.5 0.7 0.9M

AP

α

Figure 5: MAP by varying α

Figure 6 and Figure 7 shows the effect of varying η and varyingε, respectively. We can see that only if we combine the considera-tion of the number of visits, the durations and the distances will weachieve the best results.

When η and ε are set as zero, ST-Unified will become Unified.When ε is set to a large value, the duration plays a more importantrole than the visiting times in estimating the conditional probabil-ity p(Li|Uk). We can see that the performance is slightly worse.When η is set at a large value, the distance is more important forestimating p(Lj |Li,Uk). This leads to very poor performance be-cause the distance alone is not a good indicator of the transitionprobability of a random surfer between two locations.

0.1

0.2

0.3

0.4

0.5

0.1 0.3 0.5 0.7 0.9

MA

P

η

Figure 6: MAP by varying η

0.2

0.3

0.4

0.5

0.1 0.3 0.5 0.7 0.9

MA

P

ε

Figure 7: MAP by varying ε

H. ADDITIONAL RELATED WORKExtracting locations from GPS trajectories: There has been

a host of studies on extracting locations from single users’ GPStrajectories [3, 8, 10, 13, 20, 24]. There is also recent work [23]on extracting locations from multiple users’ trajectories. Someworks [8, 10, 20] do not consider the re-occurrence of GPS read-ings at the same location, and thus they do not need to cluster theGPS points into locations; other works make use of clustering togroup GPS records to get locations [3, 23, 24]. However, no worksconsider the semantics of locations when extracting locations.

Liu et al. [14] consider the extraction of semantic locations, butthe study is based on a single user’s data and does not cluster similarpoints. In contrast, our work handles multiple users’ trajectories,and a semantics enhanced clustering method is used for extractingsemantic locations.

Mining important locations: Zhou et al. [24] mine personal im-portant locations from a single user’s trajectories. The importanceis a personal view and is “defined” by the user, such as the home orwork office of the user. They do not rank locations, but only classifylocations as important or not. In contrast, we rank locations basedon location histories of multiple users.

Zheng et al. [23] mine interesting locations and travel sequencesfrom GPS data, and a HITS-based inference model is used for rank-ing locations. Our work differs from that work in that 1) we con-sider semantics and use semantic to enhance the location extractionprocess; 2) we propose new models for ranking locations, which arecapable of better exploiting the features of GPS trajectories.

Several locations recommender systems [19,22] can recommendlocations to users based on real-world data. Our proposed tech-niques can also serve as a location recommender system.

PageRank and HITS. These are two popular link based rank-ing algorithms that were originally developed for web link analysis.Both methods rank pages according to their importance and author-ity, estimated by the importance of pages pointing to them. Theycan be understood as as a Markov chain in which the states arepages and the transition probabilities are determined by the linksbetween pages; the PageRank problem actually amounts to solv-ing an old problem (computing the stationary vector of a Markovchain) in the context of web links [12].

We note that Ding et al. [7] attempt to find a framework to unifythe PageRank and HITS models. The unification is to establish aconnection between the two models such that it becomes possibleto use a simple count of the number of inlinks to a webpage as anapproximation of its PageRank. However, the unified model workson a single-layered web link graph, not the two-layered graph inour problem.

1020

mining signiﬁcant semantic locations from gps datavldb2010/proceedings/files/papers/r… ·...

Documents