

Predict User In-World Activity via Integration of Map Query and Mobility Trace

Zhengwei Wu
Baidu Research, Big Data Lab
[email protected]

Haishan Wu∗
Baidu Research, Big Data Lab
[email protected]

Tong Zhang
Baidu Research, Big Data Lab
[email protected]

ABSTRACT

People often turn to map search engines or other location-based services for location information when planning long trips or navigating locally, and their map queries and mobility traces are accumulated in user logs. These data offer valuable information for studying the mechanisms of human mobility. Furthermore, map query data enable us to sense users’ real-time interest in locations, and even to forecast their in-world activity in the near future.

In this paper, we unveil the connection between users’ map queries and their in-world explorations, and demonstrate the predictability of query-activity formation using two large-scale datasets: the complete map query log and the mobility traces of 4 million Baidu map users, which comprise 118 million map queries and 6.5 billion GPS location records over 3 consecutive months. To the best of our knowledge, this is the first attempt to extensively assess the unique qualities of map query data and to predict, using heterogeneous data sources, whether queries about a location will actually lead to in-world visits. We first characterize the properties of the two datasets, then extract features that quantify their correlation, and finally construct a gradient boosting model for prediction. We also describe applications empowered by our findings, such as mobility modeling and urban flow estimation.

General Terms

Human Mobility, Urban Computing, Spatial-temporal Data Analysis, Query Log Analysis

Keywords

Human Mobility, Urban Computing, Map Query Log Analysis

∗Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. UrbComp’15, August 10, 2015, Sydney, Australia. Copyright is held by the author/owner(s).

1. INTRODUCTION

The pervasive use of location-based services and smartphones around the globe has led to a vast and rapid accumulation of geolocation data. Imagine the following scenario: you search for the location of a new restaurant on Google Maps on your smartphone and plan to meet a friend there; two hours later you drive from the office to the restaurant using the navigation service provided by Google Maps; you then share your current location by checking in at the restaurant through the Twitter and Foursquare apps. After dinner, you issue a query about the airport and hotels in a new city that you intend to visit during the forthcoming vacation. Meanwhile, your map queries and GPS coordinates are captured by these application providers with your consent.

As a result, a huge number of map query and trajectory records are produced. In China, Baidu, the largest Chinese search engine, also holds a dominant position in mobile map applications, with a share of over 60% of the domestic mobile map market. Our research is supported by large-scale datasets collected from anonymous Baidu mobile app users.

Such geocoded map queries imply a strong intent to visit the queried location, while the captured mobility traces characterize users’ mobility patterns and their preferences among historically visited places [11][20]. Integrating and mining these two data sources has great potential for various commercial applications. Previous research shows that mobile users tend to perform location search queries when they plan to visit less familiar places [18]. If we can validate whether a user visits the searched places by checking historical traces, we can model the relationship between location queries and mobility behavior. Furthermore, we can not only predict future mobility from location histories, but also anticipate a user’s likelihood of visiting new places after issuing map queries. This is fundamental to many location-based services such as travel package recommendation [12], precise customer targeting [15] and predictive search [7].

In this paper, we focus on investigating the unique properties of map query data and assessing how strongly it relates to in-world activities, via integration with mobility trace data. We make three contributions in this work. First, we model the connection between users’ map query intents and in-world location exploration on two large-scale datasets, which cover 4 million Baidu mobile map users over 3 months, as shown in Figure 1. To the best of our knowledge, this is the first work to predict mobility behavior by combining two such heterogeneous datasets. Second, we extract extensive query-activity features to characterize the correlation between online location queries and offline in-world visits. Finally, using the extracted features as inputs, we build a gradient boosting model for the prediction task.

Figure 1: Left: visualization of the mobility trajectories of 8751 users. Right: visualization of the map query OD (origin-destination) network of 8751 users. Blue indicates the location where a query was issued, red indicates the queried location, and curves indicate the queried locations that were actually visited.

The rest of the paper is organized as follows. Section 2 surveys recent work on human mobility, geocoded query mining and urban computing, and elaborates the contribution of our work. Section 3 describes the two large-scale datasets, i.e., map query data and mobility trace data, and how we integrate them for analysis. Section 4 details how we extract important features, and Section 5 proposes and evaluates a gradient boosting model for the query-activity prediction task. Finally, we conclude our findings and discuss future work and promising applications enabled by our research.

2. RELATED WORK

From a theoretical perspective, Song et al. [14] quantified the predictability of human mobility using three different entropy measures, showing that up to 93% of human mobility behavior is predictable. Numerous models have been proposed to predict future human mobility, but few have incorporated users’ map query data as an additional source.

Search engines are the gateway to Internet information, and massive volumes of search queries can crowdsource the real-time interests of a vast population and help forecast its future trends. Many applications have used Google search queries to predict economic indicators and disease outbreaks [3]. Teevan et al. [17] provided a detailed analysis of users’ mobile local search behavior, showing that local searches are highly contextual. West et al. [18] performed a thorough analysis of geocoded query logs and introduced a statistical model to infer the interplay of location queries and users’ in-world activities. Their research shows that geocoded queries hold valuable information for predicting users’ future movement. Yang et al. [19] mined geocoded mobile queries and applied their model to predicting users’ future visits to healthcare facilities.

These models, as the authors admitted in [18], were not able to quantify the factors that influence visit-after-query behavior, as they lacked the ground truth of actual movement needed to validate whether a user did visit the queried locations. Besides, compared with the geocoded data used in the work mentioned above [10][17][18][19], the location and routing queries on the mobile map service that we use here contain both geographic and POI descriptions, and thus reflect users’ intents towards future movement more directly.

There is also notable work that studies individual mobility patterns or urban dynamics using geotagged social media data [5][21].

3. DATA PROCESSING AND INTEGRATION

The whole framework is illustrated in Figure 2. In this section we focus on the data processing and data integration parts.

3.1 Data processing

3.1.1 Map Query

When a user searches for a place or plans a route using the Baidu map app, the query keyword (i.e., the destination) and the user’s current location are recorded and stored on Baidu’s database server, provided the user consents. We collected more than 118 million historical map queries of 4 million users over 3 months starting from January 1, 2015, that is, 29 map queries per user on average. Each map query record contains the following information: anonymized user ID, timestamp, latitude and longitude of the location where the query was issued, query keyword, and the properties of the point of interest (POI) resolved from the query keyword, including POI address, category and detailed information. In addition, when a user plans a route, the locations of the specified destination and origin and the selected transportation mode (walking, driving or transit) are also recorded. About 10% of map queries cannot be resolved into valid GPS coordinates, mainly due to erroneous user input, and are discarded during preprocessing. We observed that in route planning the origin is in most cases the location where the map query is issued, while in almost 10% of cases the origin is a place other than the user’s current location. The origin-destination distance for each transportation mode follows a power-law distribution, although different transportation modes exhibit different long tails.

Figure 2: Diagram of proposed system

3.1.2 Mobility Trace

We collected mobility trace data for the same 4 million Baidu mobile map users. Each trajectory records 3 months of historical locations captured by the Baidu mobile geolocation SDK starting from January 1, 2015. Location data are stored on the server if the user consents to Baidu’s privacy policy, and each record contains a hashed user ID, timestamp, latitude, longitude, reverse-geocoded address and location precision radius. A user’s location may be determined by various means, including GPS, wireless network hotspots or cell towers, and the average accuracy radius is less than 60 meters. Incorrect location records caused by network failures were removed during preprocessing.

We then identify stay locations from the historical mobility trace data. First, we classify the mobility mode into three states (stay, pause and move) based on the movement distance and time interval between two consecutive location records. Afterwards we extract the location records in the stay state and employ DBSCAN [6], an efficient density-based spatial clustering algorithm, to identify each user’s stay points. In our experiments it performed best compared with other classic clustering methods such as the MeanShift algorithm [4] and K-Means. Each user has 20 stay points on average, and the number of stay points ranges from 1 to 158, influenced by the regularity of the user’s mobility and the density of GPS records. We determine a user’s home and workplace from the visit frequency and hour-of-day distribution of each stay point; generally the two most frequently visited stay points correspond to the user’s home and workplace.
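The stay-point extraction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the speed thresholds in `classify_state`, the `eps_m` and `min_pts` values, and all function names are hypothetical, and the clustering is a plain quadratic-time DBSCAN rather than an optimized library version.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def classify_state(prev, curr, stay_speed=0.2, move_speed=1.5):
    """Label the step between two (timestamp_s, lat, lon) records.

    Distance and time interval are combined into a speed in m/s.
    Both thresholds are hypothetical values chosen for illustration.
    """
    dt = curr[0] - prev[0]
    if dt <= 0:
        return "stay"
    speed = haversine_m(prev[1:], curr[1:]) / dt
    if speed < stay_speed:
        return "stay"
    return "pause" if speed < move_speed else "move"

def dbscan(points, eps_m=100.0, min_pts=3):
    """Plain O(n^2) DBSCAN over (lat, lon) points.

    Returns one cluster label per point; -1 marks noise.
    """
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # only core points expand the cluster
                queue.extend(j_nbrs)
    return labels
```

In this sketch, records whose step is labeled "stay" would be passed to `dbscan`, and each resulting cluster would be treated as one candidate stay point.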

3.2 Data integration

To quantify the relationship between location queries on the map and actual movement in mobility traces, one essential step is to integrate the two datasets. We first established the correspondence between them via the hashed user ID. We set the maximum time interval and the distance threshold of a query-activity match to 15 days and 200 meters, respectively. In other words, we assume a queried location is actually visited by the user if at least one location record in the user’s mobility trace over the next 15 days lies less than 200 meters from it. Figure 5 shows the distribution of distance from the location where a map query was issued to the queried location that the user indeed visited afterwards. As observed, about 20% of OD (origin-destination) distances are above 100 km, indicating that people tend to plan their routes in advance for long-distance travel. We also observed that people usually query a general keyword like ’Beijing’ or ’airport in Beijing’ before they travel to a new city. To better handle such cases, the distance threshold is designed to scale with the OD distance, and here we set the scaling factor to 0.05; i.e., when a user queries a location 1000 km away, the query-trace match is established if a trace record lies less than 50 km from the queried location. In this way, 40-50% of user map queries can be confirmed as actually visited afterwards by matching them with future mobility traces. The actual percentage varies with trace density, since some users might issue a query about a place and visit it later without the visit being observable in their sparse traces. Figure 3 illustrates the spatial-temporal relationship between map queries and mobility traces.
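The matching rule above can be illustrated with a short Python sketch. It rests on assumptions that the text does not spell out: the function names are hypothetical, and combining the fixed 200 meter threshold with the scaled threshold as `max(200 m, 0.05 × OD distance)` is one plausible reading rather than a confirmed detail.

```python
import math

FIFTEEN_DAYS_S = 15 * 24 * 3600  # maximum query-to-visit time interval

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def match_radius_m(od_distance_m, base_m=200.0, scale=0.05):
    """Matching radius that grows with the origin-destination distance.

    Assumption: the radius never drops below the 200 m base threshold.
    """
    return max(base_m, scale * od_distance_m)

def is_visited(query_time_s, origin, destination, trace):
    """Return True if any trace record falls inside the matching radius
    of the queried destination within 15 days after the query.

    trace is an iterable of (timestamp_s, lat, lon) records.
    """
    radius = match_radius_m(haversine_m(origin, destination))
    for t, lat, lon in trace:
        in_window = query_time_s < t <= query_time_s + FIFTEEN_DAYS_S
        if in_window and haversine_m((lat, lon), destination) < radius:
            return True
    return False
```

With these defaults a query for a destination 1000 km away is matched by any later record within 50 km of it, which reproduces the example given in the text.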

4. CHARACTERIZE MAP QUERY AND MOBILITY TRACE

4.1 Query-Activity Relationship

4.1.1 Time Lag between user’s map query and in-world activity

We quantify the relationship between map query data and mobility trace data after the integration detailed in the previous section. Figure 4 shows the distribution of the time interval between the time a user searched for a place and the time he/she visited it. Not surprisingly, it is well characterized by a heavy-tailed distribution: about 25% of matched query locations are visited within 30 minutes, over 50% within 2 hours, 81% within one day and 95% within one week. One interesting finding is that the distribution has local extrema at around 6 hours and 24 hours. One possible interpretation is that if a trip takes more than 6 hours (the longest domestic flight in China takes less than 6 hours), people tend to visit the place the next day or even later.

Figure 3: Queries and trajectory timeline of a user’s journey during October, 2014: blue straight lines indicate mobility trajectories and curves indicate the queried routes. Each query, with its timestamp and the POI category of the queried location, is labeled at the location where it was issued. We see that the user’s query OD lines are always one step ahead of his in-world trajectory.

Figure 5 shows the distributions of distance between the queried destination and the user’s home and workplace, respectively. We found that over 50% of queried locations lie within 50 km of the user’s home or workplace, with no significant discrepancy between home and workplace (54.5% and 51.1%, respectively). By contrast, when we compared the frequency of map queries issued at home and at the workplace, we found that home-based queries are more than twice as frequent as workplace-based queries (23% and 9.4%, respectively). This indicates that users tend to search for new places when they are at or near home.

4.1.2 Influence of Query Type

We then quantified the influence of the route planning service on query-activity behavior. Intuitively, querying for directions indicates a stronger intention to visit the queried location. Figure 6 demonstrates that using the route planning service clearly increases the visiting probability, while the influence of the three transportation modes provided by the mobile map (walking, transit and driving) differs. Interestingly, users are more likely to visit the queried location after checking walking directions than after checking transit or driving directions. This might be because the user is much closer to the destination in such cases.

Figure 4: Distribution of the time lag between users’ map queries and their actual in-world visits. Note the local extrema at around 6 hours and 24 hours.

Figure 5: Left: Distribution of distance from users’ home and workplace to the destination location of their queries. Right: Distribution of distance from users’ home and workplace to the origin location of their queries.

Figure 6: Visiting probability of queried location after using route planning service

Figure 7: POI category rankings of map query destinations. The top 2 are transportation and restaurant, respectively.

4.2 Point-of-Interest Information of Map Query

Another factor that could influence this relationship is the POI information of queried locations. Most queried locations in our dataset are classified into 14 POI categories such as transportation, restaurant and hotel. For the remaining 20% of queried locations without category information, we first employ a CRF-based algorithm [13] to segment the keywords into tokens, then vectorize the tokens into one-hot encoded features and use them as input to train a multi-label SVM classifier, which automatically assigns POI categories. This simple method achieves an overall accuracy of 92% when tested on the labeled query records. We computed the percentage of each category among the queried locations that were actually visited by users. As shown in Figure 7, the top 2 categories, whose percentages are significantly higher than those of other types and which together account for almost 36% of all matched queries, are transportation (such as airports and railway stations) and restaurant. It is also interesting that different categories of queries often exhibit distinct hour-of-day dynamics; e.g., in Figure 8, restaurants are queried most during dining times, and hospitals are queried most at the beginning of daily reception hours. Therefore we later use the POI category of the queried location as a feature in our prediction model.

4.3 Query Sessions

Extensive research has found that a variety of human behaviors are bursty in nature and can be well characterized by heavy-tailed distributions. As we observed in users’ map query logs, one or several location queries within a short time often follow a long period of inactivity. It is therefore reasonable to view a user’s historical map queries as consecutive sessions, and to monitor their characteristics and in-session and inter-session interactions.

We apply a 30-minute time window to separate a series of queries into sessions. If the time interval between two consecutive queries is shorter than 30 minutes, they are assigned to the same session; otherwise they fall into different ones. We also record important features of each query session, such as the number of queries, session duration and inter-session time.
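As a minimal sketch of the session splitting rule above, the following Python functions group query timestamps with a 30-minute window and derive the three session features mentioned. The function names and the dictionary keys are hypothetical, and timestamps are assumed to be in seconds.

```python
def split_sessions(timestamps, window_s=30 * 60):
    """Group query timestamps (in seconds) into sessions.

    Two consecutive queries belong to the same session when the gap
    between them is shorter than the window.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] < window_s:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

def session_features(sessions):
    """Compute size, duration and inter-session time for each session.

    The inter-session time is measured from the end of the previous
    session and is None for the first session.
    """
    feats = []
    for i, s in enumerate(sessions):
        gap = s[0] - sessions[i - 1][-1] if i > 0 else None
        feats.append({
            "size": len(s),
            "duration_s": s[-1] - s[0],
            "inter_session_s": gap,
        })
    return feats
```

A session containing a single query gets a duration of 0 under this definition, which is consistent with the single-query sessions of zero length reported in this section.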

Figure 9: Property of Query Sessions. (a) In-session time (log hour). (b) Inter-session time (log).

Map queries are much more concise than general search queries, because they express a user’s instant and direct interest in locations. Users often rephrase query keywords in order to obtain better results from the search engine, and these queries naturally form a query session.

One interesting phenomenon is users’ repeated query behavior, which mostly occurs within one session and can be classified into three types. The first is similar to what is observed in general search engines [16]: users frequently rephrase their location queries within a short time period to obtain the best query feedback, for example by trying the full name or an abbreviation of the same place. In the second, the user intends to visit one or several categories of locations, searches for all the places that could satisfy his/her current needs, and then chooses one of them to visit after comparison. In the third, users search for the same location on Baidu map just before leaving for the destination (to plan a new route) or when they are very close to it (to check the exact location). Such behavior indicates that the user is strongly inclined to visit the place, or is already on the way. Intuitively, the more often a user repeatedly searches for one location, the higher the probability that he/she will visit it in the near future.

According to our observation, 37.7% of sessions consist of only one map query and have a session time length of 0. For the other query sessions, the session time length ranges from a few seconds to 8 hours, and is around 13 minutes on average. The overall distribution is long-tailed, as shown in Figure 9(a); there is a minor gap near 30 minutes (10^-0.3 hour) due to the inter-session time window setting. The inter-session time length distribution shows a periodic peak almost every 12 hours, indicating temporal regularity in query behavior.

4.4 Feature extraction

Based on the discoveries above, we extract the following features for query-activity prediction.

4.4.1 Map query features

Map query features are defined as follows:

• Query session features: We segment user queries into consecutive query sessions using an inter-query time window of 30 minutes, and calculate session size (number of in-session queries), session time length and inter-session time length to describe the properties of the session.

• Query OD distance: this is a vital factor that affects a user’s decision whether to visit a location [18]. In our validation, we observed that the average OD distance of visited query locations is significantly shorter than that of non-visited query locations.

• Distance from the query origin and destination to the user’s home, workplace and nearest stay point. A large portion of map query requests occur close to users’ stay points (home, workplace), but once users deviate from their familiar daily routes, they tend to issue queries more frequently to navigate their surroundings. When a user has a list of candidate locations in mind, the nearest place often stands out, whether nearest to the current location or nearest to a location he/she is already familiar with (stay points).

• Query type: search, walking, driving or transit. The route planning types represent different transportation modes and have shown distinct query-activity matching percentages.

• Query index and number of matched queries in the past: this context information describes the user’s past query behavior. Generally, when a user has issued queries about many places and visited them afterwards, he/she is more likely to repeat this behavior in the current query session.

Figure 8: Queries of different POI categories exhibit distinct temporal dynamics in hour-of-day popularity. (a) Restaurant: daily popularity of restaurant queries peaks at 1am, 12pm and 6pm, mostly during dining times. (b) Transportation: daily popularity of transportation queries is highest after midnight, when other needs, such as food and shopping, become dormant. (c) Hotel: daily popularity of hotel queries is highest during night time. (d) Hospital: daily popularity of hospital queries peaks at 8am, when most hospitals in China begin to receive patients.

4.4.2 Mobility features

With the mobility traces and calculated stay points in hand, we compute the following features for query-activity prediction:

• Hour-of-day movement distance: we compute the average daily movement distance for each hour of the day. Extreme cases, such as a single long trip that exceeds 100 km, are excluded from the calculation.

• Hour-of-day GPS records percentage: we compute how a user’s GPS records are distributed over the hours of the day. Since the generation of GPS records indicates that the user is currently using Baidu map, this feature helps describe the variation in the user’s daily activity level.

• Radius of gyration: a widely used measure for quantifying the typical range of a user’s trajectory.

• Commute distance: most people’s daily movements consist of home-work commutes, and commute distance represents their normal movement and radius of gyration level.

• Travel frequency: the frequency of long trips that exceed 100 km. This feature is included to complement the information in hour-of-day movement.

• Temporal features: Human mobility presents particularly high temporal regularity, i.e., people tend to stay at certain locations during specific time periods. Meanwhile, the probability of visiting distant queried locations is much higher on weekends and holidays than on weekdays. Thus we include temporal information about the query as features: hour of day, day of week, and weekday/weekend/holiday.
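Among the mobility features listed above, the radius of gyration has a compact definition: the root mean square distance of a user's location records from their center of mass. The sketch below is an illustration under assumptions, not the computation used in this work: the center of mass is approximated by the mean latitude and longitude (reasonable for trajectories that are small relative to the Earth), every record is weighted equally, and the function names are hypothetical.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def radius_of_gyration_m(points):
    """Root mean square distance (meters) of (lat, lon) records
    from their center of mass.

    Assumption: the center of mass is the plain average of the
    coordinates, and each record carries equal weight.
    """
    n = len(points)
    center = (sum(p[0] for p in points) / n,
              sum(p[1] for p in points) / n)
    mean_sq = sum(haversine_m(p, center) ** 2 for p in points) / n
    return math.sqrt(mean_sq)
```

A user who never leaves one location has a radius of gyration of zero, while a user who splits records evenly between two places gets half the distance between them.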

4.4.3 POI features

• POI category: transportation, restaurant, real estate, shopping, district, education and university, enterprise and industry zone, travel spots, hotel, entertainment, hospital, government, life services, finance and banks. As discussed above, the POI category characterizes the function of the queried location and reflects the user’s purpose of visit, and can therefore be a valuable input for query-activity prediction.

• POI query number, visit number, and popularity ranking: based on each POI location’s total historical query frequency and historical visits, we calculate the popularity ranking of the queried location.

• POI category matched ratio: the ratio of the number of visits to the number of queries for each category of POI location; e.g., hotel POIs have the highest ratio, 52%.
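The two aggregate POI features above can be computed with simple counting, as in the following Python sketch. The input layout (pairs of category and visited flag, and a mapping from POI to query count) and the function names are hypothetical, and ranking by query count alone with ties broken by POI name is an assumption, since the text does not specify how query and visit counts are combined into a ranking.

```python
from collections import Counter

def category_matched_ratio(queries):
    """Ratio of visited queries to all queries for each POI category.

    queries is an iterable of (poi_category, visited) pairs, where
    visited is True if the query was matched to a later visit.
    """
    asked, visited = Counter(), Counter()
    for category, was_visited in queries:
        asked[category] += 1
        visited[category] += int(was_visited)
    return {category: visited[category] / asked[category] for category in asked}

def popularity_rank(query_counts):
    """Rank POIs by descending query count; rank 1 is the most queried.

    Ties are broken alphabetically by POI name (an assumption).
    """
    ordered = sorted(query_counts, key=lambda poi: (-query_counts[poi], poi))
    return {poi: rank for rank, poi in enumerate(ordered, start=1)}
```

To avoid leaking the prediction target, these statistics would normally be computed only from queries issued before the query being predicted.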

5. PREDICT IN-WORLD ACTIVITY VIA MAP QUERY

5.1 Datasets

Our datasets consist of two parts: users’ aggregated map query logs and their mobility traces. We start from 4,080,283 Baidu map app users who issued map queries in Beijing City on January 1, 2015, and then collect and integrate their entire map query logs and mobility traces until March 31, 2015, 90 days altogether. After invalid map queries and location records are excluded, we are able to conduct quantitative analysis using 118,671,390 map queries and 6,493,688,641 user GPS location records, on average 29.1 map queries and 1591.5 GPS records per user, respectively. According to our validation with user mobility traces, 42.6% of map queries lead to users’ in-world activity, at 4,994,373 unique POI locations throughout China.

Table 1: Query OD City Percentage

Origin->Destination    Query number    Percentage
Beijing->Beijing       64,278,560      54.2%
Beijing->Others        3,407,144       2.9%
Others->Beijing        9,242,424       7.8%
Others->Others         37,890,702      35.1%

Among all the query records, 57.26% (67,953,813) are issued from Beijing City, since we start the data collection process from Beijing local users, and 61.95% (73,520,984) of query destinations are within Beijing.
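The per-user averages and Beijing shares quoted above follow from the raw counts. The short check below recomputes them; the variable names are introduced here for illustration only.

```python
# Raw counts as reported in the text.
users = 4_080_283
queries = 118_671_390
gps_records = 6_493_688_641
issued_from_beijing = 67_953_813
destined_to_beijing = 73_520_984

# Per-user averages, rounded to one decimal place.
queries_per_user = round(queries / users, 1)
gps_per_user = round(gps_records / users, 1)

# Shares of all queries, in percent, rounded to two decimal places.
share_from_beijing = round(100 * issued_from_beijing / queries, 2)
share_to_beijing = round(100 * destined_to_beijing / queries, 2)
```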

All query and mobility data are collected with users’ consent, and user IDs in our experiments are indexed and profile information is omitted to guarantee anonymity. Compared with similar data sets used in related work, our data is significantly larger and more continuous, which reduces the common limitation of data scarcity in characterizing human mobility patterns.

5.2 Gradient Boosting Model for Prediction

Map query data provides rich information to infer future travel intents. As shown in Section 4, we investigated the relationship between map query and in-world visit behavior, and extracted a set of related features to characterize such behavior. Taking these features as inputs, we build a probabilistic model to predict in which new place the user will appear and when.

5.2.1 Problem Formulation

Given a user’s current location query, historical mobility trace, and historical map query log, including:

$$\{(t_0, O_0, D_0), \cdots, (t_c, O_c, D_c)\}$$

where $O$ and $D$ denote the origin location and the queried destination location, respectively, our goal is to infer whether the user will actually visit $D_c$ in the future, i.e. whether the user is observed less than 200 meters from $D_c$ in the mobility trace records of the next 15 days.
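The visit criterion above can be sketched as a labeling function. This is an illustrative sketch under stated assumptions rather than the validation code used in this work: it assumes that $t_c$ is the query timestamp, that the trace is a sequence of `(timestamp, lat, lon)` tuples, and that distance is measured with the haversine formula. The helper names are hypothetical.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def query_led_to_visit(query_time, dest, trace, radius_m=200, window_days=15):
    """True if any trace point within the window lies less than radius_m from dest.

    dest is the (lat, lon) of the queried destination; trace is an iterable of
    (timestamp, lat, lon) tuples (assumed layout).
    """
    deadline = query_time + timedelta(days=window_days)
    return any(
        query_time < t <= deadline
        and haversine_m(lat, lon, dest[0], dest[1]) < radius_m
        for t, lat, lon in trace
    )
```

The 200 meter radius and 15 day window are the values stated in the problem formulation; they are exposed as parameters so that other thresholds can be explored.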

5.2.2 Model Settings

We model this task as a binary classification problem: our goal is to predict whether the user will visit a place after issuing queries about it on Baidu map. We use a gradient boosting classifier [8], a powerful additive model that has shown great success in a variety of academic problems and industrial applications.

Here we resort to XGBoost (eXtreme Gradient Boosting), an open source gradient boosting library implemented by [2], which also provides an optimized distributed version. It has been widely adopted in many data mining competitions, and is best known for its outstanding performance and training efficiency.

Table 2: Query-activity Prediction Model Performance

Features             Error Rate    AUC
Query Features       0.206         0.841
Mobility Features    0.218         0.830
POI Features         0.226         0.817
All                  0.189         0.869

The two most important parameters in the model, the max depth of boosted trees and the number of trees, are tuned to 8 and 20, respectively. Another useful feature is that XGBoost supports probability values in [0,1] as labels. Since the validation of query-activity is not definite, i.e. matched queries may have only been passed by instead of visited, and unmatched queries may simply be impossible to confirm due to trace sparsity, we assign probabilistic labels 0.99 and 0.01 to positive and negative samples, respectively.
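A minimal sketch of these settings with the XGBoost Python package is given below. The tree depth, number of trees and soft labels come from the text; the `binary:logistic` objective, the function names and the feature matrix layout are assumptions, and all remaining parameters are left at library defaults.

```python
def soft_labels(matched, positive=0.99, negative=0.01):
    """Map matched / unmatched query flags to probabilistic training labels."""
    return [positive if m else negative for m in matched]


# Settings stated in the text: max tree depth 8 and 20 boosted trees.
PARAMS = {
    "objective": "binary:logistic",  # assumption: logistic objective, labels in [0, 1]
    "max_depth": 8,
}
NUM_TREES = 20


def train_model(features, matched):
    """Train a booster on a 2-D feature matrix (rows = queries).

    numpy and xgboost are imported lazily so that the helpers above can be
    used without either package installed.
    """
    import numpy as np
    import xgboost as xgb

    labels = np.asarray(soft_labels(matched), dtype=float)
    dtrain = xgb.DMatrix(np.asarray(features, dtype=float), label=labels)
    return xgb.train(PARAMS, dtrain, num_boost_round=NUM_TREES)
```

Using 0.99 and 0.01 rather than hard 1 and 0 keeps the label inside the open interval, which reflects the uncertainty of the matching procedure described above.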

In order to avoid overfitting and ensure the model’s robustness against possible errors in extracted features and query records, we evaluate our model via 5-fold cross validation.
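For reference, 5-fold cross validation partitions the samples into five disjoint folds and uses each fold once as the test set. The generator below is a generic sketch of that split; the shuffling seed and the function name are assumptions, and how the folds were drawn in our experiments is not specified here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each sample appears in exactly one test fold, so averaging the metric over the five folds uses every sample for testing once.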

5.3 Experiments

5.3.1 Evaluation metrics

We use test error (logistic loss) and AUC score (area under the ROC curve) as metrics to quantify the performance of the boosting model, and the predictability of a user’s in-world activity based on his/her map query and historical mobility trace features.

All features are grouped into 3 categories, which are used as input separately to further evaluate their influence on the model’s prediction accuracy. We also calculate the detailed importance score for each feature in the complete training model.
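The metrics can be computed from binary labels and predicted probabilities as sketched below. Because the text mentions logistic loss while Table 2 reports an error rate, both are shown; the 0.5 decision threshold is an assumption. The AUC uses the pairwise definition, which is quadratic in the number of samples and intended only as an illustration.

```python
from math import log


def error_rate(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    wrong = sum((s >= threshold) != bool(y) for y, s in zip(labels, scores))
    return wrong / len(labels)


def log_loss(labels, scores, eps=1e-15):
    """Mean logistic loss; scores are clipped to avoid log(0)."""
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1 - eps)
        total -= y * log(s) + (1 - y) * log(1 - s)
    return total / len(labels)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```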

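5.3.2 Results

As shown in Table 2, on the query-activity prediction task, our model achieves a test error of 0.189 and an AUC score of 0.869 using all the features as training input. When the three feature groups are used separately, query features perform best (error 0.206, AUC 0.841), followed by mobility features (0.218, 0.830) and POI features (0.226, 0.817). These results indicate that query-activity formations are predictable at the individual level.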

5.3.3 Feature Importance

All the features used in our model were extracted based on our observations as detailed in Section 4. We evaluate the importance of each feature, as shown in Figure 10.

Figure 10: Feature importance rankings in query-activity prediction model

• Query features: Query features prove to be the most important factor in determining a user’s in-world visit behavior. This verifies our empirical assumption mentioned in Section 4 that the more frequently a user issues queries about the same location, the higher the probability that the user will visit the queried place in the future. Besides, the time interval between repeat queries also expresses the intensity of the user’s interest in visiting the queried location.

Origin-destination distance and home-destination distance are also useful features. As described in Section 4, the OD distances of visited queries are significantly shorter than those of unvisited ones.

• Mobility features: Mobility features such as the HourOfDay movement distance, number of HourOfDay location records, daily movement distance, radius of gyration and travel frequency account for a large portion of model performance in total. This indicates that future mobility behavior, including novelty-seeking behavior, is also greatly influenced by one’s daily mobility patterns.

Temporal features including hour-of-day, day-of-week and holiday are less important than query features and mobility features. One possible reason is that novelty-seeking behaviors are much more diverse in temporal range, although human mobility exhibits temporal regularity.

• POI features: POI category information of the queried location, not surprisingly, also plays an important role. This is because the category information indicates the function of the queried location and is therefore directly related to people’s in-world activities.

6. CONCLUSIONS AND FUTURE WORK

In this paper, we integrate large-scale map query data and mobility trace data from Baidu map users to characterize the interplay between users’ queries and their in-world activity. After studying important traits of map queries and mobility traces, we build a gradient boosting model to predict users’ future in-world activity based on their current query and well-designed features, then evaluate the experimental results and assess the contribution of each feature. Our research has great potential for a large spectrum of applications ranging from POI recommendation and accurate customer targeting to urban planning and modeling.

Next Place Prediction: Most existing work on next place prediction focuses on modeling people’s mobility regularity and inferring the next location based on their historical transitions between stay points [14][9]. Though it yields good performance, this mechanism fails to incorporate people’s novelty-seeking behavior. Map queries, which best reflect people’s location demands and interests in novel places, can therefore offer valuable additional information about users’ future in-world activities, and further enhance the performance of traditional next place prediction models.

Urban Flow Estimation: Most flow estimation methods rely solely on historical or real-time collective user mobility traces as input, and then assess the density zones and traffic flow [1]. This kind of system is sufficient in many scenarios, and capable of flow estimation at fine time granularity, but it is not robust when an event or anomaly occurs. For example, when a public stadium is about to host a big event tonight, an unusual number of map queries about this location would emerge and be detected as the event time approaches, even though the real-time mobility flow in this region has shown only very little fluctuation. With the addition of map query information, we can sense and quantify the massive location demands from the crowd, and thus calibrate and forecast urban mobility flow beforehand.
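One simple way to detect such a surge is to compare the current query count for a location with its historical counts in the same time slot. The sketch below uses a z-score for this purpose; it is an illustrative assumption and not a method evaluated in this paper, and the threshold of 3.0 and the function names are hypothetical.

```python
from statistics import mean, pstdev


def query_surge_score(history, current):
    """z-score of the current query count against historical counts for the same slot."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # No historical variation: any deviation is treated as unbounded.
        if current == mu:
            return 0.0
        return float("inf") if current > mu else float("-inf")
    return (current - mu) / sigma


def is_surge(history, current, threshold=3.0):
    """Flag a location whose query count is far above its usual level."""
    return query_surge_score(history, current) > threshold
```

A location flagged this way could then be used to adjust a flow forecast that would otherwise rely on mobility traces alone.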

7. REFERENCES

[1] F. Alhasoun, A. Almaatouq, K. Greco, R. Campari, A. Alfaris, and C. Ratti. The city browser: Utilizing massive call data to infer city mobility dynamics.

[2] T. Chen and T. He. xgboost: extreme gradient boosting. 2015.

[3] H. Choi and H. Varian. Predicting the present with Google Trends. Economic Record, 88:2–9, 2012.

[4] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603–619, 2002.

[5] J. Cranshaw, R. Schwartz, J. I. Hong, and N. M. Sadeh. The livehoods project: Utilizing social media to understand the dynamics of a city. In ICWSM, 2012.

[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.

[7] T. Fiatal. Predictive content delivery, Jan. 2 2009. US Patent App. 12/348,136.

[8] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.

[9] S. Gambs, M.-O. Killijian, and M. N. del Prado Cortez. Next place prediction using mobility markov chains. In Proceedings of the First Workshop on Measurement, Privacy, and Mobility, page 3. ACM, 2012.

[10] Q. Gan, J. Attenberg, A. Markowetz, and T. Suel. Analysis of geographic queries in a search engine log. In Proceedings of the first international workshop on Location and the web, pages 49–56. ACM, 2008.

[11] M. C. Gonzalez, C. A. Hidalgo, and A.-L. Barabasi. Understanding individual human mobility patterns. Nature, 453(7196):779–782, 2008.

[12] A. Majid, L. Chen, G. Chen, H. T. Mirza, I. Hussain, and J. Woodward. A context-aware personalized travel recommendation system based on geotagged social media data mining. International Journal of Geographical Information Science, 27(4):662–684, 2013.

[13] F. Peng, F. Feng, and A. McCallum. Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th international conference on Computational Linguistics, page 562. Association for Computational Linguistics, 2004.

[14] C. Song, Z. Qu, N. Blumm, and A.-L. Barabasi. Limits of predictability in human mobility. Science, 327(5968):1018–1021, 2010.

[15] J. Steenstra, A. Gantman, K. Taylor, and L. Chen. Location based service (lbs) system and method for targeted advertising, Aug. 31 2004. US Patent App. 10/931,309.

[16] J. Teevan, E. Adar, R. Jones, and M. A. Potts. Information re-retrieval: repeat queries in Yahoo’s logs. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 151–158. ACM, 2007.

[17] J. Teevan, A. Karlson, S. Amini, A. Brush, and J. Krumm. Understanding the importance of location, time, and people in mobile local search behavior. In Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services, pages 77–80. ACM, 2011.

[18] R. West, R. W. White, and E. Horvitz. Here and there: goals, activities, and predictions about location from geotagged queries. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 817–820. ACM, 2013.

[19] S.-H. Yang, R. W. White, and E. Horvitz. Pursuing insights about healthcare utilization via geocoded search queries. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 993–996. ACM, 2013.

[20] Y. Ye, Y. Zheng, Y. Chen, J. Feng, and X. Xie. Mining individual life pattern based on location history. In Mobile Data Management: Systems, Services and Middleware, 2009. MDM’09. Tenth International Conference on, pages 1–10. IEEE, 2009.

[21] N. J. Yuan, F. Zhang, D. Lian, K. Zheng, S. Yu, and X. Xie. We know how you live: exploring the spectrum of urban lifestyles. In Proceedings of the first ACM conference on Online social networks, pages 3–14. ACM, 2013.