
Hierarchical Bayesian Nonparametric Approach to Modeling and Learning the Wisdom of Crowds of Urban Traffic Route Planning Agents

Jiangbo Yu†, Kian Hsiang Low†, Ali Oran§, Patrick Jaillet‡

Department of Computer Science, National University of Singapore, Republic of Singapore†

Singapore-MIT Alliance for Research and Technology, Republic of Singapore§

Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA‡

{yujiang, lowkh}@comp.nus.edu.sg†, [email protected]§, [email protected]

Abstract—Route prediction is important to analyzing and understanding the route patterns and behavior of traffic crowds. Its objective is to predict the most likely or “popular” route of road segments from a given point in a road network. This paper presents a hierarchical Bayesian nonparametric approach to efficient and scalable route prediction that can harness the wisdom of crowds of route planning agents by aggregating their sequential routes of possibly varying lengths and origin-destination pairs. In particular, our approach has the advantages of (a) not requiring a Markov assumption to be imposed and (b) generalizing well with sparse data, thus resulting in significantly improved prediction accuracy, as demonstrated empirically using real-world taxi route data. We also show two practical applications of our route prediction algorithm: predictive taxi ranking and route recommendation.

Keywords—wisdom of crowds; crowdsourcing; sequential decision making; route prediction; intelligent transportation systems; hierarchical Dirichlet and Pitman-Yor process

I. INTRODUCTION

Knowing and understanding the route patterns and behavior of traffic crowds has grown to be critical to the goal of achieving smooth-flowing, congestion-free traffic [1], especially in densely-populated urban cities. Of central importance to analyzing the traffic route patterns and behavior is route prediction, whose objective is to predict the most likely or “popular” route of road segments from a given point¹ in a road network. Such a capability is useful in many real-world traffic applications such as studying drivers’ route preferences, enhancing their situational awareness, and supporting drivers’ safety and assistance technologies, among others (see Section V). In practice, it is often not possible to make perfect route predictions because (a) the quantity of sensing resources (e.g., loop detectors) that can be allocated to measure traffic flow in a dense urban road network is cost-constrained, and (b) the actual traffic flow is usually unknown or difficult to formulate precisely in terms of other traffic-related variables since it varies depending on the driver types (e.g., truck or taxi/cab drivers, tourists) and road conditions/features (e.g., tolls, congestion, traffic lights, speed limit, road length, travel time). In the absence of accurate traffic flow information to determine the most popular route(s) exactly, we propose in this paper to rely on crowdsourcing for route prediction, specifically by consulting the wisdom of traffic crowds.

¹Besides a given origin, a destination is sometimes also specified (see Section V-B).

The proliferation of low-cost GPS technology has enabled the collection of massive volumes of spatiotemporal data on large crowds of human-driven vehicles such as taxis/cabs, which we can exploit to learn about the patterns of their planned routes for use in route prediction. To elaborate, though these human agents have different levels of expertise and criteria affecting their individual ability to compute and plan routes, consensus patterns can generally be observed among their planned routes in real-world traffic networks [2], [3]. If the crowds of route planning agents are sufficiently large, these consensus patterns potentially approximate well the actual most popular routes, which, as mentioned earlier, are not available. Hence, we will be able to harness the wisdom of such traffic crowds for learning the route prediction capability.

The “wisdom of crowds” phenomenon refers to the observation that the aggregation of solutions from a group of individual agents often performs better than the majority of individual solutions [4]. Traditionally, such a phenomenon has been widely applied to simple point estimation of continuous-valued physical quantities [4] and discrete labeling [5], [6]. In contrast, our challenge here is to aggregate more complex structured outputs (in particular, sequential decisions) from the crowds of route planning agents for performing route prediction. There are several non-trivial practical issues confronting this challenge:

• Real-time prediction and scalable learning. For the consensus patterns to closely approximate the actual most popular routes, a huge amount of traffic data (i.e., planned routes) is needed to filter out the noise and variation arising from different agents’ individual planning abilities. Furthermore, prediction has to be performed in real time in order to be useful for the applications described above. How then can an aggregation model be built to learn scalably and predict efficiently?

• Sequential nature of routes. An agent’s history of traversed road segments can potentially affect its future route. For example, suppose that two agents originating from different locations share some common route and subsequently split up to go to different destinations. This example raises an important trade-off question: How much history is needed to predict the future route accurately? Existing literature has assumed a Markov property [7], [8], that is, the future route of an agent depends only on its current road segment. This is clearly violated in the example; longer route histories have to be considered in order to predict the future routes of the two agents correctly.

• Data sparsity. Though the amount of traffic data is huge, it may only sparsely populate a dense urban road network. As a result, many road segments are not traversed by any agent. How can an aggregation model still be built to predict accurately when the data is sparse?

This paper presents a hierarchical Bayesian nonparametric approach to efficient route prediction (Section III) that, in particular, can exploit the wisdom of crowds of route planning agents by aggregating their sequential routes of possibly varying lengths and origin-destination pairs. It resolves the above-mentioned issues in the following ways:

• Incremental/online learning. Learning is made scalable in an incremental/online fashion. In this manner, our aggregation model can be trained with a small amount of data initially and updated as and when new training data arrives. We show that our incremental/online learning algorithm incurs time linear in the size of the training data (Section III-D) and can satisfy the real-time requirement (Section IV-D).

• Non-Markovian property. Our route prediction algorithm does not require imposing a Markov assumption (Section III-C). Consequently, the future route of an agent can depend on the complete history of traversed road segments. Without imposing a Markov assumption, the prediction accuracy can be improved considerably, as demonstrated empirically using real-world taxi route data (Section IV-B).

• Smoothing. Smoothing (Section III-D) is an effective way to deal with the issue of sparsity. The key idea is to add some pseudo agents and make them travel along road segments that are not traversed by any agent in the traffic data. As a result, the prediction accuracy can be significantly improved, as shown empirically in Section IV-C.

Section V shows two practical applications of our route prediction algorithm: predictive taxi ranking and route recommendation.

II. BACKGROUND

A. Notations and Preliminaries

Definition 1 (Road Network): The road network G = (V, E) is defined as a directed graph that consists of a set V of vertices denoting intersections or dead ends and a set E of directed edges denoting road segments.

Definition 2 (Route): A route R of length i is defined as a walk in the road network G, that is, R ≜ (r₁, r₂, . . . , rᵢ) where the road segments rₖ, rₖ₊₁ ∈ E are adjacent for k = 1, . . . , i − 1.
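To make Definitions 1 and 2 concrete, here is a minimal sketch of how a road network and a route might be represented; the class and method names are our own illustration, not prescribed by the paper:

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # a directed road segment (from-vertex, to-vertex)

class RoadNetwork:
    """Directed graph G = (V, E): vertices are intersections or dead ends,
    directed edges are road segments."""
    def __init__(self, edges: Set[Edge]):
        self.edges = edges
        self.vertices = {v for e in edges for v in e}

    def is_route(self, segments: List[Edge]) -> bool:
        """A route (r_1, ..., r_i) is a walk: every segment is an edge of G,
        and consecutive segments are adjacent (r_k ends where r_{k+1} starts)."""
        return all(s in self.edges for s in segments) and all(
            segments[k][1] == segments[k + 1][0]
            for k in range(len(segments) - 1))
```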

B. Problem Formulation

Recall from Section I that the objective of route prediction is to predict the most likely future route given a history of traversed road segments. We achieve this in a probabilistic manner: Let the route R′ denote the sequence of road segments to be traversed in the future, that is, R′ ≜ (r_{i+1}, . . . , r_m) for some m ≥ i + 1. Given a set D of agents’ routes of possibly varying lengths and origin-destination pairs, the most likely future route conditioned on a past route R is derived using

$$\max_{R'} P(R' \mid R, D)\,.$$

It is computationally challenging to compute the posterior probability P(R′|R, D) for every possible past route R because the number of possible past routes is exponential in the length of the history of traversed road segments. To reduce this computational burden, one can decrease the length of the history by imposing a Markov property:

$$P(R' \mid R, D) = P(R' \mid R_n, D) \quad (1)$$

where Rₙ denotes a reduced route of R consisting of only the past n road segments, that is, Rₙ ≜ (r_{i−n+1}, . . . , r_i) for 1 ≤ n ≤ i, and R₀ is an empty sequence. Note that n is usually known as the order of the Markov property.

C. Maximum Likelihood Estimation

A simple, widely-used approach to model the posterior probability P(R′|Rₙ, D) is maximum likelihood estimation (MLE) [7]. Let c_{Rₙ} and c_{Rₙ·R′} denote, respectively, the count of Rₙ and the count of Rₙ concatenated with R′ observed within the routes in the dataset D, so that $c_{R_n} = \sum_{R'} c_{R_n \cdot R'}$. Then, P(R′|Rₙ, D) (1) can be approximated using MLE by

$$P_{\mathrm{MLE}}(R' \mid R_n, D) = \frac{c_{R_n \cdot R'}}{c_{R_n}}\,.$$

In practice, MLE often performs poorly, especially when the data is sparse. If a route R′ has not been observed within the routes in the dataset D, then P_MLE(R′|Rₙ, D) = 0, which may not be the true value of P(R′|Rₙ, D) (1). It may be due to sparse data that such a route has not been observed; in reality, it may still be associated with a small probability in the true distribution. Researchers from the linguistics community have developed smoothing methods to resolve this issue, which we discuss next.
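As a minimal illustration of MLE from counts (a sketch of ours, with R′ taken to be a single next segment for simplicity; the helper names are not from the paper):

```python
from collections import defaultdict

def count_transitions(routes, n):
    """Tally c_{R_n} and c_{R_n . R'} over a dataset of routes, where each
    route is a tuple of road-segment ids and R' is the next segment."""
    c_hist, c_pair = defaultdict(int), defaultdict(int)
    for route in routes:
        for i in range(n, len(route)):
            hist = route[i - n:i]
            c_hist[hist] += 1
            c_pair[(hist, route[i])] += 1
    return c_hist, c_pair

def p_mle(next_seg, hist, c_hist, c_pair):
    """P_MLE(R'|R_n, D) = c_{R_n . R'} / c_{R_n}: zero for any unseen
    continuation, which is exactly the sparsity problem described above."""
    c = c_hist.get(hist, 0)
    return c_pair.get((hist, next_seg), 0) / c if c else 0.0
```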

D. Smoothing Methods

To tackle sparsity, the simplest form of smoothing makes a prior assumption that every possible future route R′ is observed once more than it actually has been. This smoothing method then approximates P(R′|Rₙ, D) (1) by

$$P_S(R' \mid R_n, D) = \frac{c_{R_n \cdot R'} + 1}{c_{R_n} + |\mathcal{R}'|} \quad (2)$$

where $\mathcal{R}'$ is the set of all possible future routes R′ of the same length. It can be observed from (2) that every route R′ retains a small non-zero probability of being traversed even if it is not observed in the dataset (i.e., c_{Rₙ·R′} = 0). Hence, this smoothing method mitigates the issue of sparsity.
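A sketch of the add-one estimate (2), reusing counts of the form tallied above; `num_outcomes` stands for |R′|, the number of possible future routes of the given length (our illustrative parameter name):

```python
def p_addone(next_seg, hist, c_hist, c_pair, num_outcomes):
    """P_S(R'|R_n, D) = (c_{R_n . R'} + 1) / (c_{R_n} + |R'|): every future
    route keeps a small non-zero probability even when never observed."""
    return (c_pair.get((hist, next_seg), 0) + 1) / \
           (c_hist.get(hist, 0) + num_outcomes)
```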

To improve prediction accuracy, a more advanced method called interpolated smoothing can be used. It performs a weighted average of smoothing probabilities (2) with different orders of Markov property and is formulated as follows:

$$P_I(R' \mid R_n, D) = \lambda_n P_S(R' \mid R_n, D) + (1 - \lambda_n) P_I(R' \mid R_{n-1}, D)$$

where λₙ ∈ [0, 1] is a fixed weight applied to the n-th-order smoothing probability. To understand why interpolated smoothing can work well, suppose Rₙ is not observed within the routes in dataset D but some shorter routes Rₘ (m < n) within the route Rₙ are; it is then reasonable to incorporate smoothing probabilities with these lower m-th-order Markov properties to yield a more accurate prediction.
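A sketch of the interpolation recursion, assuming fixed per-order weights `lam` and per-order count tables are given (both are our illustrative inputs):

```python
def p_interp(next_seg, hist, counts_by_order, lam, num_outcomes):
    """P_I(R'|R_n, D) = lam[n] * P_S(R'|R_n, D)
                        + (1 - lam[n]) * P_I(R'|R_{n-1}, D).
    counts_by_order[n] = (c_hist, c_pair) for histories of length n."""
    n = len(hist)
    if n == 0:
        return 1.0 / num_outcomes  # uniform base case for the empty history
    c_hist, c_pair = counts_by_order[n]
    p_s = (c_pair.get((hist, next_seg), 0) + 1) / \
          (c_hist.get(hist, 0) + num_outcomes)
    # Back off by dropping the earliest segment: R_{n-1} = hist[1:].
    return lam[n] * p_s + (1 - lam[n]) * p_interp(
        next_seg, hist[1:], counts_by_order, lam, num_outcomes)
```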

Another advanced method called interpolated Kneser-Ney (IKN) smoothing estimates the probability of the future route by discounting the count c_{Rₙ·R′} by a fixed weight d and interpolating with lower-order probabilities:

$$P_{\mathrm{IKN}}(R' \mid R_n, D) = \frac{\max(0,\, c_{R_n \cdot R'} - d)}{c_{R_n}} + \frac{d\, t_n}{c_{R_n}}\, P_{\mathrm{IKN}}(R' \mid R_{n-1}, D) \quad (3)$$

where $t_n \triangleq |\{R' \mid c_{R_n \cdot R'} > 0\}|$ denotes the number of distinct future routes R′ observed after traversing the past route Rₙ in the dataset. The role of tₙ is to make the probability estimates sum to 1. Modified Kneser-Ney (MKN) smoothing generalizes IKN smoothing by making the discounting weight depend on the length of Rₙ; interested readers are referred to [9] for more details. In the next section, we describe our route prediction algorithm, which exploits a form of smoothing to resolve the issue of sparsity.
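For concreteness, a sketch of the IKN recursion (3) with a single fixed discount d (our simplification; MKN would let d depend on the history length):

```python
def p_ikn(next_seg, hist, counts_by_order, d, num_outcomes):
    """P_IKN via (3): discount observed counts by d and hand the freed mass
    d * t_n / c_{R_n} to the lower-order estimate."""
    if len(hist) == 0:
        return 1.0 / num_outcomes  # uniform base case
    c_hist, c_pair = counts_by_order[len(hist)]
    c = c_hist.get(hist, 0)
    if c == 0:  # history never observed: pure backoff to shorter history
        return p_ikn(next_seg, hist[1:], counts_by_order, d, num_outcomes)
    # t_n: number of distinct continuations observed after hist.
    t_n = sum(1 for (h, _), v in c_pair.items() if h == hist and v > 0)
    return max(0, c_pair.get((hist, next_seg), 0) - d) / c \
        + d * t_n / c * p_ikn(next_seg, hist[1:], counts_by_order, d,
                              num_outcomes)
```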

III. SEQUENTIAL MODEL FOR AGGREGATING ROUTES

A. Non-Markovian Aggregation Tree

Since a Markov property is not imposed, we need to address the challenge of aggregating and storing an enormous set D of agents’ sequential routes efficiently. To achieve this, our sequential model utilizes the data structure of a suffix tree² [10], which we call a non-Markovian aggregation tree (Fig. 1). Specifically, every possible prefix of an agent’s sequential route in D, when reversed, is represented by some path in the tree starting from its root. Consequently, the tree can aggregate the sequential decisions within the routes of all the agents. Since the tree’s height is determined by the length of the longest route, the tree can maintain information on all past sequential decisions made within the routes of all agents, thus preserving the non-Markovian property. As an illustration (Fig. 1), suppose there are two routes R1 = (r3, r12, r5, r1) and R2 = (r6, r3); every possible prefix of R1 and R2, when reversed, is represented as a path in the tree. For example, after updating the tree (i.e., assuming that the tree has been updated using R1) with R2, two solid nodes corresponding to (r6) and (r6, r3) are added.

²A suffix tree of a set of strings is a compacted trie of all suffixes of all the strings.

Figure 1: Non-Markovian aggregation tree (refer to text for more details).

To perform route prediction, the key step is to learn the posterior probability P(·|R, D) for every possible past route R found in D and store them efficiently. To simplify notation, let G_R ≜ P(·|R, D). As shown in Fig. 1, G_R is stored in the node N_R that is indexed by a context representing some past route R. A node’s context can be formed by appending the edge labels along the path back to the root node; it therefore appears as the suffix of its descendants’ contexts.

The tree construction algorithm is shown in Algorithm 1. The tree is constructed dynamically from the training data in O(|D|ℓ²) time and O(|D|ℓ) space, where ℓ is the length of the longest route in the dataset D.

Function UpdateTree(Tree, R)
Input: Current tree Tree, newly observed route R = (r1, . . . , ri)
Output: Updated tree Tree
for k = 1 to i do
    Search Tree for the node N_{R′} of context R′ that shares the longest suffix with R̃ ≜ (r1, . . . , rk)
    if R′ is a suffix of R̃ then
        Insert N_{R̃} as a child of N_{R′}
    else
        R″ ← longest common suffix of R̃ and R′
        Replace N_{R′} with N_{R″}
        Insert N_{R′} and N_{R̃} as children of N_{R″}
    end
end
return Tree

Algorithm 1: Tree construction algorithm.
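For illustration, a simplified (uncompacted) variant of the aggregation-tree update in Python: each reversed prefix of a route becomes a path from the root, and each node accumulates next-segment counts of the kind that later parameterize G_R. The class and function names are ours; unlike Algorithm 1, this trie is not path-compacted, trading the space bound for simplicity:

```python
class Node:
    def __init__(self):
        self.children = {}  # road segment -> child Node
        self.counts = {}    # observed next segment -> count

def update_tree(root, route):
    """Insert every reversed prefix of `route`, recording the segment that
    followed it; e.g. update_tree(Node(), ('r3', 'r12', 'r5', 'r1'))."""
    for k in range(1, len(route)):
        nxt = route[k]
        node = root
        node.counts[nxt] = node.counts.get(nxt, 0) + 1  # empty context R_0
        for seg in reversed(route[:k]):  # walk context (r_k, ..., r_1) down
            node = node.children.setdefault(seg, Node())
            node.counts[nxt] = node.counts.get(nxt, 0) + 1
    return root
```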

B. Bayesian Nonparametric Model

In this subsection, we describe how to infer G_R from the dataset D. Following the Bayesian paradigm, G_R is defined as a random variable such that a prior distribution can be placed over G_R before observing data. After observing data, G_R is updated to a posterior distribution. Prediction can then be performed by sampling from the posterior distribution. Formally, a nonparametric prior called the Pitman-Yor process [11] is placed over G_R with prior/base distribution H:

$$G_R \sim \mathrm{PY}(d, \alpha, H) \quad (4)$$

where d ∈ [0, 1) and α > −d are the discount and concentration parameters, respectively. Both d and α control the amount of variability around the base distribution H.

The Pitman-Yor process can be better understood by interpreting it as a Chinese restaurant process (CRP) [11]: The analogy is that a sequence (x₁, x₂, x₃, . . .) of customers visits a restaurant with an unbounded number of tables. The first customer sits at the first available table, while each subsequent customer sits at the k-th occupied table with probability proportional to (cₖ − d), where cₖ is the number of customers already sitting at that table, and sits at a new unoccupied table with probability proportional to (d t + α), where t is the current number of occupied tables. From the CRP, it can be observed that a Pitman-Yor process has a useful rich-get-richer property: the more customers sit at a table, the more likely future customers will join it. Using the CRP interpretation, the predictive/posterior distribution for the next customer can be derived as follows:

$$P(x_{c+1} = k \mid x_1, \ldots, x_c) = \frac{c_k - d}{c + \alpha} + \frac{d\,t + \alpha}{c + \alpha}\, H(k) \quad (5)$$

where $c = \sum_{k=1}^{t} c_k$. Note that when little data is available, the prior knowledge H plays a significant role; but after we have enough data, the predictive probability depends primarily on the data and becomes less affected by the prior distribution. Furthermore, (3) is similar to (5), which explains why IKN smoothing is an approximate version of the CRP. We adapt the same idea to perform route prediction by treating each customer as a road segment and extend it to the sequential case, as discussed next.
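A sketch of the CRP seating rule just described (sampling only; table k is chosen with probability proportional to cₖ − d, a new table with probability proportional to d t + α):

```python
import random

def crp_seat(tables, d, alpha):
    """Return the table index the next customer joins; an index equal to
    len(tables) means a new table is opened. `tables` holds counts c_k."""
    t = len(tables)
    weights = [c_k - d for c_k in tables] + [d * t + alpha]
    r = random.uniform(0.0, sum(weights))
    for k, w in enumerate(weights):
        r -= w
        if r <= 0.0:
            return k
    return t  # numerical edge case: open a new table
```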

C. Hierarchical Bayesian Nonparametric Model

Inspired by the hierarchical structure of the aggregation tree (Section III-A), we extend the Pitman-Yor process to a hierarchical model. To do this, we start by placing the base distribution H in (4) as a prior at the root node of the tree. Pitman-Yor process priors can then be placed recursively in each node along the path descending from the root of the tree as follows: Recall that Rₙ denotes a reduced route of R consisting of only the past n road segments. Let G_{Rₙ₋₁} be the prior/base distribution for G_{Rₙ} (see Fig. 1), that is, G_{Rₙ} ∼ PY(dₙ, αₙ, G_{Rₙ₋₁}). We can do this recursively to derive the hierarchical version of the Pitman-Yor process:

$$
\begin{aligned}
G_{R_n} &\sim \mathrm{PY}(d_n, \alpha_n, G_{R_{n-1}})\\
G_{R_{n-1}} &\sim \mathrm{PY}(d_{n-1}, \alpha_{n-1}, G_{R_{n-2}})\\
&\;\,\vdots\\
G_{R_0} &\sim \mathrm{PY}(d_0, \alpha_0, H)
\end{aligned}
$$

where the parameters dₙ and αₙ depend on the length of Rₙ, R₀ is the empty route of length 0, and H is often set to be a uniform distribution. Such a hierarchical structure is inherently consistent with that of the aggregation tree.

The key insight behind the hierarchical structure is that the predictive distribution G_{Rₙ} of a node N_{Rₙ} uses the predictive distribution G_{Rₙ₋₁} of its parent node N_{Rₙ₋₁} as its prior/base distribution. Hence, this structure encodes the prior knowledge that the predictive distributions G_{Rₙ}, G_{Rₙ₋₁}, . . . , G_{R₀} with routes of different lengths are similar to each other; in particular, predictive distributions with routes sharing longer suffixes are more similar to each other. As a result, our hierarchical model can interpolate and promote sharing of statistical strength between predictive distributions with higher-order (i.e., longer route) and lower-order (i.e., shorter route) Markov properties. When there is no data for a route Rₙ, our model considers shorter routes Rₘ (m < n) that appear in the training data by removing the earliest road segments from Rₙ first. This makes sense if we assume that the earliest road segments are least important in determining the next road segment, which is often the case.
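As a concrete instance (our own, using the route R1 from Fig. 1): for the past route R = (r3, r12, r5), the recursion chains the priors as

```latex
G_{(r_3, r_{12}, r_5)} \sim \mathrm{PY}\!\left(d_3, \alpha_3, G_{(r_{12}, r_5)}\right), \quad
G_{(r_{12}, r_5)} \sim \mathrm{PY}\!\left(d_2, \alpha_2, G_{(r_5)}\right), \quad
G_{(r_5)} \sim \mathrm{PY}\!\left(d_1, \alpha_1, G_{R_0}\right), \quad
G_{R_0} \sim \mathrm{PY}(d_0, \alpha_0, H)
```

so an unseen context (r3, r12, r5) backs off to (r12, r5), then to (r5), dropping the earliest segment each time.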

D. Model Inference

Exact inference over Pitman-Yor processes is intractable. Instead, we employ an approximate inference method called the Lossless Compression Sequence Memoizer (LCSM) [12], which is well-suited for real-time, scalable learning. Different from [12], the input to our model is not a single long sequence but a set of agents’ routes. Hence, we modify it to be updated after observing each new agent’s route.

The main idea of the inference procedure is to view the Pitman-Yor process in terms of the CRP. The distribution of the Pitman-Yor process is captured by the counts {cₖ, t} of the CRP (5). The predictive/posterior probability can be computed by averaging probabilities over all samples of counts {cₖ, t}, which is often performed by the computationally intensive Markov chain Monte Carlo sampling method. LCSM simply updates the counts information for a newly observed route and removes the resampling step for the counts; it can be viewed as a particle filter with only one particle. For example, supposing there is a newly observed road segment R′ = r_{i+1} after traversing the route Rₙ, the counts information along the path from the root node N_{R₀} to the node N_{Rₙ} is updated. Let t_{Rₙ·R′} be the number of tables labeled with R′ in the restaurant Rₙ in terms of the CRP representation. Then, the predictive/posterior probability is given by

$$
P(R' \mid R_n, D) =
\begin{cases}
\dfrac{c_{R_n \cdot R'} - d_n\, t_{R_n \cdot R'}}{c_{R_n} + \alpha_n} + \dfrac{d_n\, t_{R_n} + \alpha_n}{c_{R_n} + \alpha_n}\, P(R' \mid R_{n-1}, D) & \text{if } c_{R_n \cdot R'} > 0,\\[6pt]
P(R' \mid R_{n-1}, D) & \text{otherwise}
\end{cases}
\quad (6)
$$

where $t_{R_n} = \sum_{R'} t_{R_n \cdot R'}$. It is important to note that when the parameters αₙ and dₙ are set to 0, (6) reduces to MLE without smoothing [7] (Section II-C). We will empirically compare the prediction accuracies of LCSM vs. MLE in Section IV-C.
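A sketch of evaluating (6) bottom-up along the tree path, assuming each node stores CRP customer counts `c[s]` and table counts `t[s]` per next segment s, and a single (d, α) pair for brevity (the paper lets dₙ, αₙ vary with the context length):

```python
def predictive_prob(path, next_seg, d, alpha, h):
    """Evaluate (6) along path = [N_{R_0}, ..., N_{R_n}], shortest context
    first. `h` is the base probability H(next_seg), e.g. uniform over E."""
    p = h
    for node in path:
        c_all, t_all = sum(node.c.values()), sum(node.t.values())
        c, t = node.c.get(next_seg, 0), node.t.get(next_seg, 0)
        if c > 0:
            p = (c - d * t) / (c_all + alpha) \
                + (d * t_all + alpha) / (c_all + alpha) * p
        # if c == 0, (6) falls through to the shorter-context estimate p
    return p
```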

As the model is updated with each new route, it may incur more space. However, the coagulation and fragmentation properties of the hierarchical Pitman-Yor process [13] make it possible to maintain constant space overhead. Learning can be done in O(|D|ℓ²) time; since the length ℓ of the longest route is constant, the time complexity is linear in the size of the training data D.

IV. EXPERIMENTS AND DISCUSSION

A. Traffic Data

In this section, we show experimental results based on real-world taxi route data in Singapore provided by the Comfort taxi company. The data is collected in the following way: Comfort has over 10,000 taxis equipped with GPS roaming in the city, and the GPS location, velocity, and status information of these taxis are sent to and stored at a central server. The dataset covers the whole month of August 2010. The GPS data is noisy, however, and its sampling frequency is relatively low; it also cannot be used directly since it is in GPS format. Hence, the GPS data must be mapped onto routes of road segments. We apply a map matching technique [14] to preprocess the data and assume that the map matching result is perfect. After preprocessing the GPS data, we obtain a set of taxi routes such that each route’s origin and destination correspond to the pick-up and drop-off locations, respectively. In this paper, we focus mainly on routes associated with the Bugis area. We choose Bugis because of its dense road network and high volume of taxi traffic, which make the route prediction problem even more challenging.

B. Prediction Accuracy

In the first experiment, we evaluate how the prediction accuracy varies with the length of the history of traversed road segments and the length of the future route to be predicted. The size of the training data is 6000 routes. Fig. 2 shows the results averaged over 300 test instances. Fig. 2a reveals that increasing the length of the history of traversed road segments improves the average prediction accuracy. When the history grows beyond length 10, there is no significant improvement; the prediction accuracy stabilizes at around 85%. This implies that the order n of the Markov property can be set to 10 with little or no performance loss. In turn, the maximum depth of the aggregation tree can be limited to 10, which significantly reduces the space overhead. Fig. 2b shows that the prediction accuracy decreases almost linearly in the length of the future route to be predicted; for predicting future routes of length 10, our model can achieve 50% accuracy.

The second experiment evaluates the prediction accuracy with increasing size of training data. Our sequential model is learned incrementally with 100 routes each time, and its resulting prediction accuracy is shown in Fig. 3. It can be observed that the prediction accuracy stabilizes after the model has learned from about 5000 observed routes.

Figure 2: Graphs of average prediction accuracy vs. (a) length of history of traversed road segments and (b) length of future route to be predicted. The latter result is based on using a history of traversed road segments of length 10.

Figure 3: Graph of prediction accuracy vs. size of training data D (i.e., number of observed routes). The result is based on using a history of traversed road segments of length 10.

The third experiment evaluates how the prediction accuracy changes at different times of the day. In the previous experiments, the training data covers the whole day. In this experiment, the training and test data are obtained from 4 different times of the day (specifically, 2 a.m., 8 a.m., 1 p.m., 9 p.m.), each lasting 45 minutes. The results are shown in Fig. 4. It can be observed that the time of day has minimal effect on the prediction accuracy except during 2-2:45 a.m. This time of the day offers higher prediction accuracy when the history of traversed road segments is short (i.e., of length 2 to 4), which may be due to fewer, more easily distinguished routes during this time.

Figure 4: Graphs of prediction accuracy vs. length of history of traversed road segments at different times of the day.

C. Effect of Smoothing

To evaluate the effect of smoothing, the prediction accuracy of LCSM with smoothing is compared to that of MLE without smoothing [7], as described previously in Section II-C. Besides considering the Bugis data, we also experiment with the sparser data taken from the whole of Singapore. The size of the training data for each region is 6000 routes. The results are shown in Fig. 5. Firstly, it can be observed that LCSM with smoothing generally performs better than MLE without smoothing for both the Singapore and Bugis data, with the performance improvement being more significant for the sparse Singapore data; this demonstrates that smoothing achieves a larger performance improvement for sparser data. Secondly, the prediction accuracy is lower for the larger Singapore road network, which is expected due to data sparsity.

Figure 5: Graphs of prediction accuracy vs. length of history of traversed road segments: “SG_LCSM” and “SG_MLE” denote results for data taken from the whole of Singapore, while “Bugis_LCSM” and “Bugis_MLE” denote results for data from the Bugis area.

D. Time and Space Overhead

As mentioned in Section III-D, the learning time is linear in the size of the training data D. Fig. 6a shows that the incurred learning time is linear, and it only takes about 3 seconds to learn from 6000 routes, which is sufficient for achieving a high prediction accuracy, as reported earlier in Section IV-B. The prediction time is linear in the depth of the aggregation tree. Fig. 6b shows that the incurred prediction time stabilizes at about 0.03 seconds after learning from 6000 routes because the depth of the tree has more or less stabilized as well. The space overhead is also very reasonable: after learning from 6000 routes, our sequential model uses about 50 MB of space.

Figure 6: Graphs of time overhead vs. size of training data: (a) learning time and (b) prediction time.

V. PRACTICAL APPLICATIONS

In this section, we demonstrate two practical applications of our route prediction algorithm.

A. Predictive Taxi Ranking

The taxi dispatch system assigns taxis to the callers/passengers waiting at different locations. Efficient taxi dispatching is essential to reducing passengers’ waiting time and, consequently, to greater customer satisfaction.

A key factor that determines the system efficiency is the ranking of the taxis to be dispatched. The state-of-the-art way of ranking [15] is based on the shortest path or duration from the taxi to the passenger, which can be retrieved by providing the GPS locations of the taxi and passenger to the Google Maps API. In practice, such a ranking algorithm often performs poorly: The GPS sampling frequency is low, resulting in a discrepancy between the GPS location of the taxi and its actual location at the time of booking. Furthermore, there is a latency between the receipt of the booking message and the response of the taxi driver to it, during which the taxi would have moved along its original intended route for some distance to a new location. As a result, the taxi may no longer be “closest” to the passenger in terms of path distance or waiting time. This problem is exacerbated when the taxi crosses an intersection or moves into a congested road or expressway during this latency period. If there are many other taxis roaming in the same area, then some may have moved to be even closer to the passenger. This motivates the need to predict the future locations of the taxis using our route prediction model and to exploit them for the ranking instead.
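A sketch of the resulting ranking rule: rank taxis by the path cost from their predicted location after the dispatch latency rather than from their last-seen GPS position. The `predict_location` and `path_cost` callables are stand-ins for our route prediction model and an external routing query; none of these names come from the paper:

```python
def rank_taxis(taxis, passenger, predict_location, path_cost, latency=7.0):
    """Predictive taxi ranking: order taxis by estimated cost from where
    they will likely be after `latency` seconds, not where they were seen."""
    scored = [(path_cost(predict_location(taxi, latency), passenger), taxi)
              for taxi in taxis]
    return [taxi for _, taxi in sorted(scored, key=lambda pair: pair[0])]
```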

A simple experiment is set up to empirically compare the performance of our predictive taxi ranking (PTR) algorithm with that of the state-of-the-art non-predictive taxi ranking (NTR) algorithm [15]: to simulate the taxi dispatch system, we pick 202 taxis roaming in the Bugis area and place 11 passengers at random locations in Bugis calling for taxis. We assume the latency to be between 5 to 10 seconds, and our model predicts the future locations of the taxis after this latency. Table I shows the waiting time of the 11 passengers. With a high prediction accuracy of about 85%, our PTR algorithm achieves shorter waiting time for 5 passengers and the same waiting time as the NTR algorithm for the remaining ones, thus resulting in shorter average waiting time over all passengers. We conjecture that the performance of our PTR algorithm will be even better than that of the NTR algorithm with greater latency and fewer available taxis.

Algo.  P1   P2   P3   P4   P5   P6   P7   P8   P9   P10  P11  Avg.
NTR    86   551  83   68   506  365  156  157  10   57   121  196
PTR    86   551  83   41   424  365  9    31   10   57   33   153

Table I: Waiting time (seconds) of passenger Pi for i = 1, . . . , 11; the last column is the average over all 11 passengers.

B. Route Recommendation

Existing route planning methods such as that in Google Maps consider explicit criteria such as shortest distance or travel time (calculated based on speed limit³). In contrast, our route prediction algorithm recommends maximum-likelihood routes that implicitly account for all the (hidden) criteria (e.g., tolls, time of the day, etc.) affecting the crowds of route planning agents. This is illustrated in Fig. 7.

³http://goo.gl/QEvqW, http://goo.gl/7nHi4

Figure 7: Given origin (Singapore General Hospital) and destination (Old Airport Road Food Center), our algorithm recommends two different routes depending on the time of the day: (a) take Nicoll Highway during 3-6 a.m. due to no electronic road pricing (ERP) fee, and (b) avoid Nicoll Highway during 12 noon-2 p.m. as ERP is in effect.

VI. RELATED WORK

Researchers have developed mobility models to address the route prediction problem. The most naive idea is to do prediction by matching the current route to a set of observed routes using data mining methods [2], but such methods are not capable of predicting a route that has never appeared in the training data. The decision-theoretic work of [16] modeled a driver’s route selection as a Markov Decision Problem (MDP) and employed a maximum entropy approach to imitate the driver’s behavior; it suffered from computational inefficiency due to the MDP solver. The work of [17] adopted the kinematic model and mapped GPS locations to hidden velocity variables. They modeled the moving vehicles as a mixture of Gaussian processes (GP) with a Dirichlet process (DP) prior over the mixture weights. This approach works well but has a granularity problem: the prediction accuracy depends on the size of each disjoint cell/region. Since the approach directly maps each GPS location to a velocity, it cannot deal with predictions involving the same cells/regions, such as U-turns or crossing the same intersection more than once.

The most closely related work to our sequential model is that of Markov models. Recall that our processed data is a set of road segment sequences. Markov models assume that the next road segment depends only on a few preceding road segments, which is called the Markov assumption. [8] trained a hidden Markov model to predict a driver’s future road segments while simultaneously predicting the driver’s destination. Their model was restricted to a single-vehicle scenario. Although they achieved a next-segment prediction accuracy of up to 99% with 46 trips, there was only one possible next road segment to choose from (i.e., perfect prediction) for 95% of road segments. The work of [7] on Markov-based route prediction studied how the order of the Markov model affects the prediction accuracy. However, they did not account for the important issue of data sparsity, which occurs in our real-world data. We show through empirical comparison (Section IV) that our algorithm can resolve the issue of sparsity and achieve better prediction accuracy than that of [7]. Other models are limited by unrealistic model assumptions such as knowing the origins and destinations of vehicle routes [18], one unique destination in the problem [16], and fully observed routes [19].

The issues of real-time prediction and online learning are rarely considered in the literature. With a high-order Markov assumption, the space complexity is exponential and the learning time is quadratic in the length of the input sequence. The learning time for our model is also quadratic, but its space complexity is only linear. Our proposed incremental learning method also differs from those works performing batch learning [7], [17] (i.e., training the model with all the observed data); our model can be updated incrementally/online as and when new data arrives.

VII. CONCLUSION

This paper describes a hierarchical Bayesian nonparametric model to aggregate the sequential decisions from the crowds of route planning agents for performing efficient and scalable route prediction. By not imposing a Markov assumption, our model can achieve a prediction accuracy of around 85% for the Bugis area, which has a dense road network and a high volume of taxi traffic. The results also show that our model with smoothing can be effectively used to improve the prediction accuracy over the widely-used MLE with no smoothing for sparse traffic data. Our model is highly time-efficient in learning and prediction: it only takes about 3 seconds to learn from 6000 taxi routes and 0.03 seconds to perform a prediction. In terms of practical applications, our model can be used in predictive taxi ranking to achieve shorter passenger waiting times over many test instances as compared to the state-of-the-art non-predictive approach. It can also be used to recommend routes that implicitly account for the criteria (e.g., tolls) affecting the crowds of route planning agents.

ACKNOWLEDGMENTS

This work was supported by the Singapore-MIT Alliance for Research and Technology (SMART) Subaward Agreements No. 14 R-252-000-466-592 and No. 28 R-252-000-502-592.

REFERENCES

[1] J. Chen, K. H. Low, C. K.-Y. Tan, A. Oran, P. Jaillet, J. M. Dolan, and G. S. Sukhatme, “Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena,” in Proc. UAI, 2012, pp. 163–173.

[2] J. Froehlich and J. Krumm, “Route prediction from trip observations,” in Proc. Society of Automotive Engineers (SAE) 2008 World Congress, 2008.

[3] K. Zheng, Y. Zheng, X. Xie, and X. Zhou, “Reducing uncertainty of low-sampling-rate trajectories,” in Proc. ICDE, 2012.

[4] J. Surowiecki, The Wisdom of Crowds. Knopf Doubleday Publishing Group, 2005.

[5] S. Ertekin, H. Hirsh, and C. Rudin, “Learning to predict the wisdom of crowds,” in Proc. Collective Intelligence, 2012.

[6] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy, “Learning from crowds,” JMLR, vol. 11, pp. 1297–1322, 2010.

[7] J. Krumm, “A Markov model for driver turn prediction,” in Proc. Society of Automotive Engineers (SAE) 2008 World Congress, 2008.

[8] R. Simmons, B. Browning, Y. Zhang, and V. Sadekar, “Learning to predict driver route and destination intent,” in Proc. IEEE ITSC, 2006, pp. 127–132.

[9] J. Goodman and S. Chen, “An empirical study of smoothing techniques for language modeling,” Computer Speech and Language, vol. 13, no. 4, pp. 359–393, 1999.

[10] M. Farach, “Optimal suffix tree construction with large alphabets,” in Proc. FOCS, 1997, pp. 137–143.

[11] Y. W. Teh and M. I. Jordan, “Hierarchical Bayesian nonparametric models with applications,” in Bayesian Nonparametrics. Cambridge Univ. Press, 2010.

[12] J. Gasthaus, F. Wood, and Y. W. Teh, “Lossless compression based on the sequence memoizer,” in Proc. Data Compression Conference, 2010, pp. 337–345.

[13] F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh, “A stochastic memoizer for sequence data,” in Proc. ICML, 2009, pp. 1129–1136.

[14] P. Newson and J. Krumm, “Hidden Markov map matching through noise and sparseness,” in Proc. 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2009, pp. 336–343.

[15] D. Lee, H. Wang, R. L. Cheu, and S. H. Teo, “A taxi dispatch system based on current demands and real-time traffic information,” Transportation Research Record, vol. 1882, pp. 193–200, 2004.

[16] B. Ziebart, A. Maas, J. Bagnell, and A. Dey, “Maximum entropy inverse reinforcement learning,” in Proc. AAAI, 2008, pp. 1433–1438.

[17] J. Joseph, F. Doshi-Velez, and N. Roy, “A Bayesian nonparametric approach to modeling mobility patterns,” in Proc. AAAI, 2010.

[18] A. Karbassi and M. Barth, “Vehicle route prediction and time of arrival estimation techniques for improved transportation system management,” in Proc. IEEE Intelligent Vehicles Symposium, 2003, pp. 511–516.

[19] D. Ashbrook and T. Starner, “Using GPS to learn significant locations and predict movement across multiple users,” Personal and Ubiquitous Computing, vol. 7, no. 5, pp. 275–286, 2003.

