b tech project thesis

7/22/2019 B Tech Project Thesis


DATA ANALYTICS BASED

DYNAMIC PASSENGER INFORMATION SYSTEM

A Project Report

submitted by

RAKESH BEHERA

in partial fulfilment of the requirements

for the award of the degree of

BACHELOR OF TECHNOLOGY

TRANSPORTATION DIVISION

DEPARTMENT OF CIVIL ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY MADRAS

MAY 2014



CERTIFICATE

This is to certify that the project report titled Data Analytics Based Dynamic Passenger Information System, submitted by Rakesh Behera, to the Indian Institute of Technology, Madras, for the award of the degree of Bachelor of Technology, is a bona fide record of the research work done by him under my supervision. The contents of this report, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma.

Dr. Lelitha Devi V.

Project Guide

Associate Professor

Dept. of Civil Engineering

IIT-Madras, 600 036

Prof. Meher Prasad A.

Head of the Department

Professor

Dept. of Civil Engineering

IIT-Madras, 600 036

Place: Chennai

Date: 19th May 2014


ACKNOWLEDGEMENTS

My earnest thanks to Dr. Lelitha Devi for her support throughout the study. It is through her guidance that the project has gained structure and been accomplished in such a short span of time. Her foresight and expertise have helped us make the right choices in the project and otherwise. I am thoroughly indebted to her for the amount of time she has spent in reviewing my analyses and reports. I thank her for her belief in my potential in carrying out the tasks involved. I consider it a privilege to have worked under her guidance.

I also owe my gratitude to Dr. Shankar Ram C. S. for his valuable inputs. His contribution could not have been substituted by anyone else. I also thank Dr. J. Murali Krishnan for the constant support and encouragement that he has provided me throughout my academic life at IITM. I take this opportunity to thank Akhilesh, Krishna, Siddharth and Anil for the help offered by them in data acquisition and the development of the online version of the framework. I would also like to acknowledge all the other project staff and students at the Centre of Excellence in Urban Transportation, IIT Madras.

Friends have been an integral part of my stay here at IIT Madras. Life at IITM cannot be complete without them. I thank all my friends and wing mates for making my stay here at IIT Madras a memorable one.

Finally, I would like to thank my parents and my younger brothers for their enduring support and unconditional love, without which this project would not have been possible.


ABSTRACT

KEYWORDS: Travel Time Prediction, Historical Trajectory Search, Kalman Filter, V-clustering.

The present study developed a reliable system for real-time bus arrival/travel time prediction under the heterogeneous traffic conditions that exist in India. The study is different from (and more challenging than) most of the previous studies, which involved homogeneous traffic conditions. To accomplish this goal, a robust framework, namely Historical Trajectory and Kalman Filter based Travel/Arrival Time Prediction (HTKFTP), is proposed in this study. The proposed framework has two major components: (i) similar trajectory search; and (ii) travel time prediction using similar trajectories. Through the data analysis performed, travel time correlations (between spatially close stretches of road) and other temporal patterns in travel times were identified, which were used for the development of various schemes for the selection of historical trajectories. The prediction algorithm based on the Kalman Filter was also improved to account for the high variance in travel times at certain locations or during certain times of the day. The proposed schemes were corroborated using real-world GPS trajectory data collected from the Metropolitan Transport Corporation (MTC) buses in Chennai.


TABLE OF CONTENTS

CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
NOTATION

1 INTRODUCTION
  1.1 Motivation
  1.2 Background
  1.3 Research Overview
  1.4 Chapter Outline

2 LITERATURE REVIEW
  2.1 A Brief History of Traffic Prediction
  2.2 Approaches Exploiting "Similarity"
  2.3 Trajectory Patterns

3 DATA ANALYSIS
  3.1 Raw Data
  3.2 Data Transformation
    3.2.1 Data Cleaning
    3.2.2 Extracting Trip Data
    3.2.3 Calculation of Segment-wise Travel Times
  3.3 Correlation Between Segments
  3.4 Travel Time Patterns

4 THE FRAMEWORK AND THE CLUSTERING ALGORITHM
  4.1 Terms and Definitions
  4.2 Problem Formulation
  4.3 Overview of the Framework
  4.4 Trajectory Search based on Passed Segments Scheme
  4.5 The Clustering Algorithm
  4.6 Nearest Neighbour Search in Passed Segments Scheme
  4.7 Similarity based on Temporal Features
  4.8 Summary

5 THE PREDICTION ALGORITHM
  5.1 Travel Time Prediction using Kalman Filter
  5.2 The Base KF Algorithm
  5.3 Integration of Trajectory Search and Prediction Algorithms
  5.4 Modifications

6 PERFORMANCE EVALUATION
  6.1 Measures of Performance
  6.2 Parameter Optimization in Passed Segment Scheme
    6.2.1 Spatial lag
    6.2.2 Minimum Number of Trajectories (MNT) in a Cluster
  6.3 Evaluation of the PS scheme
  6.4 Evaluation of the Weekday/Weekend Temporal Feature
  6.5 Evaluation of the Temporal Neighbourhood Feature
  6.6 Evaluation of the base KF Algorithm for Prediction
  6.7 Evaluation of the Adaptive KF Algorithm
  6.8 Evaluation Summary

7 SUMMARY AND CONCLUSIONS
  7.1 Summary
  7.2 Conclusions
  7.3 Scope for Further Research

A PYTHON CODE LISTING FOR CLUSTERING ALGORITHM
  A.1 Method for creating clusters from similar trips
  A.2 Auxiliary method for finding optimum splits in the clustering algorithm
  A.3 Method for finding nearest neighbours from clusters



LIST OF TABLES

3.1 A sample of the raw data received from the GPS devices on the buses.
3.2 A sample of data records after transformation.
4.1 An example of segment-wise travel times on historical trajectories.
4.2 An example of partitioned segment-wise travel times after application of the clustering algorithm.


LIST OF FIGURES

3.1 Pearson's correlation coefficients versus the segment distance.
3.2 Average correlation coefficient versus the segment distance.
3.3 Travel time analysis by hours of the day.
3.4 Comparison between weekday peak and weekday off-peak trips.
3.5 Correlations between travel times occurring in different hours of the day.
3.6 Comparison between weekday and weekend trips.
3.7 Comparison between the weekdays.
4.1 Overall architecture of the HTKFTP framework.
5.1 Variation of travel time variance across the segments of 19B route.
6.1 Optimum values of parameters involved in the clustering algorithm.
6.2 Comparison of MAE for individual test trips before and after adding the PS scheme to the naive method.
6.3 Comparison of MAE for individual test trips before and after adding the weekday/weekend feature to the PS scheme.
6.4 Comparison of MAE for individual test trips before and after adding the temporal neighbourhood feature.
6.5 Comparison of MAE for individual test trips before and after using the base KF algorithm for prediction.
6.6 Comparison of MAE for individual test trips before and after using the Adaptive KF algorithm.
6.7 Improvement of the mean MAE (over all the test trips) throughout the evolution of the method.
6.8 Comparison between HTKFTP and the prediction method using static inputs in KF.


ABBREVIATIONS

AI Artificial Intelligence

ANN Artificial neural networks

DTW Dynamic Time Warping

ED Euclidean Distance

GPS Global Positioning System

HTD Historical Trajectory Database

HTKFTP Historical Trajectory and Kalman Filter based Travel/Arrival Time Prediction

HTTP Historical Trajectory based Travel time Prediction

KF Kalman Filtering

k-NN k-Nearest Neighbors

LCSS Longest Common Subsequence

MAE Mean Absolute Error

MAPE Mean Absolute Percentage Error

MLR Multivariate Linear Regression

MTC Metropolitan Transport Corporation (Chennai)

NNS Nearest Neighbour Search

RBMS Real-time Bus Status Monitoring

SARIMA Seasonal Autoregressive Integrated Moving Average

SVR Support Vector Regression

TTP Travel Time Prediction


NOTATION

Correlation coefficient between two variables

$R_{raw}$  A raw route represented as a sequence of points, $p_i$s

$S_i$  A segment of road between two points, $p_i$ and $p_{i+1}$

$R$  A raw route represented as a sequence of segments, $S_i$s

$B_i$  The $i$th bus stop on a route

$t_i$  Time taken to reach a point $p_i$ on a route, starting from $p_0$

$T_{raw}$  A raw trajectory represented as a sequence of pairs of the form $(p_i, t_i)$

$t_i$  Actual time taken to cover a segment $S_i$

$t_i$  Predicted travel time on $S_i$

$AB_i$  Actual arrival time of the bus at the bus stop $B_i$

$AB_{ij}$  The $j$th predicted arrival time at the bus stop $B_i$

$T$  A trajectory represented as a sequence of $t_i$s

$ST_i$  List of historical travel times on segment $S_i$

$SC_i$  List of clusters or intervals for $S_i$

$CS_{ii}$  The $i$th cluster for $S_i$

$T_{curr}$  The current (or incomplete or test) trajectory. Also denoted as $T_{test}$

$wav_i$  The weighted average variance for a split at the $i$th element of a list

$a_i$  Travel time evolution factor from $S_i$ to $S_{i+1}$

$w_i$  Process disturbance in travel time evolution at $S_i$

$z_i$  Measured travel time on $S_i$

$v_i$  Measurement noise associated with $S_i$

$Q_i$  Variance of the historical $w_i$s for $S_i$

$R_i$  Variance of the historical $v_i$s for $S_i$


CHAPTER 1

INTRODUCTION

1.1 Motivation

With the ever-increasing number of vehicles on roads in urban areas, traffic congestion has become one of the most serious problems facing society, especially commuters. In India, the problem is more prominent in metropolitan cities such as Mumbai, New Delhi and Chennai. One of the reasons people are shifting to private transportation is the unreliability of the public transportation systems (Bende, 2012). Holeywell (2013) points out that travellers care most about getting picked up from their stop in 10 minutes or less so that they can make their scheduled connections, and that they are less interested in whether their rides are crowded or whether they can find a seat.

In today's busy society, information regarding the arrival time or travel time of transport from one place to another is becoming more and more valuable. With a schedule of predicted arrival times at each bus stop available via variable message signs (VMS) or as a mobile or web application, people can make timely plans for their upcoming activities and business, which will reduce the anxiety caused by uncertain delays. Thus, there is a need for a system that can inform travellers about the latest travel times of the concerned buses before they make their transit plans. This may also attract more passengers to public transport, which in turn can lead to less traffic congestion.

1.2 Background

Accurate estimation of travel times of public transportation is a challenging research problem that has remained open in the transportation research community for the past thirty years (Abkowitz, 1981; Polus, 1978). A simple prediction approach is to adopt the average travel time derived from historical data. However, a constant estimate of the travel time for a path does not capture dynamic traffic conditions well. Thus, advanced techniques for travel time estimation were proposed in the early literature (Ghosh and Knapp, 1978; Oda, 1990; Nihan and Holmesland, 1980). Even though the specific approaches adopted in these studies are different, they share a common idea, i.e., discovering certain regular patterns from the historical data collected over time. Some proposed to fit historical data to statistical models such as Gaussian models, Bayesian networks and Markov chains in order to facilitate statistical analysis (Polus, 1978; Sumi et al., 1990). Techniques based on regression models learn from historical data. They involve building regression functions for estimating travel time in terms of various external factors (Polus, 1979; Ghosh and Knapp, 1978). A prediction is made by using the known values of those factors under the current situation as input. Techniques based on time series models focus on discovering internal relationships among historical time-series data in order to identify similar patterns and make predictions under the current situation (Oda, 1990; Nihan and Holmesland, 1980). However, the performance of the above approaches is highly constrained by the quality and quantity, as well as the types, of data available. For example, conventional collection of traffic data is typically conducted by surveys or by using expensive sensors deployed along the roads at specific locations to record arrival times, traffic flow volumes, and other vehicle statistics.

In recent years, with the advent of positioning and wireless communication technologies, wireless devices equipped with the Global Positioning System (GPS) have been widely deployed on various private and public vehicles, generating massive amounts of vehicle trajectory data which can be used for fleet management and other transportation applications. Time-tagged location data, usually represented in the form of trajectories, bring great potential for real-time prediction of vehicle travel times. Among public transportation systems, the travel times of buses, which share roads with other vehicles, are more difficult to predict than those of trains and subways, which run on exclusive paths. First, the travel condition of a bus may easily be affected by various internal and external factors, including accidents, weather, road construction, government policies and even temperature. Second, for vehicles in metropolitan areas (such as Chennai), errors often exist in positional data acquisition due to interference from urban canopies and other sources. Thus, in this study, we propose a hybrid prediction framework to estimate the travel time of buses by exploiting selected historical trajectory data and an efficient state estimation technique capable of making precise estimations from a series of travel time measurements.


1.3 Research Overview

Recently, research on discovering traffic patterns from historical data collected from vehicles has received significant attention (Chen et al., 2011; Li and Rose, 2011; Tiesyte and Jensen, 2009). These works show that traffic patterns exist in road segments and thus could be used to predict the future traffic condition on the same segment and on a few upcoming segments. This finding provides the basis for using similar trajectories to predict the travel time of an ongoing bus journey. In this study, a new bus travel time prediction framework, called Historical Trajectory and Kalman Filter based Travel/Arrival Time Prediction (HTKFTP), is proposed for real-time prediction of travel times on the upcoming segments (and thus the arrival times at bus stops) of an ongoing bus journey. The basic idea behind HTKFTP is to use a collection of historical trajectories similar to the current bus journey to predict the travel times on its future segments. Specifically, the HTKFTP framework (i) identifies a set of similar trajectories as the basis for travel time estimation instead of relying on only the one historical trajectory best matching the ongoing bus journey; (ii) explores different features (e.g., travel times of passed segments as well as time/day of the bus trajectories) to identify the sample set of similar trajectories; and (iii) uses the similar trajectories as inputs to the Kalman Filter based prediction method.

Several issues were faced in the design of the HTKFTP framework. For example, many features are associated with the trajectories. Some of these features are categorical while the others are numerical. Discriminative features, and properly defined similarity functions for those features, had to be used in order to identify a sample set of similar trajectories effective for travel time prediction. To determine a set of similar trajectories based on travel times on passed segments, the V-clustering algorithm, which partitions the whole spectrum of travel times on a segment into a number of intervals (or clusters), was considered. To determine a set of similar trajectories based on hours/days, exploratory data analysis involving space-time trajectory plots of the historical trips was carried out. Accordingly, the HTKFTP framework is able to retrieve the sample set of similar trajectories efficiently and in turn use that sample set to estimate the travel times. To corroborate the proposed ideas and evaluate the proposed prediction schemes, an empirical study using real bus trajectory data collected in Chennai, India, was conducted. This research work has made a number of significant contributions, as summarized below.


• A new framework, namely HTKFTP, for predicting the travel times over future segments of an ongoing bus journey based on historical trajectory data. The framework consists of two major components: (i) similar trajectory retrieval; and (ii) travel time estimation.

• A detailed data analysis to investigate the correlation between bus travel times in route segments and a number of trajectory features, e.g., passed segment travel time, hours, days, etc. Based on this analysis, a number of trajectory features are selected to identify similar trajectories.

• A clustering algorithm for passed segment travel times and a space-time trajectory analysis in order to group similar trajectories together. These similar trajectory clusters allow a sample set of trajectories similar to the ongoing bus trajectory to be retrieved efficiently and effectively.

• An efficient state estimation technique based on the Kalman Filter, capable of making precise estimations by exploiting a series of travel time measurements in an inherent feedback mechanism. The base estimation scheme was modified to take into account the large variance in the data observed at selected locations/times.

Through a comprehensive experimental study using a real data set collected from buses in Chennai, India, the proposed ideas were validated. The framework was evaluated in terms of prediction accuracy. The experimental results show that the proposed prediction scheme significantly outperforms the baseline and state-of-the-art schemes.
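The interval partitioning idea behind the V-clustering algorithm mentioned in this section can be illustrated with a minimal Python sketch. It is not the listing given in Appendix A. It assumes that a split point is chosen by minimising the weighted average variance of the two resulting groups, and the function names, the minimum cluster size `min_size` and the stopping threshold `min_gain` are hypothetical choices made only for illustration.

```python
from statistics import pvariance


def weighted_avg_variance(values, i):
    """Weighted average variance of the two groups values[:i] and values[i:]."""
    left, right = values[:i], values[i:]
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(values)


def best_split(values, min_size):
    """Return (index, wav) of the split with the lowest weighted average
    variance, or None when no split leaves min_size items on both sides."""
    candidates = range(min_size, len(values) - min_size + 1)
    scored = [(weighted_avg_variance(values, i), i) for i in candidates]
    if not scored:
        return None
    wav, i = min(scored)
    return i, wav


def v_cluster(times, min_size=2, min_gain=0.5):
    """Recursively partition travel times on one segment into intervals.

    Assumed stopping rule (illustrative only): stop when the best split
    removes less than min_gain of the variance of the current group.
    """
    values = sorted(times)
    split = best_split(values, min_size)
    if split is None:
        return [values]
    i, wav = split
    total = pvariance(values)
    if total == 0 or (total - wav) / total < min_gain:
        return [values]
    return (v_cluster(values[:i], min_size, min_gain)
            + v_cluster(values[i:], min_size, min_gain))


# Hypothetical travel times (in seconds) on a single segment.
clusters = v_cluster([60, 62, 61, 120, 118, 122])
```

With these hypothetical inputs the free-flow and congested travel times fall into two separate intervals, so an ongoing trip can be matched only against historical trips from the interval containing its own travel time on that segment.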

1.4 Chapter Outline

The remainder of this report is organized as follows:

• Chapter 2 reviews some literature in the concerned area.

• Chapter 3 analyses the collected historical trajectory data.

• Chapter 4 gives an overview of the HTKFTP framework and discusses the similar trajectory selection algorithm, detailing its design.

• Chapter 5 details the prediction scheme and the real-time prediction system design.

• Chapter 6 reports a comprehensive experimental study using the collected real data set of bus trajectories, queried in real time.

• Finally, Chapter 7 concludes this work with a summary, followed by conclusions and the scope for future work.


CHAPTER 2

LITERATURE REVIEW

This chapter reviews the past research that has fuelled our motivation in predicting the movement of vehicles. We begin with a brief history of traffic prediction, and then review the major research that has focused specifically on similarity-based prediction of arrival/travel times and on trajectory patterns.

2.1 A Brief History of Traffic Prediction

Research in transportation dates back to the 1930s. With few vehicles on roads and under-developed technologies, it was then almost impossible to collect significant data about traffic conditions. Thus, studies during that time were mainly about identifying certain rules that could be used to guide traffic management and the construction of transportation infrastructure. For example, the relations between traffic volumes and the weather were reported by Johnson (1930). The findings justified the improvement of road surfaces during bad weather. As another typical example, Vey and Pope (1935) verified a definite relationship between highway lighting and highway accidents: in general, where adequate lighting is provided, there is a substantial reduction in night accidents.

With the development of technologies and the increasing number of vehicles on roads, more data about traffic conditions could be collected, leading to the emergence of research on traffic prediction in the 1950s. However, during this period, the traffic data adopted in most cases were vehicle volumes (or flows) because they were easily collected by hand. For example, in Glanville (1955), Lighthill and Whitham (1955) and Buckley (1968), to obtain the vehicle volume on a road, observers were placed at certain locations to record the number of vehicles that passed by. Such an approach was inefficient and made it difficult to collect a large amount of data. Therefore, arrival/travel time prediction did not arise until the 1970s (Wong and Sussman, 1973; Sussman et al., 1974), when traffic sensors were widely adopted, giving researchers sufficient data for analysis.



Estimation of arrival/travel times, especially for buses, has attracted increasing attention since the 1980s (Abkowitz, 1981; Polus, 1978; Sumi et al., 1990). As cities developed, congestion became increasingly common, which created a need to improve the quality of public transportation service. As the most important aspect of public transportation service, arrival/travel time prediction became the most critical topic in the traffic prediction area. At the early stage of research on this topic, constrained by technology, researchers had to work off-line on data collected from traffic sensors and surveys. With the development of GPS devices and wireless technologies, it has become possible to collect large volumes of traffic-related data in real time. Therefore, real-time arrival/travel time prediction has become a hot topic as these technologies have been widely applied in public transportation systems. Over the decades, researchers have applied different models and methods to real-time arrival/travel time prediction. Zhu et al. (2011) developed mathematical models taking into account the travel times on links, dwell times at stops, and delays at intersections. The algorithm proposed in Lin and Zeng (2001) provides real-time bus arrival information based on the bus location data, the schedule information, the difference between scheduled and actual arrival times, and the waiting time at time-check stops. Prediction methods based on historical data were also developed in Tiesyte and Jensen (2008).

With the development of Artificial Intelligence (AI), researchers have widely adopted AI methods in real-time arrival/travel time prediction. As a result, travel time prediction approaches in the modern literature can be broadly classified as model based and data driven. Model based approaches predict travel times using traffic flow models and the underlying physical phenomena. For example, Krishnan and Polak (2008) explored recurring themes in traffic conditions and used k-Nearest Neighbors (k-NN) for indirectly predicting short-term travel times using 15-minute aggregate flow data. Esawey and Sayed (2011) used a micro-simulation model of downtown Vancouver built in VISSIM (a microscopic traffic flow simulation package) to predict travel times using traffic volume and travel time data of nearby segments. Kalman Filtering (KF) is one of the most widely adopted methods in travel time prediction in the recent literature. Vanajakshi et al. (2009) used a KF based method for predicting segment-wise travel times (using the travel times of the previous two vehicles) in the heterogeneous traffic conditions prevalent in Indian cities such as Chennai. KF takes into account the stochastic properties of the process disturbance and the measurement noise. It works well for short-term prediction. Other notable works on KF include Xu et al. (2008), Shalaby et al. (2004) and Zhu and Wang (2000). Westgate et al. (2013) used a Bayesian model for travel time estimation of ambulances using GPS data.
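The recursion that a scalar Kalman filter applies to segment-wise travel times can be sketched as follows. The sketch assumes a state model of the form x[i+1] = a[i] * x[i] + w[i] with a direct measurement z = x + v, where q and r hold the variances of the process disturbance and the measurement noise. The function name and argument layout are hypothetical, and the code does not reproduce the implementation of any of the works cited above.

```python
def scalar_kalman(a, z, q, r, x0, p0):
    """Run a scalar Kalman filter over consecutive segments.

    a[k]: assumed travel time evolution factor into the next segment
    z[k]: measured travel time on that next segment
    q[k]: variance of the process disturbance
    r[k]: variance of the measurement noise
    x0, p0: initial travel time estimate and its error variance
    Returns (a priori predictions, a posteriori estimates).
    """
    x_post, p_post = x0, p0
    priors, posteriors = [], []
    for a_k, z_k, q_k, r_k in zip(a, z, q, r):
        # Time update: propagate the estimate to the next segment.
        x_prior = a_k * x_post
        p_prior = a_k * a_k * p_post + q_k
        # Measurement update: weigh the prediction against the measurement.
        gain = p_prior / (p_prior + r_k)
        x_post = x_prior + gain * (z_k - x_prior)
        p_post = (1.0 - gain) * p_prior
        priors.append(x_prior)
        posteriors.append(x_post)
    return priors, posteriors


# Hypothetical inputs: two segments, travel times in seconds.
predicted, corrected = scalar_kalman(
    a=[1.0, 2.0], z=[10.0, 20.0], q=[0.0, 0.0], r=[1.0, 0.0], x0=8.0, p0=1.0
)
```

The a priori value is what would be shown to passengers before the bus covers a segment. The gain then decides how strongly the measurement corrects that value, which is the error feedback that makes the method suitable for short-term prediction.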

Data driven approaches predict travel time using statistical relationships derived from historical data (travel times, speeds, volumes, etc.). The most commonly reported data driven approaches in the literature include machine learning techniques, time series analysis and historical averaging approaches. In machine learning techniques, the prediction model learns some properties from several instances of historical data. For example, Patnaik et al. (2004) used a machine learning technique called multivariate linear regression for bus arrival time estimation using automatic passenger counter (APC) data. Artificial neural networks (ANNs) are another widely used method. Liu et al. (2009) used neural networks to indirectly predict travel times using traffic volume and flow data. ANNs have the major advantage that they can model complex non-linear relationships. However, they are limited by extremely long training times. Other notable works using ANNs include van Lint (2006), Zou et al. (2008) and Batool and Khan (2005). Other machine learning methods have also become popular in recent years. Real-time prediction using Support Vector Regression (SVR) and Support Vector Machines (SVM) has become a hot topic recently. For example, Wu et al. (2004) used SVR for travel time prediction using highway traffic data. Vanajakshi and Rilett (2007), Vanajakshi and Rilett (2004) and Yu et al. (2006) are other instances of the use of SVR. Similar to ANNs, SVR is too expensive to train for real-time updates. In a time series analysis approach, temporal patterns are identified in the historical data and future values are predicted with the assumption that these patterns hold in the near future. For example, Guin (2006) used a time series analysis approach called seasonal autoregressive integrated moving average (SARIMA) to predict travel times using historical travel time data.

Though the model based approaches provide valuable insights into the mechanisms of traffic flow and queue dynamics, their inherent limitations hinder their application in real-time systems. The major disadvantages include high computational complexity, intensive model/parameter calibration, the requirement for predicting traffic demand/capacity and the degree of expertise required for design and maintenance. On the other hand, data driven approaches can be deployed more quickly and cheaply than model based approaches. They can provide scope for prediction when there is a large diversity (or variance) in the historical data. In such cases, predicting using physical models, which are narrow in scope, can be expensive. In this study, a data driven approach is chosen for travel time prediction exploiting similar historical trajectories, as explained in the following sections.

2.2 Approaches Exploiting "Similarity"

Since bus trips repeat on the same route, more or less around the same time, on different days, a similarity-based approach is a straightforward way to predict future travel times. A great amount of work has been done on identifying similar trajectories or similar time series, in both one dimension and multiple dimensions. Yi and Faloutsos (2000) proposed the Lp-norm to compute the Manhattan Distance or Euclidean Distance as a measure of similarity. The Lp-norm is widely applied in various applications but is only applicable to time series of the same length. Therefore, other similarity measures have been developed and adopted. Berndt and Clifford (1994) introduced Dynamic Time Warping (DTW), which was adopted later in Assent et al. (2009) and Vlachos et al. (2006). The concept of edit distance was introduced in Levenshtein (1966), and the most widely used distance based on edit distance is the Longest Common Subsequence (LCSS) distance. Vlachos et al. (2002), Fashandi and Moghaddam (2005) and Hermes et al. (2009) applied LCSS as the distance measure to fetch similar trajectories or time series. However, these algorithms tend to emphasize the overall similarity of the whole trajectory, without considering the similarity of trajectories in individual segments or subsets of segments. Additionally, while LCSS and DTW are applicable to trajectory data, they are highly sensitive to noise and errors. In this project, a similarity measure for trajectories based on the similarity of corresponding individual segments is proposed.
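The difference between the Lp-norm and DTW discussed above can be made concrete with a short Python sketch. It is a textbook formulation with hypothetical function names, given only for illustration, and it is not taken from any of the cited works. The Lp distance requires both series to have the same length, whereas DTW aligns series of different lengths by allowing elements to be repeated.

```python
def lp_distance(a, b, p=2):
    """Lp-norm distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    if len(a) != len(b):
        raise ValueError("Lp distance needs series of equal length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def dtw_distance(a, b):
    """Dynamic Time Warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of a
    # with the first j items of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # repeat b[j-1]
                cost[i][j - 1],      # repeat a[i-1]
                cost[i - 1][j - 1],  # advance both series
            )
    return cost[n][m]


# Hypothetical series of different lengths: only DTW can compare them.
warped = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

Both measures return a single score for the whole series, which is the limitation noted above: two trips can be close overall while differing strongly on the individual segments that matter for the next prediction.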

Recently, prediction methods based on historical trajectory data have also been developed

inJensen and Tie(2008),Tiesyte. and Jensen(2009) andTiesyte and Jensen(2008). The

authors show that the similarity between historical trajectories and current position data

of a bus can be exploited to predict bus arrival time at bus stations. This shares the same

intuition with the development of the Historical Trajectory based Travel time Prediction

(HTTP) framework inLee et al.(2012). The present project can be said to be built on top

of HTTP with the inclusion of additional features for the selection of similar trajectories

and the use of Kalman Filter for prediction. In Jensen and Tie (2008), the authors developed

a system called TransDB that searches the historical trajectory database for the most

similar trajectory based on the passed segments of the current bus trajectory in order to

make a good prediction. The basic idea is that, based on the proposed trajectory similarity

function, the nearest neighbourhood trajectory (NNT) and the trajectory of the current bus ride

are anticipated to exhibit similar travelling behaviour (in terms of travel time). Based on

this assumption, the NNT serves as a good basis for predicting the future travel time of

current bus ride without explicitly taking into account various external and internal factors.

However, Lee et al. (2012) argue that the historical trajectory that is most similar to the

passed segments of the current bus trajectory alone may not provide the best prediction of

the on-going bus ride. Thus, they collect a set of similar trajectories and adopt a statisti-

cal approach to make predictions. Additionally, they exploit different features associated

with trajectories and develop different similarity functions to find similar trajectories that

make significantly more accurate travel time predictions. Our approach differs from HTTP

in that it is not a statistical approach. From the data analysis studies as explained in

Chapter 4, it was found that, though the statistical approach provides satisfying predictions

for a few upcoming segments, the predictions get worse if they were made for a time far

into the future (future segments far from the current one). It was observed that the historical

trajectories within a temporal neighbourhood of 30 minutes around the ongoing trajectory,

are more significant in improving the prediction accuracy for farther future segments. Addi-

tionally, with the error feedback mechanism inherent in the KF based prediction algorithm,

the accuracy tends to improve from one future segment to the next one during prediction.

2.3 Trajectory Patterns

Patterns of historical trajectories are described in two classes: trend and periodicity. The

trend represents a general systematic linear or non-linear component that changes over

time and does not repeat or at least does not repeat within the time range captured by data.

The periodicity represents the component that repeats itself in certain intervals of time. In

Wu et al. (2003), the authors display the daily periodicity from historical data of travel

times around the same location. Zhu et al. (2009) also conducted an analysis to verify

the existence of periodicity of speeds over time on a route segment. Chen et al. (2011)

and Li and Rose (2011) verify the pattern by measuring the correlation between the traffic

on a specific route in different time periods. Vanajakshi et al. (2009) analysed travel time

variation plots in heterogeneous traffic conditions, using GPS trajectory data from the buses

in Chennai, India. From their analysis, they concluded that the travel time patterns were

more related for consecutive vehicles (with a headway of 15 min) on the same day. Weekly
and daily patterns were not as significant as the above one. Hence, they used the travel times
of the previous two vehicles for prediction. Similarly, Kumar and Vanajakshi (2012) used
a statistical test to check whether the previous trips on the same day, the previous day(s)'
same-time trips or the previous week(s)' same-day/same-time trips are significant in predicting
the travel times of an ongoing trip. The authors concluded that the previous two weeks'
same-day/same-time trips and the previous three trips on the same day were significant and
could be included as inputs in the prediction model developed using a simple exponential
smoothing technique.

It is clear from the above studies that the travel time behaviour of vehicles moving

on fixed routes is not random. There exist significant patterns in travel times for trips made

around the same time of the day. Such patterns support the possibility of using historical

data of a certain segment to predict the future traffic condition on the same segment.

This forms the basis for the development of the HTKFTP framework introduced in Chapter

1. The next chapter discusses the various kinds of analyses carried out on real-world bus

trajectory data to explore possible patterns in travel times.


CHAPTER 3

DATA ANALYSIS

Several analyses were carried out to explore the correlations and patterns in the historical
trajectory data comprising segment-wise travel times. Our goal in data analysis is two-fold:

(i) To verify the suitability of using historical trajectories for prediction of future travel
times for an ongoing trajectory; and

(ii) To explore any patterns in travel time data that can be used for the prediction.

3.1 Raw Data

The raw GPS data used in this study were collected over a period of 4 months, from January
2014 to April 2014, from the Metropolitan Transport Corporation (MTC) buses running on
one of the busiest routes in Chennai, namely 19B, which connects Kelambakkam in the south
to Saidapet in central Chennai. Each bus is equipped with a GPS device that records the
status of the bus along with its movement and pushes the status to a central server every 10
seconds. Each data point consists of the GPS coordinates and the corresponding time-stamp,
as shown in Table 3.1. Each bus and each route has its own identification. Location
details of each bus stop on a selected route are collected and stored. Each bus station has its
own name, GPS coordinates and the IDs of the routes it belongs to, so that all the
bus stations for each route can be found. For a specific route, a bus station has a sequence
number among all bus stations belonging to this route. In most cases, more than
one bus travels on a route. Each bus travels on a fixed route several times in a day.

Taking the data of the four months from January to April 2014, there were a total of
28 buses on the 19B route, which completed 3,686 trajectories running back and forth.
The north-bound 19B route, with ID 1101, is chosen for analysis. This route has 15 stops,
with the origin at the Kelambakkam Bus Station and the last stop at Saidapet Bus Depot. It
covers a distance of 29.4 kilometres (i.e. 147 segments) and the average trip duration from
the origin to the destination is about 4000 seconds. From January to April 2014, there are
a total of 2,212 north-bound trajectories on this route.

Table 3.1: A sample of the raw data received from the GPS devices on the buses.

Timestamp             Longitude    Latitude
04-Apr-14 09:28:15    80.242317    13.005729
04-Apr-14 09:28:25    80.242317    13.005729
04-Apr-14 09:28:35    80.242241    13.005681
04-Apr-14 09:28:45    80.241928    13.005391
04-Apr-14 09:28:55    80.241828    13.004879

3.2 Data Transformation

From the raw data of time-stamp and latitude/longitude, other useful quantities such as

distance, cumulative distance, UNIX time, time difference and speed were calculated as

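explained below. Distance (assuming straight line travel) between two consecutive GPS
locations of a bus was found using the haversine formula, as shown in Equation 3.1.

$$D = R \cos^{-1}(a + b) \qquad (3.1)$$

where

$$R = \text{radius of Earth} = 6371000\ \text{m (mean)}$$

$$a = \cos\left(\frac{\pi}{2} - \mathrm{lat}_1\right) \cos\left(\frac{\pi}{2} - \mathrm{lat}_2\right)$$

$$b = \sin\left(\frac{\pi}{2} - \mathrm{lat}_1\right) \sin\left(\frac{\pi}{2} - \mathrm{lat}_2\right) \cos(\mathrm{lon}_1 - \mathrm{lon}_2)$$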
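As an illustration of Equation 3.1, the distance between two GPS fixes could be computed along the following lines. This is a minimal sketch and not the code used in the study: the function name `great_circle_distance` and the clamping of the arccosine argument are assumptions added here. Written in terms of co-latitudes, as in Equation 3.1, the expression is the spherical law of cosines.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, as stated with Equation 3.1


def great_circle_distance(lat1_deg, lon1_deg, lat2_deg, lon2_deg):
    """Distance in metres between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(
        math.radians, (lat1_deg, lon1_deg, lat2_deg, lon2_deg)
    )
    # Terms a and b of Equation 3.1, written with co-latitudes (pi/2 - lat).
    a = math.cos(math.pi / 2 - lat1) * math.cos(math.pi / 2 - lat2)
    b = (
        math.sin(math.pi / 2 - lat1)
        * math.sin(math.pi / 2 - lat2)
        * math.cos(lon1 - lon2)
    )
    # Assumption: clamp to [-1, 1] so rounding error cannot push the
    # argument outside the domain of acos for (nearly) identical points.
    cos_angle = max(-1.0, min(1.0, a + b))
    return EARTH_RADIUS_M * math.acos(cos_angle)
```

The clamp matters for this data set because consecutive fixes are only 10 seconds apart and a stationary bus produces identical coordinates.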

Table 3.2 shows sample transformed data. From the calculated distances, the corresponding
cumulative distance travelled up to each GPS point was also calculated. The timestamp data,
initially in the format "dd-mm-yyyy HH:MM:SS" (a string), was converted into UNIX time
format, i.e. the number of seconds (an integer) elapsed since 00:00:00 hours, January 01, 1970
until the timestamp under consideration. This conversion speeds up several operations on the
time-stamps. With the help of these time-stamps, the time difference between each pair of
consecutive GPS points was calculated (column with heading t(s) in Table 3.2). The speed of
the bus at a particular point was calculated by dividing the corresponding value of distance
by the time difference.
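The derived columns described above can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary keys, the timestamp pattern (taken from the sample in Table 3.1) and the treatment of timestamps as UTC are all assumptions, and the per-point distances are taken as already computed.

```python
import calendar
import time


def to_unix_time(timestamp, fmt="%d-%b-%y %H:%M:%S"):
    """Convert a timestamp string to UNIX time (assumption: treated as UTC)."""
    return calendar.timegm(time.strptime(timestamp, fmt))


def derive_columns(timestamps, distances_m):
    """Build time difference, cumulative distance and speed for each record.

    distances_m[k] is the distance from point k-1 to point k;
    distances_m[0] is expected to be 0 for the first record.
    """
    unix_times = [to_unix_time(ts) for ts in timestamps]
    rows = []
    cumulative = 0.0
    for k, (unix, dist) in enumerate(zip(unix_times, distances_m)):
        dt = unix - unix_times[k - 1] if k > 0 else 0
        cumulative += dist
        # Speed is distance divided by time difference; 0 when dt is 0.
        speed = dist / dt if dt > 0 else 0.0
        rows.append(
            {
                "unix_s": unix,
                "dt_s": dt,
                "dist_m": dist,
                "cum_dist_m": cumulative,
                "speed_mps": speed,
            }
        )
    return rows
```

Working with integer UNIX times makes the time difference a plain subtraction, which is the speed-up the text refers to.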


Table 3.2: A sample of data records after transformation.

UnixTime(s)   Lon(°)      Lat(°)      t(s)   Dist(m)     CumDist(m)   Speed(m/s)
1379390422    80.127693   12.923      10     43.1092     294.1599     4.31092
1379390432    80.127899   12.922989   10     22.384146   316.544      2.238414
1379390443    80.128448   12.92292    11     60.106274   376.6503     5.464206
1379390453    80.128997   12.922869   10     59.87249    436.5228     5.987249
1379390463    80.129661   12.922829   10     72.165902   508.688      7.21659

3.2.1 Data Cleaning

There were several stray records in the raw data. These could be detected using the distance

and time difference values. Some records had distance more than 1000 metres in a 10

second interval which is impossible since the corresponding speed becomes more than 360

km/h. This may be because of errors in (or misplacement of) the longitude and latitude

values. A higher value of time difference implies the absence of several GPS logs. The

distance in these cases is also inaccurate (since we assumed straight line travel and the bus

might have undergone several changes in direction in a long time). Such data were detected

in an automated way and were not considered for analysis.
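The automated detection is not spelled out in the text, but a check of the following kind would capture both criteria. It is a sketch under stated assumptions: the speed limit of 100 m/s follows from the 1000 m in 10 s example above, while the maximum gap of 60 seconds and the rejection of non-positive time differences are hypothetical choices made here for illustration.

```python
def is_stray_record(dt_s, dist_m, max_speed_mps=100.0, max_gap_s=60):
    """Return True if a transformed record should be excluded from analysis."""
    if dt_s <= 0:
        # Assumption: duplicate or out-of-order timestamps are unusable.
        return True
    if dt_s > max_gap_s:
        # Several GPS logs are missing; the straight-line distance is unreliable.
        return True
    # Implied speed above the limit points to wrong coordinates.
    return dist_m / dt_s > max_speed_mps


def clean(records):
    """Keep only the (dt_s, dist_m) pairs that pass the stray-record check."""
    return [r for r in records if not is_stray_record(r[0], r[1])]
```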

3.2.2 Extracting Trip Data

The daily data files for each device included multiple trips made by that bus in that day. The

first task was to extract the data trip wise and store in separate CSV files. Each such trip file

consisted of about 600 records with the first record corresponding to departure of the bus

from the origin bus station and the last one corresponding to the arrival at the destination

bus station. Each bus made 3 to 4 trips per day. Each trip file was named in

the following format: "IMEI_date_start time_direction.csv", where IMEI and

date are the IMEI number of the device and the date of the data, start time is the

timestamp of the first record of the file (i.e. departure time) and direction indicates whether

it is a north-bound trip or a south-bound trip.

After this extraction, the cumulative distance was updated for each trip separately. The

cumulative distance and the UNIX time were used for plotting the space-time trajectories

(cumulative distance vs. cumulative time) for further analysis as discussed in Section 3.4.


3.2.3 Calculation of Segment-wise Travel Times

For the calculation of travel time, the study routes were discretised into smaller segments

of length 200 meters each. These segments had fixed end points which were maintained

throughout the analysis for calculating the historical travel times. For a particular route,

these segmental travel times were stored in a grid layout where each column represented a

segment and each row represented a trip. Thus, a column consisted of the travel times on a

particular segment for all the trips over four months and a row consisted of the travel times

on all the segments of the route for a particular trip.
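One way to obtain one row of this grid from a trip file is to interpolate the time at which the bus crosses each 200 m boundary. The sketch below assumes linear interpolation between consecutive GPS points, which the text does not state explicitly; the function name and argument names are hypothetical.

```python
from bisect import bisect_left


def segment_travel_times(cum_dist_m, unix_time_s, segment_length_m=200.0):
    """Travel time on each complete fixed-length segment of one trip.

    cum_dist_m must be non-decreasing and aligned with unix_time_s.
    """
    n_segments = int(cum_dist_m[-1] // segment_length_m)
    crossing_times = []
    for s in range(n_segments + 1):
        boundary = s * segment_length_m
        # First GPS point at or beyond the boundary.
        j = bisect_left(cum_dist_m, boundary)
        if j == 0:
            crossing_times.append(float(unix_time_s[0]))
            continue
        d0, d1 = cum_dist_m[j - 1], cum_dist_m[j]
        t0, t1 = unix_time_s[j - 1], unix_time_s[j]
        # d1 > d0 is guaranteed because d0 < boundary <= d1.
        fraction = (boundary - d0) / (d1 - d0)
        crossing_times.append(t0 + fraction * (t1 - t0))
    return [
        crossing_times[s + 1] - crossing_times[s] for s in range(n_segments)
    ]
```

Stacking the returned lists for all trips of a route gives the grid described above, with one row per trip and one column per segment.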

3.3 Correlation Between Segments

Analysis was carried out to find correlations between the segments, in order to check whether
the choice of passed segment travel times as a similarity measure to find historical
trajectories is suitable for prediction of future segment travel times. If a correlation in
terms of travel times exists between segments, previous segments can be taken as related to
later segments along the route. Then, given a current trajectory and a historical trajectory
that is similar in terms of passed segments, their future travel times are also similar with a
high probability. We use Pearson's correlation as the tool to measure the correlation between
segments. The Pearson Product Moment Correlation (Pearson's correlation for short) is widely
used to measure the linear association between two variables. The value of the Pearson's
correlation coefficient always falls between -1 and 1. Positive values mean positive
correlations and negative values mean negative correlations. The farther the value is from 0,
the stronger the correlation. Given two variables $X$ and $Y$, with means $\bar{X}$ and
$\bar{Y}$ and standard deviations $\sigma_X$ and $\sigma_Y$, the correlation $\rho$ between
them is computed as

$$\rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\,\sigma_X \sigma_Y} \qquad (3.2)$$

where $n$ is the number of elements in $X$ and $Y$. The farther two segments are from each
other, the weaker will be the influence of one on the other.
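Equation 3.2 translates directly into code. The sketch below follows the formula term by term, using the sample standard deviation (division by n - 1); the function name is an assumption, and the inputs are taken to be two equal-length columns of the travel time grid.

```python
import math


def pearson_correlation(x, y):
    """Pearson's correlation of two equal-length sequences (Equation 3.2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations, dividing by (n - 1).
    sigma_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sigma_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * sigma_x * sigma_y)
```

The function is undefined when either column is constant, since the corresponding standard deviation is then zero.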

Figure 3.1 can be used to read off the Pearson's correlation for any two segments as well
as its trend with distance. The Y-axis values are the Pearson's correlation coefficients
between two travel time arrays (historical travel times corresponding to two segments) and
the X-axis shows the number of segments between them, which is termed the Segment-Distance.
For example, given the Pearson's correlation between segment 20 and segment 25,
a corresponding point is drawn on the figure with the X-value being 5 (which is 25 minus
20). However, such a figure is not able to offer a clear illustration of the change of Pearson's
correlation because there are too many points for each X-axis value. To solve this problem,
we plotted Figure 3.2, which represents the average value of all points for each X-value.
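The averaging step can be sketched as follows, assuming the travel times are held as a list of rows (one per trip) with one value per segment. The helper names are hypothetical, and the compact correlation formula used here is algebraically the same as dividing by (n - 1) times the two sample standard deviations.

```python
import math
from collections import defaultdict


def _pearson(x, y):
    """Pearson's correlation; undefined (division by zero) for constant input."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def mean_correlation_by_segment_distance(grid):
    """Average correlation over all segment pairs, keyed by segment distance."""
    n_segments = len(grid[0])
    columns = [[row[s] for row in grid] for s in range(n_segments)]
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i in range(n_segments):
        for j in range(i + 1, n_segments):
            # One point of the scatter plot: X = j - i, Y = correlation.
            totals[j - i] += _pearson(columns[i], columns[j])
            counts[j - i] += 1
    return {dist: totals[dist] / counts[dist] for dist in sorted(totals)}
```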

As shown in Figure 3.1, some Pearson's correlation commonly exists between arbitrary
segments. However, the correlation does not appear to be high for most pairs of
segments. Specifically, when two segments are near each other, the Pearson's correlation
is noticeably higher than for other pairs. Therefore, a segment is more related to
nearby segments than to farther ones. Figure 3.2 shows an apparent declining curve from 1
along the X-axis. Based on this, it can be concluded that the segments closest to the one being
analysed are the most correlated with it and can be used as input for prediction.

Figure 3.1: Pearson's correlation coefficients versus the segment distance.

Figure 3.2: Average correlation coefficient versus the segment distance.

3.4 Travel Time Patterns

The second goal of data analysis was to explore any pattern in the data that could be
used for the prediction. Intuitively, travel times on a segment should not only be related to
those on nearby segments, but also to other segment-specific or traffic-related parameters. For
example, in a city area, the traffic conditions are usually the worst during the peak hours in
the morning and evening. Therefore, we can associate the travel times with a temporal feature.
Similarly, the travel times on the same segment may differ between weekdays and
weekends. On weekdays, the travel time may be higher than that on weekends.

The present study analyses the two most common of those patterns, namely the day-wise
pattern and the time-of-the-day pattern. During peak hours in the morning and evening,
congestion happens with a high probability. Therefore the travel time of a segment may be
high during peak hours and low in off-peak hours. To visualize the travel time variation
within a day, within-day travel times are grouped into 14 time periods of 1 hour each (the
usual working hours of the MTC buses). Figure 3.3 shows the variations in travel times along
a day for two typical segments, namely Segment 28 and Segment 100. For each of these segments,
travel times are assigned to the selected 14 bins according to the hour in which they occurred.
The Y-axis represents the travel time in seconds. For each box plot, the thick line in the
middle of the box is the median. The upper edge and lower edge of the box are the 75th and
25th percentiles of the data, respectively. Some data regarded as outliers are shown as
bubbles (outside the upper and lower fences). The upper fence (end point of the dotted line) is
calculated as Median + 1.5(IQR) and the lower fence as Median - 1.5(IQR), where IQR is the
inter-quartile range, i.e., the difference between the 75th and 25th percentile values of the
data.

(a) Variation of travel times on Segment 28 across the hours of the day

(b) Variation of travel times on Segment 100 across the hours of the day

Figure 3.3: Travel time analysis by hours of the day

It can be seen that the travel times in the morning from 8 am to 10 am and in the evening
from 5 pm to 7 pm are relatively higher than others. It can be expected that travel times
on a segment occurring in peak hours are more similar to those in other peak hours, and
travel times in off-peak hours are also likely to be similar to each other. As a general
rule, travel times which occurred around the same time of the day are more similar to
each other. This forms the basis for the temporal neighbourhood scheme introduced in
Chapter 4. According to the scheme, historical trajectories which occurred within a fixed
temporal neighbourhood (of 30 minutes or 1 hour) of the test trajectory are more reliable
for prediction than those outside the neighbourhood.
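The selection implied by the temporal neighbourhood idea can be sketched as a simple filter on time of day. The details are assumptions made for illustration: trips are compared by their departure time given as "HH:MM:SS", the neighbourhood is read as a radius on either side of the test trip, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_since_midnight(hhmmss):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def select_temporal_neighbours(test_start, historical_starts, radius_s=30 * 60):
    """Indices of historical trips departing within radius_s of the test trip."""
    reference = seconds_since_midnight(test_start)
    selected = []
    for index, start in enumerate(historical_starts):
        gap = abs(seconds_since_midnight(start) - reference)
        # Treat the clock as circular so 23:50 and 00:10 count as 20 min apart.
        gap = min(gap, SECONDS_PER_DAY - gap)
        if gap <= radius_s:
            selected.append(index)
    return selected
```

Passing `radius_s=3600` gives the 1 hour variant mentioned above.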


Figure 3.4: Comparison between weekday peak and weekday off-peak trips.

Figure 3.4 shows the space-time trajectories (plots of the cumulative distance against the
cumulative time taken to cover that distance) for all the 2,212 trajectories. The blue
trajectories occurred in the peak hours whereas the green ones occurred in the off-peak
hours during the weekdays. Clearly, the peak hour trajectories have more variance than
those in off-peak hours.

Figure 3.5 shows a heat-map that represents the correlation matrix which was obtained
by binning all the historical travel times on Segment 28 into 14 bins (corresponding to the 14
working hours in a day) and calculating the Pearson's correlation coefficients among them.
The diagonal squares are all white (correlation = 1) since these represent the correlation of
one bin with itself. It is clear from the heat-map that the squares closer to the diagonal are
whiter than those away from the diagonal. This means that the historical travel times which
occurred temporally closer (within a radius of 1-2 hours) to each other are more correlated
with each other. This conclusion forms the basis of the temporal neighbourhood feature for
the selection of similar historical trajectories, as discussed in Section 4.7.

Figure 3.5: Correlations between travel times occurring in different hours of the day

Travel times are not only related to the hour in which they occur, but also to the day on
which they occur. To verify the correlations between travel times and the day, we classified
the days into 2 classes, namely weekday and weekend. As Figure 3.6 indicates, travel times
on weekdays have higher variance than those on weekends. Thus, the assumption of taking
weekday/weekend as a discriminative feature for trajectory selection is also valid. A similar
analysis across different days of the week is shown in Figure 3.7; it can be seen that the days
are not distinctly different from each other and hence they were not separately analysed.

Figure 3.6: Comparison between weekday and weekend trips.

From the above analyses, it can be concluded that several patterns exist in the travel

times of buses moving on the same route. In the present case, the weekday/weekend pat-

tern and the intra-day hourly pattern are the most significant. Based on these patterns, two

schemes based on the temporal features of the trajectories are proposed in Chapter 4. From

the correlation analysis, it was concluded that the correlation between closely spaced seg-

ments is significant. This forms the basis for the passed segments scheme proposed in the

next chapter.


Figure 3.7: Comparison between the weekdays.


CHAPTER 4

THE FRAMEWORK AND THE CLUSTERING ALGORITHM

Through the data analysis presented earlier in Chapter 3, we observed the correlations be-

tween the segment travel times and the various trajectory features. Cluster analysis was

adopted for the identification of the most correlated trips and is discussed in this chapter

along with the other schemes based on temporal features. Using the identified trips as input,

a novel travel time prediction framework, called Historical Trajectory and Kalman Filter

based Arrival/Travel Time Prediction (HTKFTP), based on a large collection of historical

bus trajectories is developed. This chapter focuses on the historical trajectory selection

part, whereas the prediction algorithm is discussed in detail in Chapter 5. The section

below defines the necessary terminology used in this framework.

4.1 Terms and Definitions

Since the buses are travelling on fixed routes, the geometrical routes in a two-dimensional

space can be represented in a one-dimensional space, where the position of each point on

the route is the distance from the start of the route. A route can be considered as consisting

of points on it and in a classical way, we choose a series of points to represent a route. In

our case, each point on the route is at a distance from the origin which is a multiple of

200 meters (along the route), so that the entire route is split into segments of 200 meters

length. This choice, of having smaller segments to represent the route, was made to closely

capture the pattern of segment-wise travel times along the route for a particular journey.

The various terms used in this study, are defined below.

1. A raw route $R_{raw}$ is represented as a sequence of points, $\langle p_0, p_1, \ldots, p_n \rangle$.

Each point $p_i$ stands for the starting point of the $i$th segment and its value denotes the
total distance along the route from the starting point of the route to the end of the $(i-1)$th
segment. Thus, $p_i < p_{i+1}$. $n$ is the total number of points on the route including the origin
and destination.

2. A segment $S_i$ is a part of a route between two adjacent points $p_i$ and $p_{i+1}$.

3. A route $R$ is represented by a sequence of segments, $\langle S_0, S_1, \ldots, S_{n-1} \rangle$.

The value of $S_i$ denotes $p_{i+1} - p_i$. Our goal is to predict the arrival times at the bus stops.
Each bus stop has a latitude and longitude which lies on the route.

4. A route is also represented by a sequence of bus stops, $\langle B_0, B_1, \ldots, B_m \rangle$.

The value of $B_i$ denotes $S_0 + \ldots + S_{l-1} + d$ if the bus stop $B_i$ lies on the segment $S_l$.
Here, $d$ is the distance along the route from $p_l$ (end point of segment $S_{l-1}$) to the location
of bus stop $B_i$. The trajectory data of a bus journey consists of a series of time-stamped
locations of the bus on the route.

5. A raw trajectory $T_{raw}$ is represented as a sequence $\langle (p_0, t_0), \ldots, (p_n, t_n) \rangle$.

$p_i \in R$ and $t_i$ denotes the travel time from $p_0$ to $p_i$.

6. A trajectory $T$ is a sequence $\langle t_0, \ldots, t_N \rangle$.

$t_i$ denotes the travel time on $S_i$ and $N$ ($= n - 1$) is the number of segments on the
route. During a complete trajectory, a bus generates travel times on all the segments on the
route. Therefore, given $M$ historical trajectories, there are $M$ travel times for each segment.

7. For a segment $S_i$, there is a corresponding sequence of travel times, $\langle t^{S_i}_0, \ldots, t^{S_i}_{M-1} \rangle$.

$t^{S_i}_j$ is the travel time of a bus on this segment in the $(j+1)$th trajectory and $M$ is the
number of historical trajectories. This sequence is denoted as $ST_i$ for $S_i$.

For a particular route, all the historical trajectory data can be stored in a table format in

which the columns represent the attributes of the trips such as start time from the origin,

date of trip and the travel times on each segment whereas the rows (or records) represent

the individual trajectories. As discussed later in this chapter, the historical travel times on

a particular segment are clustered into smaller groups so as to minimize the within-cluster

variance for each group.

8. Given a sequence of historical travel times $ST_i$ for a segment $S_i$, it can be split into
a sequence of intervals (or clusters) $SC_i := \langle C^{S_i}_0, \ldots, C^{S_i}_{K-1} \rangle$.

$K$ is the number of travel time clusters for $S_i$. Note that $K$ is a random variable which
depends on the variance of the historical travel times on the segment. The process from $ST_i$
to $SC_i$ is explained later in this chapter.

4.2 Problem Formulation

Consider a bus route $R := \langle S_0, \ldots, S_{N-1} \rangle$ with $N$ segments. For a bus travelling on segment
$S_i$, its current (incomplete) trajectory $T_{curr}$ can be represented as a sequence of travel times
on the passed segments, i.e. $T_{curr} := \langle t^{curr}_0, \ldots, t^{curr}_i \rangle$ ($0 \le i < N$). Let $d$ be the
distance of the bus from $p_{i+1}$ (end point of $S_i$) along the route. Suppose the bus stop $B_j$ at
which to predict the bus arrival time lies on $S_l$ ($l > i$), and let $d'$ be the distance of $B_j$ from $p_l$
(start point of $S_l$).

Given $M$ historical trajectories and $T_{curr}$, we aim to develop an effective framework to
predict the travel times $t_i, \ldots, t_l$ on the segments $S_i, \ldots, S_l$. The arrival time of the
bus at $B_j$ is given by

$$A_{B_j} = T + \frac{d}{S_i}\, t_i + t_{i+1} + \ldots + t_{l-1} + \frac{d'}{S_l}\, t_l \qquad (4.1)$$

where $T$ is the current time and $A_{B_j}$ is the arrival time at $B_j$.
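A small worked sketch of Equation 4.1 may help. It assumes the predicted segment travel times are already available, and all names are hypothetical: `d_remaining` is the distance from the bus to the end of its current segment, and `d_stop` is the distance of the stop from the start of the segment on which it lies.

```python
def predict_arrival_time(
    current_time, segment_lengths, predicted_times, i, l, d_remaining, d_stop
):
    """Arrival time at a stop on segment l for a bus currently on segment i.

    Requires l > i. Lengths are in metres, times in seconds.
    """
    # Share of the current segment's travel time still to be spent.
    partial_current = d_remaining / segment_lengths[i] * predicted_times[i]
    # Full travel times on the segments strictly between i and l.
    full_between = sum(predicted_times[i + 1 : l])
    # Share of the last segment's travel time needed to reach the stop.
    partial_last = d_stop / segment_lengths[l] * predicted_times[l]
    return current_time + partial_current + full_between + partial_last
```

For example, with four 200 m segments, predicted times of 30, 40, 50 and 60 seconds, a bus 100 m from the end of segment 0 and a stop 50 m into segment 3, the terms are 15 + (40 + 50) + 15 seconds, so the bus is expected 120 seconds after the current time.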

4.3 Overview of the Framework

In this section, we first provide an overview of the proposed HTKFTP system framework

and then discuss the details of the cluster analysis (in the passed segments scheme) carried

out for pattern identification. Figure 4.1 shows the system design of the HTKFTP

framework. As illustrated, the proposed HTKFTP system (i.e., a location based service)

continuously collects bus trajectory data from GPS-equipped buses which report the latest

bus status including time-stamped geographical coordinates of the bus and instant speed.

The HTKFTP server is responsible for receiving and storing the trajectory data, monitoring

the incomplete trajectories of on-going buses, and making prediction of bus travel time on

the routes in response to (i) passenger enquiries and (ii) real time updates of bus arrival
times at bus stops. As shown in Figure 4.1, the HTKFTP server consists of three modules:
a) Real-time Bus Status Monitoring (RBSM) module; b) Travel Time Prediction (TTP)
module; and c) Nearest Neighbour Search (NNS) module.

Figure 4.1: Overall architecture of the HTKFTP framework

The RBSM module is responsible for communicating with the buses to receive bus

status information and GPS data updates of the on-going trajectories. Once an update from

a bus b reaches the server, RBSM catches the status (such as current bus coordinate and

new time stamp) of b, extracts features associated with the developing trajectory Tb, and

stores the information as part of Tb in the historical trajectory repository.

The TTP module is responsible for predicting the arrival times of buses at bus stops,

which can be reduced to a problem of predicting the travel times of buses on their remain-

ing route segments. As mentioned, the TTP module can be invoked to make predictions by

(i) a passenger enquiry; or (ii) the real-time updates of bus arrival information at stops. The

former arrives on demand and the latter happens periodically. In this paper, for simplicity,

we focus on predicting the travel time of a bus, given its current location, on remaining seg-

ments of its journey on the bus route. Moreover, instead of constantly making predictions,


we assume that TTP is invoked every time when RBSM receives an update that the bus has

crossed a segment (including the GPS data of bus location) and passes the required input

parameters for prediction to TTP. Our idea behind the TTP module is to use a few best

matches for the ongoing trajectory as inputs to a Kalman Filter, which efficiently predicts

the travel times on the future segments by employing a robust mechanism.

As illustrated in the figure, TTP relies on NNS module to search for similar trajectories

effectively and efficiently. As there could be different ways to identify the sample set of

similar trajectories, different notions of similarity could be explored to ensure the effective-

ness of TTP. On the other hand, with a massive amount of historical data, it is infeasible to

make exhaustive comparison between the trajectory of current bus journey against all the

historical trajectories in the database. To ensure the search efficiency, we create indices of

trajectories and related patterns in the NNS module to avoid retrieval of irrelevant trajec-

tories that are not helpful for our travel time estimation. In other words, we only fetch a

relatively small set of candidate trajectories and return them back to TTP.

The HTKFTP is introduced above as a general framework to support travel time predic-

tion. The remaining task is to devise similarity trajectory based prediction schemes which

first invoke the NNS module to retrieve a sample set of trajectories for making effective

travel time estimation in the TTP module. Based on our data analysis, we observed the

travel time correlation between two segments and the travel time patterns corresponding to

some temporal features such as hours and days. Therefore, we follow these observations

to introduce two schemes based on passed segments and temporal features. As their names

suggest, these two schemes use the passed segments (PS) and temporal features (TF) of an

on-going bus journey, respectively, to identify similar trajectories for prediction.

4.4 Trajectory Search based on Passed Segments Scheme

In PS scheme, the prediction is done by finding the historical trajectories "similar" to the

current one in terms of the travel times on the segments already crossed by the moving bus.

Thus a similarity measuring algorithm has to be taken into consideration. As mentioned
before, the conventional algorithms measuring similarity of time series, such as the Lp-norm,
Dynamic Time Warping (DTW) and Longest Common Subsequence (LCSS), are not appropriate
in this project because those algorithms are highly sensitive to any error or outlier
in the data. As a result, a slight variation in the collected data might result in dramatic
mismatches between the current trajectory and historical trajectories. In addition, these
algorithms only evaluate the overall similarity of the whole trajectory, rather than the
similarity of trajectories with respect to each segment. To address the above problems, we
propose a new similarity measure that takes into account the similarity between two
trajectories on each segment. Given two trajectories, $\langle t_0, \ldots, t_n \rangle$ and
$\langle t'_0, \ldots, t'_n \rangle$, we compare each pair of travel times $t_i$ and $t'_i$. If the
difference between each pair is less than a threshold specific to that segment, the two
trajectories are considered "similar". This method improves on the conventional distance
measure algorithms in that, for two "similar" trajectories, not only the whole trajectories
but also the corresponding segments should be similar.
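The segment-wise test described above can be written in a few lines. This is a sketch with hypothetical names; in particular, how the per-segment thresholds are chosen is not specified at this point in the text, so they are simply passed in.

```python
def is_similar(current, historical, thresholds):
    """True if every passed-segment travel time differs by less than its threshold.

    current holds travel times on the segments passed so far; historical and
    thresholds may be longer, and only the leading entries are compared.
    """
    return all(
        abs(t_cur - t_hist) < threshold
        for t_cur, t_hist, threshold in zip(current, historical, thresholds)
    )


def find_similar(current, historical_trajectories, thresholds):
    """Indices of all historical trajectories similar to the current one."""
    return [
        index
        for index, trajectory in enumerate(historical_trajectories)
        if is_similar(current, trajectory, thresholds)
    ]
```

A single large deviation on one segment is enough to reject a candidate, which is the difference from measures that only score the trajectory as a whole. The linear scan over all historical trajectories is also what makes this direct approach slow for a large repository.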

However, this method is limited by the low efficiency that is caused by searching for

similar travel times based on each segment, especially when the number of historical tra-

jectories is large. To overcome this, we can allocate travel times into clusters and match the

current travel time to the cluster averages to find the appropriate cluster.

To better illustrate this problem, we provide the following example. For a specific route,

given a number of historical trajectories on this route, we can create a table with attributes

(columns) corresponding to the travel times of each segment of the route and a record

corresponding to each historical trajectory (Table 4.1).

Table 4.1: An example of segment-wise travel times on historical trajectories.

Trajectory ID   Segment0   Segment1   Segment2   Segment3
Trajectory1     20         155        63         29
Trajectory3     15         262        55         37
Trajectory4     93         90         73         21
...             ...        ...        ...        ...
TrajectoryM     t0         t1         t2         t3

By using the clustering algorithm that will be discussed next, we partition each column

into several non-overlapping ranges. Each range contains at least one value and each value

falls in only one range. Table 4.1 can thus be transformed into a table as shown in Table 4.2,

where the number of ranges for each segment is not necessarily equal.


Table 4.2: An example of partitioned segment-wise travel times after application of the clustering algorithm.

Segment0   Segment1   Segment2   Segment3
15;20      75;89;90   55         21
32         155;180    61;63      26;29
68         262        69;73      33;36;37
82;93      -          77         -

4.5 The Clustering Algorithm

In this section we consider an algorithm used to partition each sequence of travel time values, $ST_i$, into a sequence of clusters $SC_i$. Since $ST_i$ is a sequence of numerical values, it can be represented in a one-dimensional space, and splitting such a sequence amounts to allocating a set of one-dimensional data into clusters. Lee et al. (2012) compared two clustering algorithms, namely K-means and V-clustering. As they pointed out in their work, the K-means algorithm has two limitations. Firstly, the initial cluster centroids are chosen randomly and different choices may cause different clustering results. Secondly, there is no generally accepted way of determining the value of K, so it is hard to choose a good value. Using experiments with real-world data, they also found V-clustering to perform better than K-means (in the passed segment scheme). So, in this study, we concentrate only on the V-clustering algorithm.

The V-clustering algorithm was introduced by Yuan et al. (2010) to allocate a sorted list of one-dimensional data into clusters. In this algorithm, a list of values is first sorted. Then it is split into clusters in an iterative manner. At each iteration, the list is split into two parts and the weighted average variance (WAV) is calculated for the resulting child lists. The optimum split is the one that minimizes the WAV of the resulting child lists. The WAV for a split at the ith element of the list is defined in Equation 4.2.

$$\mathrm{WAV}_i = \frac{|L_1^i|}{|L|}\,\mathrm{Var}(L_1^i) + \frac{|L_2^i|}{|L|}\,\mathrm{Var}(L_2^i) \qquad (4.2)$$

where $|L_1^i|$ and $|L_2^i|$ are the cardinalities of the resulting child lists for the ith split, $|L|$ is the cardinality of the list being split, and $\mathrm{Var}(L_1^i)$ and $\mathrm{Var}(L_2^i)$ are their respective variances. The list is recursively partitioned so that the running time of the clustering algorithm for a segment with M historical travel times becomes O(log M). Hence, the running time for the entire trajectory database is


O(N log M), where N is the total number of segments on the route. The iteration is stopped when each cluster is left with a minimum number of travel times (the minimum number of trajectories, MNT), which is a tunable parameter: its value is chosen to strike a balance between minimizing the prediction errors and maximizing the computational speed. Each cluster for a segment is associated with a cluster average, i.e., the average of all the travel times in it. The values of the various parameters of the clustering algorithm are selected after the experiments with real-world data discussed in Chapter 6.
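The recursive splitting described in this section can be illustrated with a short Python sketch. It is not the implementation used in this study. The function names, the use of the population variance, and the stopping rule (a split is accepted only if both child lists keep at least `mnt` values) are assumptions made for the example.

```python
from statistics import mean, pvariance


def wav(left, right):
    """Weighted average variance of two child lists (Equation 4.2),
    using the population variance (an assumption of this sketch)."""
    total = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / total


def v_cluster(values, mnt=2):
    """Split a list of travel times into clusters by recursively choosing
    the split index that minimises the WAV of the two child lists."""
    if mnt < 1:
        raise ValueError("mnt must be at least 1")
    values = sorted(values)
    # Admissible split points leave at least `mnt` values on each side.
    candidates = range(mnt, len(values) - mnt + 1)
    if len(candidates) == 0:
        return [values]
    best = min(candidates, key=lambda i: wav(values[:i], values[i:]))
    return v_cluster(values[:best], mnt) + v_cluster(values[best:], mnt)


# Six illustrative travel times for one segment.
clusters = v_cluster([20, 15, 93, 32, 68, 82], mnt=2)
averages = [mean(c) for c in clusters]  # one cluster average per cluster
```

With this stopping rule a larger `mnt` yields fewer, coarser clusters, which is the trade-off between prediction error and computational speed mentioned above.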

4.6 Nearest Neighbour Search in Passed Segments Scheme

Given a current trajectory $\langle t_0, t_1, t_2 \rangle$ with the passed segments $S_0$, $S_1$ and $S_2$, suppose $t_2$ falls in a certain cluster for $S_2$; we take that cluster as the match to $t_2$. All trajectories whose travel times on $S_2$ fall in the matching cluster are marked. Matching is usually done by finding the cluster for the particular segment whose cluster average is closest to $t_2$ (the current trajectory's actual travel time on $S_2$). The same operation is applied to $S_1$ and $S_0$, and we can then find the trajectories whose travel times on the three passed segments fall in all the matched clusters. This method, known as Segment filtering, was introduced by Lee et al. (2012). Since the historical trajectories' travel times are similar to those of the current trajectory on each of the three passed segments, they can be considered "similar" to the current trajectory and can be used for prediction. However, the Segment filtering method

has a limitation when the number of historical trajectories is small (i.e.



searched to find the match.
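The basic Segment filtering step can be sketched as follows, using the rows of Table 4.1 and the clusters of Table 4.2 as input. The data layout (a dictionary of trajectories and a list of clusters per segment) and the function names are assumptions made for this illustration, and the handling of the case with few historical trajectories is not covered.

```python
def cluster_average(cluster):
    """Average travel time of one cluster."""
    return sum(cluster) / len(cluster)


def segment_filter(history, clusters_per_segment, current):
    """Return the IDs of historical trajectories whose travel times fall in
    the matched cluster on every passed segment of the current trajectory."""
    matched = set(history)
    for seg, t in enumerate(current):
        # The matched cluster is the one whose average is closest to t.
        best = min(clusters_per_segment[seg],
                   key=lambda c: abs(cluster_average(c) - t))
        lo, hi = min(best), max(best)
        matched = {tid for tid in matched if lo <= history[tid][seg] <= hi}
    return matched


history = {1: [20, 155, 63, 29], 3: [15, 262, 55, 37], 4: [93, 90, 73, 21]}
clusters_per_segment = [
    [[15, 20], [32], [68], [82, 93]],
    [[75, 89, 90], [155, 180], [262]],
    [[55], [61, 63], [69, 73], [77]],
]
similar_ids = segment_filter(history, clusters_per_segment, [18, 160, 62])
```

In this example Trajectory3 survives the filter on Segment0 but is dropped on Segment1, so only Trajectory1 remains after all three passed segments.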

4.7 Similarity based on Temporal Features

Besides the passed segment (PS) scheme, we also propose a scheme that uses features inside the historical data that are directly related to the travel time. Using similar trajectories found from the PS scheme, we can provide satisfactory predictions. However, this method cannot guarantee accuracy under all circumstances. For example, when unusual events happen on a future segment, it is hard to make a reliable prediction from historical data because such events might never have occurred in the history (limited by the amount of historical data collected). Fortunately, by resorting to features related to traffic information on the current segment, we can make predictions of travel times by first selecting trajectories that match the temporal features and then using the PS scheme. For example, the time when a bus enters a segment is important because the traffic changes with the time of day. During the morning and evening peak hours, congestion occurs with a high probability. Also, intuitively, the segment-wise travel times on two trajectories that are close together in time may be highly correlated. As shown in Figure 3.6 in Chapter 3, the weekday and weekend trips have different variances in their space-time trajectories. Hence, in the final hybrid scheme, in order to make predictions for an ongoing trip, the day on which it is occurring is first used to select weekday or weekend trajectories. On this set of trajectories the TF (temporal neighbourhood) and PS schemes are applied in sequence to find the final set of similar trajectories.
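The first two stages of the hybrid scheme (day-type selection followed by the temporal neighbourhood) can be sketched as below. The record layout, the field name `entry_time` and the 60-minute window are hypothetical choices for the illustration; the actual parameter values are a matter for the experiments in Chapter 6.

```python
from datetime import datetime


def is_weekend(ts):
    """Saturday and Sunday count as weekend (weekday() returns 5 and 6)."""
    return ts.weekday() >= 5


def minutes_apart(a, b):
    """Gap between two times of day in minutes, wrapping around midnight."""
    gap = abs((a.hour * 60 + a.minute) - (b.hour * 60 + b.minute))
    return min(gap, 24 * 60 - gap)


def temporal_filter(trips, now, window_min=60):
    """Keep historical trips of the same day type (weekday or weekend)
    whose segment entry time of day lies within window_min of now."""
    return [
        trip for trip in trips
        if is_weekend(trip["entry_time"]) == is_weekend(now)
        and minutes_apart(trip["entry_time"], now) <= window_min
    ]


trips = [
    {"id": "A", "entry_time": datetime(2014, 5, 13, 8, 30)},  # Tuesday
    {"id": "B", "entry_time": datetime(2014, 5, 17, 8, 40)},  # Saturday
    {"id": "C", "entry_time": datetime(2014, 5, 14, 14, 0)},  # Wednesday
]
selected = temporal_filter(trips, now=datetime(2014, 5, 19, 9, 0))  # Monday
```

The trips that survive this filter would then be passed to the PS scheme to obtain the final set of similar trajectories.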

4.8 Summary

The final refined set of similar trajectories that results from the application of the hybrid scheme introduced above is used for prediction of travel times on the upcoming segments of the ongoing trajectory. Experiments to support our claim that the hybrid scheme is more effective in prediction than the individual ones were carried out with real-world data, as discussed in Chapter 6 on performance evaluation. The prediction algorithm based on the Kalman Filter and the modifications made to the base Kalman algorithm are explained in detail in the next chapter.


CHAPTER 5

THE PREDICTION ALGORITHM

In Chapter 4, we explored the various schemes to search for similar historical trajectories that are effective for travel time prediction (TTP). The problem now is to use the travel times of the identified similar trajectories to predict those of the current trajectory. The simplest way to predict the travel time on an upcoming segment of the current trip is to use the mean (or median) of all the travel times from the identified similar trajectories on the corresponding upcoming segment. Another way is to give weights to the individual trajectories before calculating the mean. The weight given to a similar trajectory can be the inverse square of its Euclidean distance¹ from the ongoing trajectory, as explained in Larose (2005). However, in this study, we focus on a robust, short-term prediction technique based on the Kalman Filter (KF), which can take into account the associated variability to a certain extent. Before moving on, we review some previous work involving the Kalman Filter for travel time prediction, including work attempted in the heterogeneous traffic conditions prevalent in India.
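The weighted-mean alternative mentioned above can be illustrated with the following Python sketch. It is not the method adopted in this study. The function names are hypothetical, and returning the travel time of an exactly matching trajectory (distance zero) is an assumption made to avoid division by zero.

```python
import math


def euclidean_distance(test_times, hist_times):
    """Distance over the segments already crossed by the test trip."""
    m = len(test_times)
    return math.sqrt(sum((t - h) ** 2 for t, h in zip(test_times, hist_times[:m])))


def weighted_prediction(test_times, similar, segment):
    """Predict the travel time on `segment` as the mean of the similar
    trajectories' travel times, weighted by inverse squared distance."""
    weights, values = [], []
    for hist_times in similar:
        d = euclidean_distance(test_times, hist_times)
        if d == 0:
            # Assumption of this sketch: an exact match dominates.
            return hist_times[segment]
        weights.append(1.0 / d ** 2)
        values.append(hist_times[segment])
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Two passed segments; the prediction is for the third segment (index 2).
prediction = weighted_prediction([10, 20], [[13, 24, 60], [10, 30, 100]], segment=2)
```

Here the first historical trajectory lies at distance 5 and the second at distance 10, so the first receives four times the weight of the second.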

5.1 Travel Time Prediction using Kalman Filter

The Kalman Filter was first introduced in 1960, when Kalman (1960) published his famous paper describing a recursive solution to the discrete-data linear filtering problem. The Kalman filter is a set of mathematical equations that provides an efficient computational (recursive) means to estimate the state of a process, in a way that minimizes the mean of the squared error. As mentioned in Welch and Bishop (2006), the filter is very powerful in several aspects: it supports estimations of past, present, and even future states, and it can do so even when the precise nature of the modelled system is unknown.

In the literature of travel time prediction, Chein and Kuchipudi (2002), Liu et al. (2006), Nanthawichit et al. (2003), Chen and Chein (2001) and Yang (2005) are some of the earliest to introduce KF. Nanthawichit et al. (2003) and Yang (2005) explored the possibility of using GPS probe vehicle data in KF for travel time prediction. Vanajakshi et al. (2009) is one of the earliest attempts to use KF with GPS probe vehicle data for short-term travel time prediction under heterogeneous traffic conditions such as those prevalent in India. From their travel time variation plots (across the route), the authors concluded that the travel time patterns along the route were more closely related for consecutive vehicles (with a headway of the order of 15 min) on the same day. Weekly and daily patterns were not as significant. Hence, they used the travel times of the previous two vehicles for predicting the travel time of the test vehicle (the ongoing trip). However, when the headways between the consecutive vehicles are larger (of the order of 1 hour), the accuracy of the approach decreases. This is a serious issue when the previous vehicles passed during an off-peak hour and the test vehicle passes in a peak hour, or vice versa. Since the inputs to KF in this method are fixed most of the time, there is no means to restore the accuracy once one of the above-mentioned issues creeps in. Hence, there was a need to modify the method such that it uses dynamic inputs for prediction in order to address the traffic condition prevailing at the moment. This is where the similar trajectory search discussed in Chapter 4 can be helpful. Based on the latest actual travel times of the test vehicle, the trajectory search algorithm finds all the historical trajectories which occurred under the same traffic conditions as the current one. In Chapter 6, we show using real-world data that the dynamic input method outperforms the static input method. In the following section, we discuss the base KF algorithm as described in Vanajakshi et al. (2009). In the subsequent sections, we discuss the changes made to both the base KF algorithm and the trajectory search algorithm to effectively integrate them for travel time prediction.

¹ Square root of the sum of squares of differences between the travel times on the corresponding segments of the historical and the current trajectory (up to the segments crossed in the current trip).

5.2 The Base KF Algorithm

It is assumed that the evolution of travel time between the various segments is governed by

$$t_{i+1} = a_i t_i + w_i \qquad (5.1)$$

where $t_i$ is the travel time taken for covering $S_i$ (the ith subsection), $a_i$ is a parameter that relates the travel time taken in $S_i$ to the travel time taken in $S_{i+1}$, and $w_i$ is the process disturbance associated with $S_i$. The measurement process was assumed to be governed by

$$z_i = t_i + v_i \qquad (5.2)$$

where $z_i$ is the measured time of travel in $S_i$ and $v_i$ is the measurement noise. It was further assumed that $w_i$ and $v_i$ are zero-mean white Gaussian noise signals with $Q_i$ and $R_i$ being their corresponding variances.

The prediction algorithm requires as input at least two trajectories in the form of segment-wise travel times. The trajectory which is more similar to the current one is called the base trajectory (denoted by $T_{base}$) and the other one is called the correction trajectory (denoted by $T_{corr}$). The data obtained from $T_{base}$ were used to obtain the value of $a_i$ for each subsection. The data from $T_{corr}$ were used in the prediction algorithm to obtain the estimate of travel time of the test (or the ongoing) trajectory (denoted by $T_{test}$). The steps involved in the algorithm are as follows:

1. The travel time data from $T_{base}$ was used to obtain the value of $a_i$ through $a_i = t^{T_{base}}_{i+1} / t^{T_{base}}_i$, $i = 1, ..., (N-1)$, where $t^{T_{base}}_i$ is the travel time taken in $T_{base}$ to cover $S_i$.

2. The discretisation is carried out over space rather than over time (as is done in traditional applications of the KF). Let $t^{T_{test}}_i$ denote the travel time taken in $T_{test}$ to cover $S_i$. It is assumed that $E[t^{T_{test}}_1] = \hat{t}_1$ and $E[(t^{T_{test}}_1 - \hat{t}_1)^2] = P_1$, where $\hat{t}_1$ is the estimate of the travel time in $T_{test}$ on $S_1$.

3. For $i = 2, ..., (N-1)$, the following steps are performed:

(a) The a priori estimate of the travel time is calculated using $\hat{t}^-_{i+1} = a_i \hat{t}^+_i$, where the superscript $-$ denotes the a priori estimate and the superscript $+$ denotes the a posteriori estimate.

(b) The a priori error variance (denoted by $P^-$) was calculated using $P^-_{i+1} = a_i P^+_i a_i + Q_i$.

(c) The Kalman gain (denoted by $K$) was calculated using $K_{i+1} = P^-_{i+1} / (P^-_{i+1} + R_{i+1})$.

(d) The a posteriori travel time estimate and error variance were calculated using, respectively, $\hat{t}^+_{i+1} = \hat{t}^-_{i+1} + K_{i+1}[z_{i+1} - \hat{t}^-_{i+1}]$ and $P^+_{i+1} = [I - K_{i+1}] P^-_{i+1}$, where the data measured from $T_{corr}$ was used for providing the values of $z_{i+1}$ in the equation to calculate $\hat{t}^+_{i+1}$.

Thus, the objective here is to predict the travel times of $T_{test}$ using the travel time data obtained from $T_{base}$ and $T_{corr}$. When $T_{test}$ is in $S_i$, its travel time for $S_{i+1}$, denoted by $t_{i+1}$, is predicted. The KF algorithm works like a predictor-corrector algorithm. The a posteriori estimate of $t_i$ of $T_{test}$ is used to obtain the a priori estimate of $t_{i+1}$ (this being the prediction step), and then the measurement of the travel time of $T_{corr}$ in $S_{i+1}$ (denoted by $z_{i+1}$ in the equations in step 3(d) above) is used to obtain the a posteriori estimate of $t_{i+1}$ of $T_{test}$ (this being the correction step). In the following section, we discuss the modifications made to the trajectory search algorithm and the base KF algorithm in order to integrate them and to tackle a few issues concerned with the variance of travel times.
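The predictor-corrector recursion of this section can be sketched in Python as follows. This is a simplified illustration, not the code used in the study: the function name is hypothetical, the noise variances `q` and `r` are taken as constants instead of segment-specific $Q_i$ and $R_i$, the recursion is run from the first segment with 0-based indexing, and all base travel times are assumed to be positive.

```python
def kalman_travel_times(t_base, t_corr, t1_hat, p1, q, r):
    """Scalar Kalman filter over space: returns the a posteriori travel
    time estimates for every segment of the test trajectory."""
    if len(t_base) != len(t_corr):
        raise ValueError("base and correction trajectories must match in length")
    t_post, p_post = t1_hat, p1          # initial estimate and error variance
    estimates = [t_post]
    for i in range(len(t_base) - 1):
        a = t_base[i + 1] / t_base[i]    # step 1: ratio from the base trajectory
        t_prior = a * t_post             # step 3(a): a priori estimate
        p_prior = a * p_post * a + q     # step 3(b): a priori error variance
        gain = p_prior / (p_prior + r)   # step 3(c): Kalman gain
        # step 3(d): correct with the measurement from the correction trajectory
        t_post = t_prior + gain * (t_corr[i + 1] - t_prior)
        p_post = (1.0 - gain) * p_prior
        estimates.append(t_post)
    return estimates


# With no process noise and zero initial variance the gain is zero,
# so the estimates simply follow the ratios of the base trajectory.
follows_base = kalman_travel_times([10, 20, 40], [12, 22, 44], 10.0, 0.0, 0.0, 1.0)
# With no measurement noise the gain is one and the estimates
# follow the correction trajectory.
follows_corr = kalman_travel_times([10, 20, 40], [12, 22, 44], 10.0, 0.0, 1.0, 0.0)
```

The two limiting cases show how the ratio of `q` to `r` decides whether the prediction leans on the base trajectory or on the correction trajectory.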

5.3 Integration of Trajectory Search and Prediction Algorithms

As discussed in the previous section, the KF-based algorithm needs only the two best-matched trajectories for travel time prediction. Based on the actual travel times received in real time from the test vehicle, the trajectory search algorithm finds similar historical trajectories with travel time patterns matching that of the current one. The task now is to rank the matched trajectories based on some metric and send the top two to the prediction algorithm. To accomplish this, for each matched trajectory, its Euclidean distance from the test trajectory is computed using the equation

$$ED = \sqrt{(t^{T_{test}}_1 - t^{T_{hist}}_1)^2 + (t^{T_{test}}_2 - t^{T_{hist}}_2)^2 + \cdots + (t^{T_{test}}_m - t^{T_{hist}}_m)^2} \qquad (5.3)$$

where $ED$ is the Euclidean distance between the test trajectory and a matched historical trajectory, $t^{T_{test}}_i$ is the travel time on $S_i$ for the test vehicle, $t^{T_{hist}}_i$ is the travel time on $S_i$ for a matched historical trip and $m$ is the number of segments crossed by the test vehicle when the request is made. The above Euclidean distance gives a measure of similarity between two trajectories with respect to their individual segment travel times. The matched trajectories are now ranked in increasing order of their EDs from the test trajectory. The top two are sent to the prediction algorithm. As the test vehicle moves from one segment to the next, with the newly available actual travel time of the test vehicle, the trajectory search algorithm again finds the best matches from history, ranks them and sends the top two to the prediction algorithm, which updates the previous predictions with more accurate ones, thus making the process dynamic in nature.

Figure 5.1: Variation of travel time variance across the segments of 19B route
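The ranking step of this section can be sketched as follows. The names `top_two` and `matched` are hypothetical, and the example reuses the rows of Table 4.1 as the matched historical trajectories purely for illustration.

```python
import math


def euclidean_distance(test_times, hist_times):
    """Equation 5.3, evaluated over the m segments crossed so far."""
    m = len(test_times)
    return math.sqrt(sum((t - h) ** 2 for t, h in zip(test_times, hist_times[:m])))


def top_two(test_times, matched):
    """Rank matched trajectories by increasing distance from the test
    trajectory and return the IDs of the base and correction trajectories."""
    if len(matched) < 2:
        raise ValueError("at least two matched trajectories are required")
    ranked = sorted(matched, key=lambda tid: euclidean_distance(test_times, matched[tid]))
    return ranked[0], ranked[1]


matched = {
    "Trajectory1": [20, 155, 63, 29],
    "Trajectory3": [15, 262, 55, 37],
    "Trajectory4": [93, 90, 73, 21],
}
# The test vehicle has crossed three segments when the request is made.
base_id, corr_id = top_two([18, 160, 62], matched)
```

Calling `top_two` again each time a new segment travel time becomes available gives the dynamic behaviour described above, since the ranking may change as more of the current trip is observed.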

5.4 Modifications

The main issue faced by the existing algorithm was the high variance in travel times during certain periods of the day and on certain segments, which leads to higher prediction errors on the affected trips or segments. As can be seen in the box plots in Figure 3.3 (Chapter 3), during the peak hours, besides the median travel time (thick line inside the box), the variance of travel times also increases (indicated by the increased height of the box). Figure 5.1 shows that the variance of travel times is also high on certain segments of the route. Each line in the plot is obtained by calculating the variance of travel times at each segment for the trips that occurred in a two-hour band in the historical data.

To address the high variance (to some extent), the $Q_i$ and $R_i$ values which represent
