7/22/2019 B Tech Project Thesis
1/69
DATA ANALYTICS BASED
DYNAMIC PASSENGER INFORMATION SYSTEM
A Project Report
submitted by
RAKESH BEHERA
in partial fulfilment of the requirements
for the award of the degree of
BACHELOR OF TECHNOLOGY
TRANSPORTATION DIVISION
DEPARTMENT OF CIVIL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
MAY 2014
CERTIFICATE
This is to certify that the project report titled Data Analytics Based Dynamic Passenger
Information System, submitted by Rakesh Behera, to the Indian Institute of Technology,
Madras, for the award of the degree of Bachelor of Technology, is a bonafide record of
the research work done by him under my supervision. The contents of this report, in full or
in parts, have not been submitted to any other Institute or University for the award of any
degree or diploma.
Dr. Lelitha Devi V.
Project Guide
Associate Professor
Dept. of Civil Engineering
IIT-Madras, 600 036
Prof. Meher Prasad A.
Head of the Department
Professor
Dept. of Civil Engineering
IIT-Madras, 600 036
Place: Chennai
Date: 19th May 2014
ACKNOWLEDGEMENTS
My earnest thanks to Dr. Lelitha Devi, for her support throughout the study. It is through her
guidance that the project has gained structure and been accomplished in such a short span
of time. Her foresight and expertise have helped us make the right choices in the project
and otherwise. I am thoroughly indebted to her for the amount of time she has spent in
reviewing my analyses and reports. I thank her for her belief in my potential in carrying out
the tasks involved. I consider it a privilege to have worked under her guidance.
I also owe my gratitude to Dr. Shankar Ram C. S. for his valuable inputs. His contribu-
tion could not have been substituted by anyone else. I also thank Dr. J. Murali Krishnan for
the constant support and encouragement that he has provided me throughout my academic
life at IITM. I take this opportunity to thank Akhilesh, Krishna, Siddharth and Anil for the
help offered by them in data acquisition and the development of the online version of the
framework. I would also like to acknowledge all the other project staff and students at the
Centre of Excellence in Urban Transportation, IIT Madras.
Friends have been an integral part of my stay here at IIT Madras. Life at IITM
would not have been complete without them. I thank all my friends and wing mates for
making my stay here at IIT Madras a memorable one.
Finally, I would like to thank my parents and my younger brothers for their enduring
support and unconditional love, without which this project would not have been possible.
ABSTRACT
KEYWORDS: Travel Time Prediction, Historical Trajectory Search, Kalman Filter,
V-clustering.
The present study developed a reliable system for real-time bus arrival/travel time
prediction under the heterogeneous traffic conditions that exist in India. The study differs
from (and is more challenging than) most previous studies, which involved homogeneous
traffic conditions. To accomplish this goal, a robust framework, namely Historical
Trajectory and Kalman Filter based Travel/Arrival Time Prediction (HTKFTP), is proposed
in this study. The proposed framework has two major components: (i) similar trajectory
search; and (ii) travel time prediction using similar trajectories. The data analysis
identified travel time correlations (between spatially close stretches of road) and other
temporal patterns in travel times, which were used to develop various schemes for the
selection of historical trajectories. The prediction algorithm based on the Kalman Filter
was also improved to account for the high variance in travel times at certain locations or
during certain times of the day. The proposed schemes were corroborated using real-world
GPS trajectory data collected from the Metropolitan Transport Corporation (MTC) buses
in Chennai.
TABLE OF CONTENTS
CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
NOTATION
1 INTRODUCTION
1.1 Motivation
1.2 Background
1.3 Research Overview
1.4 Chapter Outline
2 LITERATURE REVIEW
2.1 A Brief History of Traffic Prediction
2.2 Approaches Exploiting "Similarity"
2.3 Trajectory Patterns
3 DATA ANALYSIS
3.1 Raw Data
3.2 Data Transformation
3.2.1 Data Cleaning
3.2.2 Extracting Trip Data
3.2.3 Calculation of Segment-wise Travel Times
3.3 Correlation Between Segments
3.4 Travel Time Patterns
4 THE FRAMEWORK AND THE CLUSTERING ALGORITHM
4.1 Terms and Definitions
4.2 Problem Formulation
4.3 Overview of the Framework
4.4 Trajectory Search based on Passed Segments Scheme
4.5 The Clustering Algorithm
4.6 Nearest Neighbour Search in Passed Segments Scheme
4.7 Similarity based on Temporal Features
4.8 Summary
5 THE PREDICTION ALGORITHM
5.1 Travel Time Prediction using Kalman Filter
5.2 The Base KF Algorithm
5.3 Integration of Trajectory Search and Prediction Algorithms
5.4 Modifications
6 PERFORMANCE EVALUATION
6.1 Measures of Performance
6.2 Parameter Optimization in Passed Segment Scheme
6.2.1 Spatial Lag
6.2.2 Minimum Number of Trajectories (MNT) in a Cluster
6.3 Evaluation of the PS Scheme
6.4 Evaluation of the Weekday/Weekend Temporal Feature
6.5 Evaluation of the Temporal Neighbourhood Feature
6.6 Evaluation of the Base KF Algorithm for Prediction
6.7 Evaluation of the Adaptive KF Algorithm
6.8 Evaluation Summary
7 SUMMARY AND CONCLUSIONS
7.1 Summary
7.2 Conclusions
7.3 Scope for Further Research
A PYTHON CODE LISTING FOR CLUSTERING ALGORITHM
A.1 Method for creating clusters from similar trips
A.2 Auxiliary method for finding optimum splits in the clustering algorithm
A.3 Method for finding nearest neighbours from clusters
LIST OF TABLES
3.1 A sample of the raw data received from the GPS devices on the buses.
3.2 A sample of data records after transformation.
4.1 An example of segment-wise travel times on historical trajectories.
4.2 An example of partitioned segment-wise travel times after application of the
clustering algorithm.
LIST OF FIGURES
3.1 Pearson's correlation coefficients versus the segment distance.
3.2 Average correlation coefficient versus the segment distance.
3.3 Travel time analysis by hours of the day.
3.4 Comparison between weekday peak and weekday off-peak trips.
3.5 Correlations between travel times occurring in different hours of the day.
3.6 Comparison between weekday and weekend trips.
3.7 Comparison between the weekdays.
4.1 Overall architecture of the HTKFTP framework.
5.1 Variation of travel time variance across the segments of the 19B route.
6.1 Optimum values of parameters involved in the clustering algorithm.
6.2 Comparison of MAE for individual test trips before and after adding the PS
scheme to the naive method.
6.3 Comparison of MAE for individual test trips before and after adding the
weekday/weekend feature to the PS scheme.
6.4 Comparison of MAE for individual test trips before and after adding the
temporal neighbourhood feature.
6.5 Comparison of MAE for individual test trips before and after using the base
KF algorithm for prediction.
6.6 Comparison of MAE for individual test trips before and after using the
Adaptive KF algorithm.
6.7 Improvement of the mean MAE (over all the test trips) throughout the evolution
of the method.
6.8 Comparison between HTKFTP and the prediction method using static inputs
in KF.
ABBREVIATIONS
AI Artificial Intelligence
ANN Artificial neural networks
DTW Dynamic Time Warping
ED Euclidean Distance
GPS Global Positioning System
HTD Historical Trajectory Database
HTKFTP Historical Trajectory and Kalman Filter based Travel/Arrival Time Prediction
HTTP Historical Trajectory based Travel time Prediction
KF Kalman Filtering
k-NN k-Nearest Neighbors
LCSS Longest Common Subsequence
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
MLR Multivariate Linear Regression
MTC Metropolitan Transport Corporation (Chennai)
NNS Nearest Neighbour Search
RBMS Real-time Bus Status Monitoring
SARIMA Seasonal Autoregressive Integrated Moving Average
SVR Support Vector Regression
TTP Travel Time Prediction
NOTATION
Correlation coefficient between two variables
R_raw A raw route represented as a sequence of points, p_i
S_i A segment of road between two points, p_i and p_{i+1}
R A raw route represented as a sequence of segments, S_i
B_i The ith bus stop on a route
t_i Time taken to reach a point p_i on a route, starting from p_0
T_raw A raw trajectory represented as a sequence of pairs of the form (p_i, t_i)
t_i Actual time taken to cover a segment S_i
t_i Predicted travel time on S_i
AB_i Actual arrival time of the bus at the bus stop B_i
AB_ij The jth predicted arrival time at the bus stop B_i
T A trajectory represented as a sequence of t_i
ST_i List of historical travel times on segment S_i
SC_i List of clusters or intervals for S_i
CSii The ith cluster for S_i
T_curr The current (or incomplete or test) trajectory. Also denoted as T_test
wav_i The weighted average variance for a split at the ith element of a list.
a_i Travel time evolution factor from S_i to S_{i+1}
w_i Process disturbance in travel time evolution at S_i
z_i Measured travel time on S_i
v_i Measurement noise associated with S_i
Q_i Variance of the historical w_i for S_i
R_i Variance of the historical v_i for S_i
CHAPTER 1
INTRODUCTION
1.1 Motivation
With the ever-increasing number of vehicles on roads in urban areas, traffic congestion has
become one of the most serious problems facing society, especially commuters.
In India, the problem is more prominent in metropolitan cities such as Mumbai, New
Delhi and Chennai. One of the reasons people are shifting to private transportation is the
unreliability of the public transportation systems (Bende, 2012). Holeywell (2013) points
out that travellers care most about getting picked up from their stop in 10 minutes or less
so that they can make their scheduled connections, and that they are not so interested in
whether their rides are crowded or whether they can find a seat.
In today's busy society, information regarding the arrival time or travel time of transport
from one place to another is becoming more and more valuable. With a schedule of predicted
arrival times at each bus stop available via variable message signs (VMS) or as a mobile or
web application, people can make timely plans for their upcoming activities and business,
which reduces the anxiety caused by uncertain delays. Thus, there is a need for a system
that can inform travellers about the latest travel times of the concerned buses before they
make their transit plans. This may also attract more passengers to public transport, which
in turn can lead to less traffic congestion.
1.2 Background
Accurate estimation of travel times of public transportation has been a challenging research
problem that has remained open for the past thirty years in the transportation research
community (Abkowitz, 1981; Polus, 1978). A simple prediction approach is to adopt the average
travel time derived from historical data. However, a constant estimate of the travel
time for a path does not capture the dynamic traffic conditions very well. Thus,
advanced techniques for travel time estimation were proposed in the early literature (Ghosh
and Knapp, 1978; Oda, 1990; Nihan and Holmesland, 1980). Even though the specific
approaches adopted in these studies differ, they share a common idea: discovering
regular patterns in the historical data collected over time. Some proposed fitting
historical data to statistical models such as Gaussian models, Bayesian networks and Markov
chains in order to facilitate statistical analysis (Polus, 1978; Sumi et al., 1990). Techniques
based on regression models learn from historical data. They build regression
functions for estimating travel time in terms of various external factors (Polus, 1979;
Ghosh and Knapp, 1978). A prediction is made by using the known values of those factors
under the current situation as input. Techniques based on time series models focus on
discovering internal relationships among historical time-series data in order to identify
similar patterns and make predictions under the current situation (Oda, 1990; Nihan and
Holmesland, 1980). However, the performance of the above approaches is highly constrained
by the quality and quantity, as well as the types, of data available. For example,
conventional collection of traffic data is typically conducted by surveys or by using
expensive sensors deployed along the roads at specific locations to record arrival times,
traffic flow volumes, and other statistics of vehicles.
In recent years, with the advent of positioning and wireless communication technologies,
wireless devices equipped with Global Positioning System (GPS) have been widely
deployed on various private and public vehicles, generating massive amounts of vehicle
trajectory data which can be used for fleet management and other transportation applications.
Time-tagged location data, usually represented in the form of trajectories, bring great
potential for real-time prediction of vehicle travel times. Among public transportation
systems, the travel times of buses, which share roads with other vehicles, are more
difficult to predict than those of trains and subways, which run on exclusive paths. First,
the travel condition of a bus is easily affected by various internal and external factors,
including accidents, weather, road construction, government policies and even temperature.
Second, for vehicles in metropolitan areas (such as Chennai), errors often exist in
positional data acquisition due to interference from urban canopies and other sources. Thus,
in this study, we propose a hybrid prediction framework that estimates the travel time of
buses by exploiting selected historical trajectory data and an efficient state estimation
technique capable of making precise estimations from a series of travel time measurements.
1.3 Research Overview
Recently, research on discovering traffic patterns from historical data collected from
vehicles has received significant attention (Chen et al., 2011; Li and Rose, 2011; Tiesyte
and Jensen, 2009). These works show that traffic patterns exist in road segments and thus
could be used to predict the future traffic condition on the same segment and on a few
upcoming segments. This finding provides the basis for using similar trajectories to predict
the travel time of an ongoing bus journey. In this study, a new bus travel time prediction
framework, called Historical Trajectory and Kalman Filter based Travel/Arrival Time
Prediction (HTKFTP), is developed for real-time prediction of travel time on upcoming
segments (and thus the arrival time at bus stops) of an ongoing bus journey. The basic idea
behind HTKFTP is to use a collection of historical trajectories similar to the current bus
journey to predict the travel times on future segments of the journey. Specifically, the
HTKFTP framework (i) identifies a set of similar trajectories as the basis for travel time
estimation instead of relying on only the one historical trajectory best matching the
ongoing bus journey; (ii) explores different features (e.g., travel times of passed segments
as well as time/day of the bus trajectories) to identify the sample set of similar
trajectories; and (iii) uses the similar trajectories as inputs to the Kalman Filter based
prediction method.
Several issues were faced in the design of the HTKFTP framework. For example, many
features are associated with the trajectories. Some of these features are categorical while
the others are numerical. Discriminative features, and properly defined similarity functions
for those features, are needed to identify a sample set of similar trajectories that is
effective for travel time prediction. To determine a set of similar trajectories based on
travel time on passed segments, the V-clustering algorithm, which partitions the whole
spectrum of travel times on a segment into a number of intervals (or clusters), was
considered. To determine a set of similar trajectories based on hours/days, exploratory data
analysis involving space-time trajectory plots of the historical trips was carried out.
Accordingly, the HTKFTP framework is able to retrieve the sample set of similar trajectories
efficiently and in turn use that sample set to estimate the travel times. To corroborate the
proposed ideas and evaluate the proposed prediction schemes, an empirical evaluation using
real bus trajectory data collected in Chennai, India, was conducted. This research work has
made a number of contributions, as summarized below.
• A new framework, namely HTKFTP, for predicting the travel times over future segments
of an ongoing bus journey based on historical trajectory data. The framework
consists of two major components: (i) similar trajectory retrieval; and (ii) travel time
estimation.
• A detailed data analysis to investigate the correlation between bus travel times in
route segments and a number of trajectory features, e.g., passed segment travel time,
hours, days, etc. Based on our analysis, we select a number of trajectory features to
identify similar trajectories.
• A clustering algorithm for passed segment travel times and space-time trajectory
analysis in order to group similar trajectories together. These similar trajectory
clusters allow us to efficiently and effectively retrieve a sample set of trajectories
similar to the ongoing bus trajectory.
• An efficient state estimation technique based on Kalman Filter, capable of making
precise estimations by exploiting a series of travel time measurements in an inherent
feedback mechanism. The base estimation scheme was modified to take into account
the large variance in the data observed at selected locations/times.
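To make the clustering contribution listed above more concrete, the following is a minimal sketch, not the thesis's own Appendix A listing, of how one split point could be chosen when partitioning the travel times of a segment into intervals. It assumes that a split is scored by the weighted average variance of the two sides (the quantity the NOTATION list calls wav_i) and that the lowest score wins; the function name `best_split`, the use of population variance, and the sample travel times are illustrative assumptions.

```python
from statistics import pvariance


def best_split(travel_times):
    """Return (index, wav) of the split with the lowest weighted average variance.

    Assumption: values are sorted first and population variance is used;
    the thesis may define the split criterion differently.
    """
    values = sorted(travel_times)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two travel times to split")
    best_index, best_wav = None, float("inf")
    for i in range(1, n):  # left = values[:i], right = values[i:]
        left, right = values[:i], values[i:]
        # Weight each side's variance by the share of points it holds.
        wav = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        if wav < best_wav:
            best_index, best_wav = i, wav
    return best_index, best_wav


# Hypothetical travel times (seconds) on one segment: two visible groups.
times = [30, 32, 31, 60, 62, 61]
index, wav = best_split(times)
left, right = sorted(times)[:index], sorted(times)[index:]
```

Applying the same search recursively to each side, and stopping when a split no longer reduces the variance by a chosen margin, would yield a set of intervals; that stopping rule is likewise an assumption here.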
Through a comprehensive experimental study using a real data set collected from buses
in Chennai, India, the proposed ideas were validated. The framework was evaluated in
terms of prediction accuracy. The experimental results show that the proposed prediction
scheme significantly outperforms the baseline and state-of-the-art schemes.
1.4 Chapter Outline
The remainder of this report is organized as follows:
• Chapter 2 reviews the literature in the concerned area.
• Chapter 3 analyses the collected historical trajectory data.
• Chapter 4 gives an overview of the HTKFTP framework and details the design of the
similar trajectory selection algorithm.
• Chapter 5 details the prediction scheme and the real-time prediction system design.
• Chapter 6 reports a comprehensive experimental study using the collected real data
set of bus trajectories, queried in real-time.
• Chapter 7 concludes this work with a summary, followed by conclusions and scope
for future work.
CHAPTER 2
LITERATURE REVIEW
This chapter reviews the past research that has motivated our work on predicting the
movement of vehicles. We begin with a brief history of traffic prediction, and then review
the major research that has focused specifically on similarity-based prediction of
arrival/travel times and on trajectory patterns.
2.1 A Brief History of Traffic Prediction
Research in transportation dates back to the 1930s. With few vehicles on
roads and under-developed technologies, it was then almost impossible to collect significant
data about traffic conditions. Thus, studies during that time were mainly about identifying
rules that could be used to guide traffic management and the construction of
transportation infrastructure. For example, the relations between traffic volumes and the
weather were reported by Johnson (1930). This justified the improvement of road surfaces
during bad weather. As another typical example, Vey and Pope (1935) verified a definite
relationship between highway lighting and highway accidents: in general, where adequate
lighting is provided, there is a substantial reduction in night accidents.
With the development of technologies and the increasing number of vehicles on roads, more
data about traffic conditions could be collected, which led to the emergence of
research on traffic prediction in the 1950s. However, during this period, the traffic data
adopted in most cases were vehicle volumes (or flows) because they were easily collected by
hand. For example, in Glanville (1955), Lighthill and Whitham (1955) and Buckley (1968),
observers were placed at certain locations to record the number of vehicles passing by in
order to obtain the vehicle volume on a road. Such an approach was inefficient and made it
difficult to collect a large amount of data. Therefore, arrival/travel time prediction did
not arise until the 1970s (Wong and Sussman, 1973; Sussman et al., 1974), when traffic
sensors were widely adopted, giving researchers sufficient data for analysis.
Estimation of arrival/travel times, especially for buses, started to attract increasing
attention since the 1980s (Abkowitz, 1981; Polus, 1978; Sumi et al., 1990). As cities
developed, congestion became increasingly common, which created a need to improve the
quality of public transportation service. As the most important aspect of public
transportation service, arrival/travel time prediction became the most critical topic in
the traffic prediction area. In the early stage of research on this topic, constrained
by technologies, researchers had to work off-line on data collected from traffic sensors
and surveys. Since the development of GPS devices and wireless technologies, it has been
possible to collect large volumes of traffic-related data in real-time. Therefore,
real-time arrival/travel time prediction has become a hot topic as these technologies have
been widely applied in public transportation systems. Over the decades, researchers have
applied different models and methods to real-time arrival/travel time prediction. Zhu et al.
(2011) developed mathematical models taking into account the travel times on links, dwell
times at stops, and delays at intersections. The algorithm proposed in Lin and Zeng (2001)
provides real-time bus arrival information based on the bus location data, the schedule
information, the difference between scheduled and actual arrival times, and the waiting time
at time-check stops. Prediction methods based on historical data were also developed in
Tiesyte and Jensen (2008).
With the development of Artificial Intelligence, researchers have widely adopted Artificial
Intelligence methods in real-time arrival/travel time prediction. As a result, travel
time prediction approaches in the modern literature can be broadly classified as model based
and data driven. Model based approaches predict travel times using traffic flow models and
the underlying physical phenomena. For example, Krishnan and Polak (2008) explored
recurring themes in traffic conditions and used k-Nearest Neighbors (k-NN) for indirectly
predicting short term travel times using 15-minute aggregate flow data. Esawey and Sayed
(2011) used a VISSIM¹ micro-simulation model of downtown Vancouver to predict travel
times using traffic volume and travel time data of nearby segments. Kalman Filtering (KF)
is one of the most widely adopted methods in travel time prediction in the recent literature.
Vanajakshi et al. (2009) used a KF based method for predicting segment-wise travel times
(using travel times of the previous two vehicles) in the heterogeneous traffic conditions
prevalent in Indian cities such as Chennai. KF takes into account the stochastic properties
of the process disturbance and the measurement noise. It works well for short-term
prediction. Other notable works in KF include Xu et al. (2008), Shalaby et al. (2004) and
Zhu and Wang (2000). Westgate et al. (2013) used a Bayesian model for travel time estimation
of ambulances using GPS data.
¹ VISSIM is a microscopic traffic flow simulation package.
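As an illustration of how a Kalman Filter balances process disturbance against measurement noise, the sketch below runs one predict/update step for a scalar travel time. It is a generic textbook form under stated assumptions, not the algorithm of any cited work: the state model x_next = a * x + w and measurement model z = x_next + v are inferred from the symbols in the NOTATION list (a_i, w_i, z_i, v_i, Q_i, R_i), and the function name and numbers are hypothetical.

```python
def kalman_step(x_est, p_est, a, q, z, r):
    """One scalar Kalman filter step for a segment travel time.

    Assumed model (illustrative, not taken from the thesis text):
        x_next = a * x + w,   Var(w) = q   (travel time evolution)
        z      = x_next + v,  Var(v) = r   (measured travel time)
    Returns the prior estimate, posterior estimate and posterior variance.
    """
    # Predict: propagate the estimate and its error variance.
    x_prior = a * x_est
    p_prior = a * p_est * a + q
    # Update: weight the measurement by the Kalman gain.
    gain = p_prior / (p_prior + r)
    x_post = x_prior + gain * (z - x_prior)
    p_post = (1.0 - gain) * p_prior
    return x_prior, x_post, p_post


# Hypothetical numbers: 60 s estimate on the previous segment, factor 1.1,
# and a measured travel time of 70 s on the next segment.
x_prior, x_post, p_post = kalman_step(x_est=60.0, p_est=4.0, a=1.1, q=2.0, z=70.0, r=3.0)
```

A large measurement variance `r` pulls the posterior towards the model prediction, while a small `r` pulls it towards the measurement; this is the sense in which the filter accounts for both noise sources.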
Data driven approaches predict travel time using statistical relationships derived from
historical data (travel times, speeds, volumes, etc.). The most commonly
reported data driven approaches in the literature include machine learning techniques, time
series analysis and historical averaging approaches. In machine learning techniques, the
prediction model learns properties from several instances of historical data. For
example, Patnaik et al. (2004) used multivariate linear regression for bus arrival time
estimation using automatic passenger counter (APC) data. Artificial neural networks (ANNs)
are another widely used method. Liu et al. (2009) used neural networks to indirectly
predict travel times using traffic volume and flow data. ANNs have the major advantage that
they can model complex non-linear relationships. However, they are limited by extremely
long training times. Other notable works using ANNs include van Lint (2006), Zou et al.
(2008) and Batool and Khan (2005). Other machine learning methods have also become popular
in recent years; real-time prediction using Support Vector Regression (SVR) and Support
Vector Machines (SVM) has become a hot topic. For example, Wu et al. (2004) used SVR for
travel time prediction using highway traffic data. Vanajakshi and Rilett (2007), Vanajakshi
and Rilett (2004) and Yu et al. (2006) are other instances of the use of SVR. Like ANNs,
SVR is too expensive to train for real-time updates. In a time series analysis approach,
temporal patterns are identified in the historical data and future values are predicted on
the assumption that these patterns hold in the near future. For example, Guin (2006) used a
time series analysis approach called seasonal autoregressive integrated moving average
(SARIMA) to predict travel times using historical travel time data.
Though the model based approaches provide valuable insights into the mechanisms of
traffic flow and queue dynamics, their inherent limitations hinder their application in
real-time systems. The major disadvantages include high computational complexity, intensive
model/parameter calibration, the requirement to predict traffic demand/capacity and the
degree of expertise required for design and maintenance. On the other hand, data driven
approaches can be deployed more quickly and cheaply than model based approaches.
They can provide scope for prediction when there is a large diversity (or variance) in the
historical data. In such cases, predicting using physical models which are narrow in scope
can be expensive. In this study, a data driven approach is chosen for travel time prediction
exploiting similar historical trajectories, as explained in the following sections.
2.2 Approaches Exploiting "Similarity"
Since bus trips repeat on the same route, more or less around the same time, on different
days, a similarity-based approach is a straightforward way to predict future
travel times. A great amount of work has been done on identifying similar trajectories or
similar time series, in both one dimension and multiple dimensions. Yi and Faloutsos (2000)
proposed the Lp-norm, which yields the Manhattan Distance (p = 1) or Euclidean Distance
(p = 2), as a measure of similarity. The Lp-norm is widely applied in various applications
but is applicable only to time series of the same length. Therefore, other similarity
measures have been developed and adopted. Berndt and Clifford (1994) introduced Dynamic Time
Warping (DTW), which was adopted later in Assent et al. (2009) and Vlachos et al. (2006).
The concept of edit distance was introduced in Levenshtein (1966), and the most widely used
distance based on edit distance is the Longest Common Subsequence (LCSS) distance. Vlachos
et al. (2002), Fashandi and Moghaddam (2005) and Hermes et al. (2009) applied LCSS as the
distance measure to fetch similar trajectories or time series. However, these algorithms
tend to emphasize the overall similarity of the whole trajectory, without considering the
similarity of trajectories in individual segments or subsets of segments. Additionally,
while LCSS and DTW are applicable to trajectory data, they are highly sensitive to noise and
errors. In this project, a similarity measure of trajectories based on the similarity of
corresponding individual segments is proposed.
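The three families of similarity measures reviewed above can be sketched as follows. These are minimal textbook-style versions for one-dimensional series of travel times, written for illustration only; the function names, the absolute-difference local cost in DTW, and the matching tolerance `eps` in LCSS are assumptions rather than details taken from the cited papers.

```python
import math


def euclidean(a, b):
    """L2 distance; only defined for series of the same length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Dynamic Time Warping cost using absolute difference as local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def lcss_length(a, b, eps):
    """Length of the longest common subsequence, matching values within eps."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]
```

Unlike `euclidean`, both `dtw` and `lcss_length` accept series of different lengths, which is the limitation of the Lp-norm noted in the text. All three score the series as a whole rather than segment by segment.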
Recently, prediction methods based on historical trajectory data have also been developed
inJensen and Tie(2008),Tiesyte. and Jensen(2009) andTiesyte and Jensen(2008). The
authors show that the similarity between historical trajectories and current position data
of a bus can be exploited to predict bus arrival time at bus stations. This shares the same
intuition with the development of the Historical Trajectory based Travel time Prediction
(HTTP) framework in Lee et al. (2012). The present project can be said to be built on top
of HTTP with the inclusion of additional features for the selection of similar trajectories
and the use of Kalman Filter for prediction. In Jensen and Tie (2008), the authors developed
a system called TransDB, that searches the historical trajectory database for the most
similar trajectory based on the passed segments of the current bus trajectory in order to
make a good prediction. The basic idea is that, based on the proposed trajectory similarity
function, the nearest neighbourhood trajectory (NNT) and the trajectory of current bus ride
are anticipated to exhibit similar travelling behaviour (in terms of travel time). Based on
this assumption, the NNT serves as a good basis for predicting the future travel time of
current bus ride without explicitly taking into account various external and internal factors.
However, Lee et al. (2012) argue that the historical trajectory that is most similar to the
passed segments of the current bus trajectory alone may not provide the best prediction of
the on-going bus ride. Thus, they collect a set of similar trajectories and adopt a statisti-
cal approach to make predictions. Additionally, they exploit different features associated
with trajectories and develop different similarity functions to find similar trajectories that
make significantly more accurate travel time predictions. Our approach differs from HTTP
in that it is not a statistical approach. From the data analysis studies explained in
Chapter 4, it was found that, though the statistical approach provides satisfactory predictions
for a few upcoming segments, the predictions get worse when they are made for a time far
into the future (future segments far from the current one). It was observed that the historical
trajectories within a temporal neighbourhood of 30 minutes around the ongoing trajectory,
are more significant in improving the prediction accuracy for farther future segments. Addi-
tionally, with the error feedback mechanism inherent in the KF based prediction algorithm,
the accuracy tends to improve from one future segment to the next one during prediction.
2.3 Trajectory Patterns
Patterns of historical trajectories are described in two classes: trend and periodicity. The
trend represents a general systematic linear or non-linear component that changes over
time and does not repeat or at least does not repeat within the time range captured by data.
The periodicity represents the component that repeats itself in certain intervals of time. In
Wu et al. (2003), the authors display the daily periodicity from historical data of travel
times around the same location. Zhu et al. (2009) also conducted an analysis to verify
the existence of periodicity of speeds over time on a route segment. Chen et al. (2011)
and Li and Rose (2011) verify the pattern by measuring the correlation between the traffic
on a specific route in different time periods. Vanajakshi et al. (2009) analysed travel time
variation plots in heterogeneous traffic conditions, using GPS trajectory data from the buses
in Chennai, India. From their analysis, they concluded that the travel time patterns were
more related for consecutive vehicles (with a headway 15 min) on the same day. Weekly
and daily patterns were not as significant. Hence, they used the travel times
of the previous two vehicles for prediction. Similarly, Kumar and Vanajakshi (2012) used
a statistical test to check whether the previous trips on the same day, the same-time trips
on previous day(s), or the same-day/same-time trips in previous week(s) are significant in
predicting the travel times of an ongoing trip. The authors concluded that the
same-day/same-time trips of the previous two weeks and the previous three trips on the same
day were significant and could be included as inputs in the prediction model developed using
a simple exponential smoothing technique.
It is clear from the above studies that the travel time behaviour of vehicles moving
on fixed routes is not random. There exist significant patterns in travel times for trips made
around the same time of the day. Such patterns support the possibility of using historical
data of a certain segment to predict the future traffic condition on the same segment.
This forms the basis for the development of the HTKFTP framework introduced in Chapter
1. The next chapter discusses the various kinds of analyses carried out on real-world bus
trajectory data to explore possible patterns in travel times.
CHAPTER 3
DATA ANALYSIS
Several analyses were carried out to explore the correlations and patterns in the historical
trajectory data, which comprise segment-wise travel times. Our goal in data analysis is
two-fold:
1. To verify the suitability of using historical trajectories for prediction of future travel
times for an ongoing trajectory; and
2. To explore any patterns in travel time data that can be used for the prediction.
3.1 Raw Data
The raw GPS data used in this study were collected over a period of 4 months, from January
2014 to April 2014, from the Metropolitan Transport Corporation (MTC) buses running on
one of the busiest routes in Chennai, namely 19B, which connects Kelambakkam in the south
to Saidapet in central Chennai. Each bus is equipped with a GPS device that records the
status of the bus along with its movement and pushes the status to a central server every 10
seconds. Each data point consists of the GPS coordinates and the corresponding time-stamp,
as shown in Table 3.1. Each bus and each route have their own identifications. Location
details of each bus stop on a selected route are collected and stored. Each bus station has its
own name, GPS coordinates as well as the IDs of the routes it belongs to, so that all the
bus stations for each route can be found. For a specific route, a bus station has a sequence
number among all bus stations belonging to this route. In most cases, more than one bus
travels on a route, and each bus travels on a fixed route several times in a day.
Over the four months from January to April 2014, a total of 28 buses on the 19B route
completed 3,686 trajectories running back and forth. The north-bound 19B route, with ID 1101,
is chosen for analysis. This route has 15 stops, with the origin at the Kelambakkam Bus
Station and the last stop at Saidapet Bus Depot. It covers a distance of 29.4 kilometres
(i.e. 147 segments) and the average trip duration from the origin to the destination is about
4000 seconds. From January to April 2014, there are 2,212 north-bound trajectories in total
on this route.
Table 3.1: A sample of the raw data received from the GPS devices on the buses.
Timestamp             Longitude    Latitude
04-Apr-14 09:28:15    80.242317    13.005729
04-Apr-14 09:28:25    80.242317    13.005729
04-Apr-14 09:28:35    80.242241    13.005681
04-Apr-14 09:28:45    80.241928    13.005391
04-Apr-14 09:28:55    80.241828    13.004879
3.2 Data Transformation
From the raw data of time-stamp and latitude/longitude, other useful quantities such as
distance, cumulative distance, UNIX time, time difference and speed were calculated as
explained below. Distance (assuming straight line travel) between two consecutive GPS
locations of a bus was found out using the haversine formula as shown in Equation 3.1.
\[
D = R \cos^{-1}(a + b) \tag{3.1}
\]
where
\[
\begin{aligned}
R &= \text{radius of Earth} = 6371000\ \text{m (mean)} \\
a &= \cos\left(\tfrac{\pi}{2} - lat_1\right) \cos\left(\tfrac{\pi}{2} - lat_2\right) \\
b &= \sin\left(\tfrac{\pi}{2} - lat_1\right) \sin\left(\tfrac{\pi}{2} - lat_2\right) \cos(lon_1 - lon_2)
\end{aligned}
\]
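Equation 3.1 can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: it assumes coordinates in decimal degrees and reads a and b as products of cosines and sines of the colatitudes (pi/2 minus latitude), which makes the expression the spherical law of cosines. The names great_circle_distance and EARTH_RADIUS_M are illustrative, and the clamping step is an added safeguard that the report does not describe.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres, as in Equation 3.1


def great_circle_distance(lat1, lon1, lat2, lon2):
    """Distance in metres between two points given in decimal degrees."""
    colat1 = math.pi / 2 - math.radians(lat1)
    colat2 = math.pi / 2 - math.radians(lat2)
    dlon = math.radians(lon1 - lon2)
    a = math.cos(colat1) * math.cos(colat2)
    b = math.sin(colat1) * math.sin(colat2) * math.cos(dlon)
    # Clamp to [-1, 1]: rounding can push a + b marginally above 1 for
    # identical or nearly identical points, which would make acos fail.
    return EARTH_RADIUS_M * math.acos(max(-1.0, min(1.0, a + b)))
```

For GPS points logged 10 seconds apart the angle is very small, so a haversine-style formulation may be numerically better behaved; that is a general remark rather than something evaluated in this study.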
Table 3.2 shows sample transformed data. From the calculated distances, the corresponding
cumulative distance travelled till each GPS point was also calculated. The timestamp data,
initially in the format "dd-mm-yyyy HH:MM:SS" (a string), was converted into UNIX time
format[1]. This conversion speeds up several operations with the time-stamps. With the help
of these time-stamps, the time difference between each pair of consecutive GPS points was
calculated (column with heading t(s) in Table 3.2). Speed of the bus at a particular point
was calculated by dividing the corresponding value of distance by the time difference.
[1] The UNIX time form of a time-stamp is the number of seconds (an integer) passed since
00:00:00 hours, January 01, 1970 till the timestamp under consideration.
Table 3.2: A sample of data records after transformation.
UnixTime(s)   Lon(°)      Lat(°)      t(s)   Dist(m)     CumDist(m)   Speed(m/s)
1379390422    80.127693   12.923      10     43.1092     294.1599     4.31092
1379390432    80.127899   12.922989   10     22.384146   316.544      2.238414
1379390443    80.128448   12.92292    11     60.106274   376.6503     5.464206
1379390453    80.128997   12.922869   10     59.87249    436.5228     5.987249
1379390463    80.129661   12.922829   10     72.165902   508.688      7.21659
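The derived columns of Table 3.2 could be produced along the following lines. This is a sketch under stated assumptions: the time-stamp pattern is taken from the sample in Table 3.1, the strings are interpreted as UTC (the report does not state the time zone), and the function names to_unix and transform are hypothetical.

```python
from datetime import datetime, timezone


def to_unix(stamp, fmt="%d-%b-%y %H:%M:%S"):
    """Parse a time-stamp string and return integer seconds since 1970-01-01.

    Assumption: the string is interpreted as UTC.
    """
    parsed = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def transform(unix_times, distances):
    """Derive time difference, cumulative distance and speed per GPS point.

    distances[k] is the distance in metres from point k-1 to point k
    (distances[0] is ignored because the first point has no predecessor).
    """
    rows = []
    cumulative = 0.0
    for k in range(1, len(unix_times)):
        dt = unix_times[k] - unix_times[k - 1]
        cumulative += distances[k]
        speed = distances[k] / dt if dt > 0 else None  # undefined when dt is 0
        rows.append({"unix": unix_times[k], "dt": dt, "dist": distances[k],
                     "cum_dist": cumulative, "speed": speed})
    return rows
```

Working with integer UNIX times makes the time difference a plain subtraction, which is the speed-up the text refers to.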
3.2.1 Data Cleaning
There were several stray records in the raw data. These could be detected using the distance
and time difference values. Some records had distance more than 1000 metres in a 10
second interval which is impossible since the corresponding speed becomes more than 360
km/h. This may be because of errors in (or misplacement of) the longitude and latitude
values. A higher value of time difference implies the absence of several GPS logs. The
distance in these cases is also inaccurate (since we assumed straight line travel and the bus
might have undergone several changes in direction in a long time). Such data were detected
in an automated way and were not considered for analysis.
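A filter of the kind described above might look like the sketch below. The speed limit of 100 m/s corresponds to the 1000 metres in 10 seconds mentioned in the text; the maximum allowed gap of 30 seconds is a hypothetical value, since the report does not state the threshold that was used. The function name clean_records and the record layout are also assumptions.

```python
def clean_records(rows, max_speed_mps=100.0, max_gap_s=30):
    """Keep only records with a plausible time gap and implied speed.

    rows: list of dicts with keys "dt" (seconds) and "dist" (metres).
    max_gap_s is an assumed threshold for detecting missing GPS logs.
    """
    kept = []
    for row in rows:
        if row["dt"] <= 0 or row["dt"] > max_gap_s:
            continue  # duplicate time-stamp or several missing GPS logs
        if row["dist"] / row["dt"] > max_speed_mps:
            continue  # physically impossible jump in position
        kept.append(row)
    return kept
```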
3.2.2 Extracting Trip Data
The daily data files for each device included multiple trips made by that bus in that day. The
first task was to extract the data trip-wise and store each trip in a separate CSV file. Each
such trip file consisted of about 600 records, with the first record corresponding to the
departure of the bus from the origin bus station and the last one corresponding to the arrival
at the destination bus station. There were 3 - 4 trips made by each bus per day. Each trip file
was named in the following format: "IMEI_date_start time_direction.csv", where IMEI and
date are the IMEI number of the device and the date of the data, start time is the
timestamp of the first record of the file (i.e. departure time) and direction indicates whether
it is a north-bound trip or a south-bound trip.
After this extraction, the cumulative distance was updated for each trip separately. The
cumulative distance and the UNIX time were used for plotting the space-time trajectories
(cumulative distance vs. cumulative time) for further analysis as discussed in Section 3.4.
3.2.3 Calculation of Segment-wise Travel Times
For the calculation of travel time, the study routes were discretised into smaller segments
of length 200 metres each. These segments had fixed end points which were maintained
throughout the analysis for calculating the historical travel times. For a particular route,
these segmental travel times were stored in a grid layout where each column represented a
segment and each row represented a trip. Thus, a column consisted of the travel times on a
particular segment for all the trips over four months and a row consisted of the travel times
on all the segments of the route for a particular trip.
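One way to obtain a row of this grid from a trip's space-time trajectory is sketched below. It assumes that the time at which the bus crosses each 200 m boundary is found by linear interpolation between the two neighbouring GPS points; the report does not say how the crossing times were obtained, so this is an illustrative choice, and the function names are hypothetical.

```python
def crossing_time(cum_dist, times, target):
    """Interpolate the time at which cumulative distance reaches target."""
    if target <= cum_dist[0]:
        return float(times[0])
    for k in range(1, len(cum_dist)):
        if cum_dist[k] >= target:
            d0, d1 = cum_dist[k - 1], cum_dist[k]
            if d1 == d0:
                return float(times[k])
            frac = (target - d0) / (d1 - d0)
            return times[k - 1] + frac * (times[k] - times[k - 1])
    raise ValueError("target distance lies beyond the end of the trip")


def segment_travel_times(cum_dist, times, seg_len=200.0):
    """Travel time on each complete segment of length seg_len (one grid row)."""
    n_segments = int(cum_dist[-1] // seg_len)
    bounds = [crossing_time(cum_dist, times, i * seg_len)
              for i in range(n_segments + 1)]
    return [later - earlier for earlier, later in zip(bounds, bounds[1:])]
```

Stacking the lists returned for all trips of a route gives the trips-by-segments grid described above.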
3.3 Correlation Between Segments
Analysis was carried out to find correlations between the segments, in order to check whether
the choice of passed-segment travel times as a similarity measure for finding historical
trajectories is suitable for the prediction of future segment travel times. If a correlation
in terms of travel times exists between segments, previous segments can be taken as related to
later segments along the route. Given a current trajectory and a historical trajectory that is
similar to it in terms of passed segments, their future travel times are then also similar
with a high probability. We use Pearson's correlation as the tool to measure the correlation
between segments. The Pearson Product Moment Correlation (Pearson's correlation for short) is
widely used to measure the linear association between two variables. The value of the
Pearson's correlation coefficient always falls between -1 and 1. Positive values mean positive
correlations and negative values mean negative correlations. The farther the value from 0, the
stronger is the correlation. Given two variables $X$ and $Y$, with means $\bar{X}$ and
$\bar{Y}$ and standard deviations $\sigma_X$ and $\sigma_Y$, the correlation $\rho$ between
them is computed as
\[
\rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\,\sigma_X \sigma_Y} \tag{3.2}
\]
where $n$ is the number of elements in $X$ and $Y$. The farther two segments are from each
other, the weaker the influence of one on the other is expected to be.
Figure 3.1 can be used to read off the Pearson's correlation for any two segments as well
as its trend with distance. The Y-axis values are the Pearson's correlation coefficients
between two travel time arrays (historical travel times corresponding to two segments) and
the X-axis shows the number of segments between them, which is termed the Segment-Distance.
For example, given a Pearson's correlation between segment 20 and segment 25,
a corresponding point is drawn on the figure with the X-value being 5 (which is 25 minus
20). However, such a figure does not offer a clear illustration of the change of Pearson's
correlation because there are too many points for each X-axis value. To solve this problem,
we plotted Figure 3.2, which represents the average value of all points for each X-value.
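The averaging behind Figure 3.2 can be sketched as follows, assuming the travel times are held in a grid with one row per trip and one column per segment. The function names are illustrative, and the sketch assumes no column is constant (a constant column would make the standard deviation zero).

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson's correlation of two equal-length sequences (Equation 3.2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cross / ((n - 1) * sd_x * sd_y)


def mean_correlation_by_distance(grid):
    """Average correlation for each Segment-Distance (0, 1, 2, ...)."""
    columns = list(zip(*grid))  # one tuple of travel times per segment
    by_distance = defaultdict(list)
    for i in range(len(columns)):
        for j in range(i, len(columns)):
            by_distance[j - i].append(pearson(columns[i], columns[j]))
    return {d: sum(v) / len(v) for d, v in sorted(by_distance.items())}
```

A Segment-Distance of 0 pairs a segment with itself, so the averaged curve starts at 1.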
As shown in Figure 3.1, a Pearson's correlation commonly exists between arbitrary pairs of
segments. However, the correlation does not appear to be high for most pairs of
segments. Specifically, when two segments are near each other, the Pearson's correlation
is markedly higher than for other pairs. Therefore a segment is more related to
nearby segments than to farther ones. Figure 3.2 shows a clear declining curve, starting
from 1, along the X-axis. Based on this, it can be concluded that the segments closest to the
one being analysed are the most correlated with it and can be used as inputs for prediction.
Figure 3.1: Pearson's correlation coefficients versus the segment distance.
Figure 3.2: Average correlation coefficient versus the segment distance.
3.4 Travel Time Patterns
The second goal of data analysis was to explore any pattern inside the data that could be
used for the prediction. Intuitively, travel times of a segment should not only be related to
those of nearby segments, but also to other segment-specific or traffic-related parameters.
For example, in a city area, the traffic conditions are usually the worst during the peak
hours in the morning and evening. Therefore, we can associate the travel times with a temporal
feature. Similarly, the travel times on the same segment may differ between weekdays and
weekends. On weekdays, the travel time may be higher than on weekends.
The present study analyses the two most common of those patterns, namely the day-wise
pattern and the time-of-the-day pattern. During peak hours in the morning and evening,
congestion occurs with a high probability. Therefore the travel time of a segment may be
high during peak hours and low in off-peak hours. To visualize the travel time variation
within a day, within-day travel times are grouped into 14 time periods[2] of 1 hour each.
Figure 3.3 shows the variations in travel times along a day for two typical segments, namely
Segment 28 and Segment 100. For each of these segments, travel times are assigned to
the selected 14 bins according to the hour in which they occurred. The Y-axis represents
the travel time in seconds. For each box plot, the thick line in the middle of the box is the
median. The upper edge and lower edge of the box are the 75th and 25th percentiles of the
data, respectively. Some data regarded as outliers are shown as bubbles (outside the upper
and lower fences[3]).
[2] The usual working hours of the MTC buses.
[3] The upper fence (end point of dotted line) is calculated as Median + 1.5(IQR) and the
lower fence as Median - 1.5(IQR), where IQR = the inter-quartile range, i.e., the difference
between the 75th and 25th percentile values of the data.
(a) Variation of travel times on Segment 28 across the hours of the day
(b) Variation of travel times on Segment 100 across the hours of the day
Figure 3.3: Travel time analysis by hours of the day
It can be seen that the travel times in the morning from 8 am to 10 am and in the evening
from 5 pm to 7 pm are relatively higher than at other times. It can be expected that travel
times on a segment occurring in peak hours are more similar to those in other peak hours, and
travel times in off-peak hours are also likely to be similar to each other. As a general
rule, travel times which occurred around the same time of the day are more similar to
each other. This forms the basis for the temporal neighbourhood scheme introduced in
Chapter 4. According to the scheme, historical trajectories which occurred within a fixed
temporal neighbourhood (of 30 minutes or 1 hour) of the test trajectory are more reliable
for prediction than those outside the neighbourhood.
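The hourly grouping and the box-plot quantities used in Figure 3.3 can be sketched as below. The input layout (pairs of seconds since midnight and travel time) and the function name are assumptions made for illustration. The fences follow the definition given in the footnote of this section (median plus or minus 1.5 times the IQR), and the "inclusive" quantile method is an assumed choice, since the report does not say how the percentiles were computed.

```python
from collections import defaultdict
from statistics import median, quantiles


def hourly_box_stats(samples):
    """Group travel times by hour of day and compute box-plot statistics.

    samples: iterable of (seconds_since_midnight, travel_time_s) pairs.
    """
    bins = defaultdict(list)
    for seconds, travel_time in samples:
        bins[int(seconds // 3600)].append(travel_time)
    stats = {}
    for hour, values in sorted(bins.items()):
        if len(values) < 2:
            continue  # quantiles() needs at least two data points
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        mid = median(values)
        iqr = q3 - q1
        stats[hour] = {"median": mid, "q1": q1, "q3": q3,
                       "upper_fence": mid + 1.5 * iqr,
                       "lower_fence": mid - 1.5 * iqr}
    return stats
```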
Figure 3.4: Comparison between weekday peak and weekday off-peak trips.
Figure 3.4 shows the space-time trajectories[4] for all the 2,212 trajectories. The blue
trajectories occurred in the peak hours whereas the green ones occurred in the off-peak
hours during the weekdays. Clearly, the peak hour trajectories have more variance than
those in off-peak hours.
[4] A plot of the cumulative distance against the cumulative time taken to cover that distance.
Figure 3.5 shows a heat-map that represents the correlation matrix which was obtained
by binning all the historical travel times on Segment 28 into 14 bins (corresponding to the 14
working hours in a day) and calculating the Pearson's correlation coefficients among them.
The diagonal squares are all white (correlation = 1) since these represent the correlation of
one bin with itself. It is clear from the heat-map that the squares closer to the diagonal are
whiter than those away from the diagonal. This means that the historical travel times which
occurred temporally closer (within a radius of 1-2 hours) to each other are more correlated
with each other. This conclusion forms the basis of the temporal neighbourhood feature for
the selection of similar historical trajectories, as discussed in Section 4.7.
Figure 3.5: Correlations between travel times occurring in different hours of the day
Travel times are not only related to the hour in which they occur, but also to the day on
which they occur. To verify the correlations between travel times and the day, we classified
the days into 2 classes, namely weekday and weekend. As Figure 3.6 indicates, travel times
on weekdays have higher variance than those on weekends. Thus, the assumption of taking
weekday/weekend as a discriminative feature for trajectory selection is also valid. A similar
analysis across the different days of the week is shown in Figure 3.7; the days are not
distinctly different from each other and hence were not analysed separately.
Figure 3.6: Comparison between weekday and weekend trips.
From the above analyses, it can be concluded that several patterns exist in the travel
times of buses moving on the same route. In the present case, the weekday/weekend pat-
tern and the intra-day hourly pattern are the most significant. Based on these patterns, two
schemes based on the temporal features of the trajectories are proposed in Chapter 4. From
the correlation analysis, it was concluded that the correlation between closely spaced seg-
ments is significant. This forms the basis for the passed segments scheme proposed in the
next chapter.
Figure 3.7: Comparison between the weekdays.
CHAPTER 4
THE FRAMEWORK AND THE CLUSTERING
ALGORITHM
Through the data analysis presented in Chapter 3, we observed the correlations between
the segment travel times and the various trajectory features. Cluster analysis was
adopted for the identification of the most correlated trips and is discussed in this chapter
along with the other schemes based on temporal features. Using the identified trips as input,
a novel travel time prediction framework, called Historical Trajectory and Kalman Filter
based Arrival/Travel Time Prediction (HTKFTP), based on a large collection of historical
bus trajectories, is developed. This chapter focuses on the historical trajectory selection
part, whereas the prediction algorithm is discussed in detail in Chapter 5. The section below
defines the terminology used in this framework.
4.1 Terms and Definitions
Since the buses are travelling on fixed routes, the geometrical routes in a two-dimensional
space can be represented in a one-dimensional space, where the position of each point on
the route is the distance from the start of the route. A route can be considered as consisting
of points on it and in a classical way, we choose a series of points to represent a route. In
our case, each point on the route is at a distance from the origin which is a multiple of
200 meters (along the route), so that the entire route is split into segments of 200 meters
length. This choice, of having smaller segments to represent the route, was made to closely
capture the pattern of segment-wise travel times along the route for a particular journey.
The various terms used in this study, are defined below.
1. A raw route $R^{raw}$ is represented as a sequence of points, $\langle p_0, p_1, \ldots, p_n \rangle$.
Each point $p_i$ stands for the starting point of the $i$th segment and its value denotes the
total distance along the route from the starting point of the route to the end of the
$(i-1)$th segment. Thus, $p_i < p_{i+1}$. $n$ is the total number of points on the route
including the origin and destination.
2. A segment $S_i$ is a part of a route between two adjacent points $p_i$ and $p_{i+1}$.
3. A route $R$ is represented by a sequence of segments, $\langle S_0, S_1, \ldots, S_{n-1} \rangle$.
The value of $S_i$ denotes $p_{i+1} - p_i$. Our goal is to predict the arrival times at the
bus stops. Each bus stop has a latitude and longitude which lies on the route.
4. A route is also represented by a sequence of bus stops, $\langle B_0, B_1, \ldots, B_m \rangle$.
The value of $B_i$ denotes $S_0 + \ldots + S_{l-1} + d$ if the bus stop $B_i$ lies on the
segment $S_l$. Here, $d$ is the distance along the route from $p_l$ (end point of segment
$S_{l-1}$) to the location of bus stop $B_i$. The trajectory data of a bus journey consists
of a series of time-stamped locations of the bus on the route.
5. A raw trajectory $T^{raw}$ is represented as a sequence $\langle (p_0, t_0), \ldots, (p_n, t_n) \rangle$.
$p_i \in R$ and $t_i$ denotes the travel time from $p_0$ to $p_i$.
6. A trajectory $T$ is a sequence $\langle \Delta t_0, \ldots, \Delta t_N \rangle$.
$\Delta t_i$ denotes the travel time on $S_i$ and $N$ ($= n - 1$) is the number of segments on
the route. During a complete trajectory, a bus generates travel times on all the segments on
the route. Therefore, given $M$ historical trajectories, there are $M$ travel times for each
segment.
7. For a segment $S_i$, there is a corresponding sequence of travel times,
$\langle \Delta t^{S_i}_0, \ldots, \Delta t^{S_i}_{M-1} \rangle$.
$\Delta t^{S_i}_j$ is the travel time of a bus on this segment in the $(j+1)$th trajectory and
$M$ is the number of historical trajectories. This sequence is denoted as $ST_i$ for $S_i$.
For a particular route, all the historical trajectory data can be stored in a table format in
which the columns represent the attributes of the trips, such as start time from the origin,
date of trip and the travel times on each segment, whereas the rows (or records) represent
the individual trajectories. As discussed later in this chapter, the historical travel times on
a particular segment are clustered into smaller groups so as to minimize the within-cluster
variance for each group.
8. Given a sequence of historical travel times $ST_i$ for a segment $S_i$, it can be split into
a sequence of intervals (or clusters) $SC_i := \langle C^{S_i}_0, \ldots, C^{S_i}_{K-1} \rangle$.
$K$ is the number of travel time clusters for $S_i$. Note that $K$ is a random variable which
depends on the variance of the historical travel times on the segment. The process from $ST_i$
to $SC_i$ is explained later in this chapter.
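The relation between a raw trajectory, a trajectory and the per-segment sequence of historical travel times can be illustrated with a short sketch. The function names are hypothetical, and the sample numbers are taken from the first two rows of Table 4.1 with 200 m segments assumed.

```python
def to_trajectory(raw):
    """Convert a raw trajectory [(p0, t0), ..., (pn, tn)] into the list of
    travel times on the successive segments (differences of the cumulative
    times at adjacent points)."""
    return [t1 - t0 for (_, t0), (_, t1) in zip(raw, raw[1:])]


def segment_series(trajectories, i):
    """Historical travel times on segment i, one value per trajectory."""
    return [trajectory[i] for trajectory in trajectories]


# Cumulative times at the points 0 m, 200 m, 400 m and 600 m of one trip.
raw = [(0, 0), (200, 20), (400, 175), (600, 238)]
history = [to_trajectory(raw), [32, 89, 61]]
```

Here to_trajectory(raw) gives [20, 155, 63] and segment_series(history, 1) gives [155, 89].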
4.2 Problem Formulation
Consider a bus route $R := \langle S_0, \ldots, S_{N-1} \rangle$ with $N$ segments. For a bus
travelling on segment $S_i$, its current (incomplete) trajectory $T^{curr}$ can be represented
as a sequence of travel times on the passed segments, i.e.
$T^{curr} := \langle \Delta t^{curr}_0, \ldots, \Delta t^{curr}_i \rangle$ ($0 \le i < N$).
Let $d$ be the distance of the bus from $p_{i+1}$ (end point of $S_i$) along the route. Suppose
the bus stop $B_j$ at which to predict the bus arrival time lies on $S_l$ ($l > i$), and let
$d'$ be the distance of $B_j$ from $p_l$ (start point of $S_l$).
Given $M$ historical trajectories and $T^{curr}$, we aim to develop an effective framework to
predict the travel times $\Delta t_i, \ldots, \Delta t_l$ on the segments
$S_i, \ldots, S_l$. The arrival time of the bus at $B_j$ is given by
\[
A_{B_j} = T + \frac{d}{S_i}\,\Delta t_i + \Delta t_{i+1} + \ldots + \Delta t_{l-1} + \frac{d'}{S_l}\,\Delta t_l \tag{4.1}
\]
where $T$ is the current time and $A_{B_j}$ is the arrival time at $B_j$.
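Equation 4.1 translates directly into a few lines of Python. In this sketch the argument names are illustrative: d_remaining is the distance left on the current segment, d_stop is the distance of the bus stop from the start of the segment it lies on, and predicted holds the predicted travel time for each segment of the route.

```python
def arrival_time(now, predicted, seg_len, i, l, d_remaining, d_stop):
    """Arrival time at a bus stop on segment l for a bus on segment i (l > i).

    predicted[k]: predicted travel time on segment k (seconds)
    seg_len[k]:   length of segment k (metres)
    """
    eta = now + (d_remaining / seg_len[i]) * predicted[i]  # rest of segment i
    eta += sum(predicted[k] for k in range(i + 1, l))       # full segments
    eta += (d_stop / seg_len[l]) * predicted[l]             # part of segment l
    return eta
```

For example, with 200 m segments, predicted travel times of 20, 30, 40 and 50 seconds, a bus 100 m from the end of segment 0 and a stop 50 m into segment 3, the arrival time is 10 + 30 + 40 + 12.5 = 92.5 seconds after the current time.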
4.3 Overview of the Framework
In this section, we first provide an overview of the proposed HTKFTP system framework
and then discuss the details of the cluster analysis (in the passed segments scheme) carried
out for pattern identification. Figure 4.1 shows the system design of the HTKFTP
framework. As illustrated, the proposed HTKFTP system (i.e., a location based service)
continuously collects bus trajectory data from GPS-equipped buses, which report the latest
bus status including time-stamped geographical coordinates of the bus and instantaneous speed.
The HTKFTP server is responsible for receiving and storing the trajectory data, monitoring
the incomplete trajectories of on-going buses, and predicting bus travel times on
the routes in response to (i) passenger enquiries and (ii) real time updates of bus arrival
times at bus stops. As shown in Figure 4.1, the HTKFTP server consists of three modules:
a) Real-time Bus Status Monitoring (RBSM) module; b) Travel Time Prediction (TTP)
module; and c) Nearest Neighbour Search (NNS) module.
Figure 4.1: Overall architecture of the HTKFTP framework
The RBSM module is responsible for communicating with the buses to receive bus
status information and GPS data updates of the on-going trajectories. Once an update from
a bus $b$ reaches the server, RBSM captures the status (such as current bus coordinates and
new time stamp) of $b$, extracts features associated with the developing trajectory $T_b$, and
stores the information as part of $T_b$ in the historical trajectory repository.
The TTP module is responsible for predicting the arrival times of buses at bus stops,
which can be reduced to a problem of predicting the travel times of buses on their remaining
route segments. As mentioned, the TTP module can be invoked to make predictions by
(i) a passenger enquiry; or (ii) the real-time updates of bus arrival information at stops. The
former arrives on demand and the latter happens periodically. In this study, for simplicity,
we focus on predicting the travel time of a bus, given its current location, on the remaining
segments of its journey on the bus route. Moreover, instead of constantly making predictions,
we assume that TTP is invoked every time RBSM receives an update that the bus has
crossed a segment (including the GPS data of bus location) and passes the required input
parameters for prediction to TTP. Our idea behind the TTP module is to use a few best
matches for the ongoing trajectory as inputs to a Kalman Filter, which efficiently predicts
the travel times on the future segments by employing a robust mechanism.
As illustrated in the figure, TTP relies on NNS module to search for similar trajectories
effectively and efficiently. As there could be different ways to identify the sample set of
similar trajectories, different notions of similarity could be explored to ensure the effective-
ness of TTP. On the other hand, with a massive amount of historical data, it is infeasible to
make exhaustive comparison between the trajectory of current bus journey against all the
historical trajectories in the database. To ensure the search efficiency, we create indices of
trajectories and related patterns in the NNS module to avoid retrieval of irrelevant
trajectories that are not helpful for our travel time estimation. In other words, we only
fetch a relatively small set of candidate trajectories and return them to TTP.
The HTKFTP is introduced above as a general framework to support travel time prediction.
The remaining task is to devise similar-trajectory based prediction schemes which
first invoke the NNS module to retrieve a sample set of trajectories for making effective
travel time estimation in the TTP module. Based on our data analysis, we observed the
travel time correlation between two segments and the travel time patterns corresponding to
some temporal features such as hours and days. Therefore, we follow these observations
to introduce two schemes based on passed segments and temporal features. As their names
suggest, these two schemes use the passed segments (PS) and temporal features (TF) of an
on-going bus journey, respectively, to identify similar trajectories for prediction.
4.4 Trajectory Search based on Passed Segments Scheme
In the PS scheme, the prediction is done by finding the historical trajectories "similar" to
the current one in terms of the travel times on the segments already crossed by the moving bus.
Thus a similarity measuring algorithm has to be taken into consideration. As mentioned
before, the conventional algorithms for measuring similarity of time series, such as Lp-norm,
Dynamic Time Warping (DTW) and Longest Common Subsequence (LCSS), are not appropriate
in this project because those algorithms are highly sensitive to any error or outlier
in the data. As a result, a slight variation in the collected data might result in dramatic
mismatches between the current trajectory and historical trajectories. In addition, these
algorithms only evaluate the overall similarity of the whole trajectory, rather than the
similarity of trajectories with respect to each segment. To address the above problems, we
propose a new similarity measure that takes into account the similarity between two
trajectories on each segment. Given two trajectories,
$\langle \Delta t_0, \ldots, \Delta t_n \rangle$ and
$\langle \Delta t'_0, \ldots, \Delta t'_n \rangle$, we compare each pair of travel times
$\Delta t_i$ and $\Delta t'_i$. If the difference between each pair is less
than a threshold specific to that segment, the two trajectories are considered "similar". This
method improves on the conventional distance measure algorithms in that, for two "similar"
trajectories, not only the whole trajectories but also the corresponding segments should be
similar.
However, this method is limited by the low efficiency that is caused by searching for
similar travel times based on each segment, especially when the number of historical tra-
jectories is large. To overcome this, we can allocate travel times into clusters and match the
current travel time to the cluster averages to find the appropriate cluster.
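The segment-wise similarity test described in this section can be sketched as follows. The threshold values and the function names are hypothetical, since the report does not give the thresholds at this point; the historical travel times are taken from the first five rows of Table 4.1. The sketch performs the direct comparison against every historical trajectory, which is the search that the clustering step is meant to speed up.

```python
def is_similar(current, historical, thresholds):
    """True if, on every passed segment, the two travel times differ by
    less than the threshold specific to that segment."""
    return all(abs(c - h) < limit
               for c, h, limit in zip(current, historical, thresholds))


def similar_trajectories(current, history, thresholds):
    """IDs of historical trajectories similar to the current partial one.

    history maps a trajectory ID to its list of segment travel times; only
    the segments already passed by the current bus are compared.
    """
    passed = len(current)
    return [tid for tid, times in history.items()
            if is_similar(current, times[:passed], thresholds)]


history = {
    "Trajectory1": [20, 155, 63, 29],
    "Trajectory2": [32, 89, 61, 33],
    "Trajectory3": [15, 262, 55, 37],
    "Trajectory4": [93, 90, 73, 21],
    "Trajectory5": [68, 75, 77, 26],
}
matches = similar_trajectories([22, 150], history, thresholds=[15, 20])
```

With these assumed thresholds only Trajectory1 matches: Trajectory2 is close on Segment0 but differs by 61 seconds on Segment1.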
To better illustrate this problem, we provide the following example. For a specific route,
given a number of historical trajectories on this route, we can create a table with attributes
(columns) corresponding to the travel times of each segment of the route and a record
corresponding to each historical trajectory (Table4.1).
Table 4.1: An example of segment-wise travel times on historical trajectories.
Trajectory ID   Segment0   Segment1   Segment2   Segment3
Trajectory1     20         155        63         29
Trajectory2     32         89         61         33
Trajectory3     15         262        55         37
Trajectory4     93         90         73         21
Trajectory5     68         75         77         26
...             ...        ...        ...        ...
TrajectoryM     t0         t1         t2         t3
By using the clustering algorithm that will be discussed next, we partition each column
into several non-overlapping ranges. Each range contains at least one value and each value
falls in only one range. Table 4.1 can thus be transformed into a table as shown in Table 4.2,
where the number of ranges for each segment is not necessarily equal.
Table 4.2: An example of partitioned segment-wise travel times after application of the
clustering algorithm.
Segment0   Segment1   Segment2   Segment3
15;20      75;89;90   55         21
32         155;180    61;63      26;29
68         262        69;73      33;36;37
82;93      -          77         -
4.5 The Clustering Algorithm
In this section we describe the algorithm used to partition each sequence of travel time
values, $ST_i$, into a sequence of clusters, $SC_i$. Since $ST_i$ is a sequence of numerical
values, it can be represented in a one-dimensional space, and splitting such a sequence
amounts to allocating a set of one-dimensional data into clusters. Lee et al. (2012) compared
two robust clustering algorithms, namely K-means and V-clustering. As they pointed out in
their work, the K-means algorithm has two limitations. Firstly, the initial cluster centroids
are chosen randomly and different choices may produce different clustering results. Secondly,
there is no common guideline for determining the value of K, so it is hard to choose a good
value. Using experiments with real-world data, they also found that V-clustering performs
better than K-means (in the passed segment scheme). So, in this study, we concentrate only on
the V-clustering algorithm.
The V-clustering algorithm was introduced by Yuan et al. (2010) to allocate a sorted
list of one-dimensional data into clusters. In this algorithm, a list of values is first sorted.
Then it is split into clusters in an iterative manner. At each iteration, the list is split into two
parts and the weighted average variance (WAV) is calculated for the resulting child lists.
The optimum split is the one that minimizes the WAV of the resulting child lists. The
WAV for a split at the $i$th element of the list is defined in Equation 4.2.
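$$\mathrm{WAV}_i = \frac{|L_i^1|}{|L|}\,\mathrm{Var}(L_i^1) + \frac{|L_i^2|}{|L|}\,\mathrm{Var}(L_i^2) \qquad (4.2)$$
where $|L|$ is the cardinality of the list being split, $|L_i^1|$ and $|L_i^2|$ are the
cardinalities of the resulting child lists for the $i$th split, and $\mathrm{Var}(L_i^1)$ and
$\mathrm{Var}(L_i^2)$ are their respective variances. The list is recursively partitioned so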
that the running time of the clustering algorithm for a segment with M historical travel
times becomes O(log M). Hence, the running time for the entire trajectory database is
O(N log M) (which is fast), where N is the total number of segments on the route. The
iteration is stopped when each cluster is left with a minimum number of travel times (or
minimum number of trajectories, MNT), which is a tunable parameter (i.e., its value is
chosen to strike a balance between minimizing the prediction errors and maximizing the
computational speed). Each cluster for a segment is associated with a cluster average, i.e.,
the average of all the travel times in it. The values of the various parameters of
the clustering algorithm are selected after experiments with real-world data, as discussed
in Chapter 6.
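The recursive WAV-based splitting can be sketched as follows. This is only an illustration under stated assumptions: the stopping rule is interpreted as "split a list only if both children keep at least `mnt` values", population variance is used for Var, and the names `wav` and `v_cluster` are hypothetical. The input values are the Segment1 entries of Table 4.2.

```python
from statistics import mean, pvariance

def wav(values, i):
    """Weighted average variance for splitting sorted `values` into values[:i] and values[i:]."""
    left, right = values[:i], values[i:]
    n = len(values)
    return len(left) / n * pvariance(left) + len(right) / n * pvariance(right)

def v_cluster(values, mnt):
    """Recursively split travel times so that every cluster keeps at least `mnt` values."""
    if mnt < 1:
        raise ValueError("mnt must be at least 1")
    values = sorted(values)
    # Candidate split points that leave at least `mnt` values on each side.
    candidates = range(mnt, len(values) - mnt + 1)
    if len(candidates) == 0:
        return [values]          # too small to split further
    best = min(candidates, key=lambda i: wav(values, i))
    return v_cluster(values[:best], mnt) + v_cluster(values[best:], mnt)

segment1 = [155, 89, 262, 90, 75, 180]
clusters = v_cluster(segment1, mnt=2)
averages = [mean(c) for c in clusters]   # cluster averages used later for matching
print(clusters)  # [[75, 89, 90], [155, 180, 262]]
```

With a smaller `mnt` the same list is split further, which is the trade-off between prediction error and computational speed mentioned above.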
4.6 Nearest Neighbour Search in Passed Segments Scheme
Given a current trajectory $\langle t_0, t_1, t_2 \rangle$ with the passed segments $S_0$,
$S_1$ and $S_2$, suppose $t_2$ falls in a certain cluster for $S_2$; we take this cluster as
the match for $t_2$. All trajectories whose travel times on $S_2$ fall in the matching cluster
are marked. Matching is usually done by finding the cluster for the particular segment whose
cluster average is closest to $t_2$ (which is the current trajectory's actual travel time on
$S_2$). The same operation is applied to $S_1$ and $S_0$, and we can then find the trajectories
whose travel times on the three passed segments fall in all the matched clusters. This method,
known as Segment filtering, was introduced by Lee et al. (2012). Since the historical
trajectories' travel times are similar to those of the current trajectory on each of the three
passed segments, they can be considered "similar" to the current trajectory and can be used
for prediction. However, the Segment filtering method
has a limitation when the number of historical trajectories is small (i.e.
searched to find the match.
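A minimal sketch of Segment filtering as described in this section is given below. The helper names are hypothetical; the history is the first five rows of Table 4.1, the clusters are the first three columns of Table 4.2, and the current travel times are invented for illustration.

```python
from statistics import mean

def matching_cluster(clusters, t):
    """Pick the cluster whose average is closest to the observed travel time t."""
    return min(clusters, key=lambda c: abs(mean(c) - t))

def segment_filter(history, clusters_per_segment, current):
    """Return IDs of trajectories lying in the matched cluster on every passed segment."""
    similar = set(history)
    for seg, t in enumerate(current):
        cluster = matching_cluster(clusters_per_segment[seg], t)
        lo, hi = min(cluster), max(cluster)   # clusters are non-overlapping ranges
        similar &= {tid for tid, times in history.items() if lo <= times[seg] <= hi}
    return similar

history = {
    "Trajectory1": [20, 155, 63, 29],
    "Trajectory2": [32, 89, 61, 33],
    "Trajectory3": [15, 262, 55, 37],
    "Trajectory4": [93, 90, 73, 21],
    "Trajectory5": [68, 75, 77, 26],
}
clusters_per_segment = [
    [[15, 20], [32], [68], [82, 93]],        # Segment0
    [[75, 89, 90], [155, 180], [262]],       # Segment1
    [[55], [61, 63], [69, 73], [77]],        # Segment2
]
current = [30, 92, 60]                       # assumed travel times on S0, S1, S2
print(segment_filter(history, clusters_per_segment, current))  # {'Trajectory2'}
```

Because the result is an intersection over all passed segments, it can easily become empty when few historical trajectories are available.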
4.7 Similarity based on Temporal Features
Besides the passed segment (PS) scheme, we also propose a scheme that uses features inside
the historical data that are directly related to the travel time. Using similar trajectories found
by the PS scheme, we can provide satisfactory predictions. However, this method cannot
guarantee accuracy under all circumstances. For example, when unusual events happen
on a future segment, it is hard to make a reliable prediction from historical data because such
events might never have happened in the history (limited by the amount of historical data
collected). Fortunately, by resorting to features related to traffic information on the current
segment, we can make predictions of travel times by first selecting trajectories that match
the temporal features and then using the PS scheme. For example, the time when a bus
enters a segment is important because the traffic changes with the time of the day. During
peak hours in the morning and evening, congestion occurs with a high probability. Also,
intuitively, the segment-wise travel times on two trajectories that are close together in time
may bear a high correlation with each other. As shown in Figure 3.6 (Chapter 3), the weekday
and weekend trips have different variances in their space-time trajectories. Hence, in the
final hybrid scheme, in order to make predictions for an ongoing trip, the day on which it is
occurring is first used to select weekday or weekend trajectories. On this set of trajectories,
the TF (temporal neighbourhood) and PS schemes are applied in sequence to find the final set
of similar trajectories.
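The first two stages of the hybrid scheme (day-type selection followed by the temporal neighbourhood) can be sketched as below. The 30-minute window, the `entry` field and the function names are assumptions made for illustration; the PS scheme would then be applied to the returned trips.

```python
from datetime import datetime

def is_weekend(ts: datetime) -> bool:
    return ts.weekday() >= 5          # Monday=0, ..., Saturday=5, Sunday=6

def minutes_of_day(ts: datetime) -> int:
    return ts.hour * 60 + ts.minute

def temporal_filter(history, now, window_min=30):
    """Keep trips of the same day type whose entry time of day lies within the window."""
    keep = []
    for trip in history:
        entry = trip["entry"]
        if is_weekend(entry) != is_weekend(now):
            continue                  # weekday and weekend trips are never mixed
        gap = abs(minutes_of_day(entry) - minutes_of_day(now))
        gap = min(gap, 1440 - gap)    # wrap around midnight
        if gap <= window_min:
            keep.append(trip)
    return keep

history = [
    {"id": "a", "entry": datetime(2014, 3, 5, 8, 30)},   # Wednesday morning
    {"id": "b", "entry": datetime(2014, 3, 8, 8, 45)},   # Saturday morning
    {"id": "c", "entry": datetime(2014, 3, 6, 17, 0)},   # Thursday evening
]
now = datetime(2014, 3, 12, 8, 40)                       # Wednesday morning
selected = [trip["id"] for trip in temporal_filter(history, now)]
print(selected)  # ['a']
```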
4.8 Summary
The final refined set of similar trajectories that results from the application of the hybrid
scheme introduced above is used for the prediction of travel times on the upcoming segments
of the ongoing trajectory. Experiments to support our claim that the hybrid scheme is
more effective in prediction than the individual ones were carried out with real-world data,
as discussed in Chapter 6 on performance evaluation. The prediction algorithm based on the
Kalman Filter and the modifications made to the base Kalman algorithm are explained in
detail in the next chapter.
CHAPTER 5
THE PREDICTION ALGORITHM
In Chapter 4, we explored the various schemes to search for similar historical trajectories
that are effective for travel time prediction (TTP). The problem now is to use the travel
times of the identified similar trajectories to predict for the current trajectory. The simplest
way to predict the travel time on an upcoming segment of the current trip is to use the mean
(or median) of all the travel times from the identified similar trajectories on the
corresponding upcoming segment. Another way is to give weights to the individual trajectories
before calculating the mean. The weight given to a similar trajectory can be the inverse square
of its Euclidean distance¹ from the ongoing trajectory, as explained in Larose (2005).
However, in this study, we focus on a robust, short-term prediction technique based on the
Kalman Filter (KF), which can take into account the associated variability to a certain extent.
Before moving on, we review some previous work involving the Kalman Filter for travel time
prediction, including studies attempted in the heterogeneous traffic conditions prevalent in
India.
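The weighted-mean alternative mentioned above can be sketched as follows. This is a minimal illustration, not the method adopted in this study; the function names are hypothetical, the historical rows come from Table 4.1, and the current travel times are invented.

```python
import math

def euclidean_distance(current, historical):
    """Distance over the segments already crossed in the current trip."""
    return math.sqrt(sum((c - h) ** 2 for c, h in zip(current, historical)))

def weighted_prediction(current, similar, next_segment):
    """Inverse-square-distance weighted mean of travel times on `next_segment`."""
    if not similar:
        raise ValueError("at least one similar trajectory is required")
    num = den = 0.0
    for hist in similar:
        d = euclidean_distance(current, hist)
        if d == 0:
            return float(hist[next_segment])   # exact match; avoids division by zero
        w = 1.0 / d ** 2
        num += w * hist[next_segment]
        den += w
    return num / den

current = [20, 150]                            # assumed travel times on S0 and S1
similar = [[20, 155, 63, 29], [32, 89, 61, 33]]
prediction = weighted_prediction(current, similar, next_segment=2)
print(round(prediction, 2))
```

The closer trajectory dominates the result, so the prediction stays near its travel time of 63 on Segment2.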
5.1 Travel Time Prediction using Kalman Filter
The Kalman Filter dates back to 1960, when Kalman (1960) published
his famous paper describing a recursive solution to the discrete-data linear filtering problem.
The Kalman filter is a set of mathematical equations that provides an efficient computational
(recursive) means to estimate the state of a process, in a way that minimizes the mean of
the squared error. As mentioned in Welch and Bishop (2006), the filter is very powerful in
several aspects: it supports estimations of past, present, and even future states, and it can
do so even when the precise nature of the modelled system is unknown.
¹ Square root of the sum of squares of the differences between the corresponding segment travel
times of the historical and the current trajectory (up to the segments crossed in the current trip).
In the literature on travel time prediction, Chein and Kuchipudi (2002), Liu et al. (2006),
Nanthawichit et al. (2003), Chen and Chein (2001) and Yang (2005) are some of the earliest
to introduce KF. Nanthawichit et al. (2003) and Yang (2005) explored the possibility of
using GPS probe vehicle data in KF for travel time prediction. Vanajakshi et al. (2009)
is one of the earliest attempts that used KF with GPS probe vehicle data for short-term
travel time prediction under heterogeneous traffic conditions such as those prevalent in
India. From their travel time variation plots (across the route), the authors concluded that
the travel time patterns along the route were more closely related for consecutive vehicles
(with a headway of 15 min) on the same day. Weekly and daily patterns were not as significant.
Hence, they used the travel times of the previous two vehicles for predicting
the travel time of the test vehicle (the ongoing trip). However, when the headways between
the consecutive vehicles are larger (1 hour), the accuracy of the approach decreases. This
is a serious issue when the previous vehicles passed during an off-peak hour and the test
vehicle passes in a peak hour, or vice versa. Since the inputs to KF in this method are fixed
most of the time, there is no means to restore the accuracy once one of the above-mentioned
issues creeps in. Hence, there was a need to modify the method so that it uses dynamic
inputs for prediction in order to reflect the traffic condition prevailing at the moment. This
is where the similar trajectory search discussed in Chapter 4 can be helpful. Based on the
latest actual travel times of the test vehicle, the trajectory search algorithm finds all the
historical trajectories which occurred under traffic conditions similar to the current one. In
Chapter 6, we show using real-world data that the dynamic input method outperforms the static
input method. In the following section, we discuss the base KF algorithm as described in
Vanajakshi et al. (2009). In the subsequent sections, we discuss the changes made
to both the base KF algorithm and the trajectory search algorithm to effectively integrate
them for travel time prediction.
5.2 The Base KF Algorithm
It is assumed that the evolution of travel time between the various segments is governed by
$$t_{i+1} = a_i t_i + w_i \qquad (5.1)$$
where $t_i$ is the travel time taken for covering $S_i$ (the $i$th subsection), $a_i$ is a
parameter that relates the travel time taken in $S_i$ to the travel time taken in $S_{i+1}$,
and $w_i$ is the process disturbance associated with $S_i$. The measurement process was
assumed to be governed by
$$z_i = t_i + v_i \qquad (5.2)$$
where $z_i$ is the measured travel time in $S_i$ and $v_i$ is the measurement noise. It was
further assumed that $w_i$ and $v_i$ are zero-mean white Gaussian noise signals, with $Q_i$
and $R_i$ being their corresponding variances.
The prediction algorithm requires as input at least two trajectories in the form of
segment-wise travel times. The trajectory which is more similar to the current one is called
the base trajectory (denoted by $T_{base}$) and the other one is called the correction
trajectory (denoted by $T_{corr}$). The data obtained from $T_{base}$ are used to obtain the
value of $a_i$ for each subsection. The data from $T_{corr}$ are used in the prediction
algorithm to obtain the estimate of the travel time of the test (or ongoing) trajectory
(denoted by $T_{test}$). The steps involved in the algorithm are as follows:
1. The travel time data from $T_{base}$ are used to obtain the value of $a_i$ through
$a_i = t^{T_{base}}_{i+1} / t^{T_{base}}_i$, $i = 1, \ldots, (N-1)$, where
$t^{T_{base}}_i$ is the travel time taken in $T_{base}$ to cover $S_i$.
2. The discretisation is carried out over space rather than over time (as is done in
traditional applications of the KF). Let $t^{T_{test}}_i$ denote the travel time taken in
$T_{test}$ to cover $S_i$. It is assumed that $E[t^{T_{test}}_1] = \hat{t}_1$ and
$E[(t^{T_{test}}_1 - \hat{t}_1)^2] = P_1$, where $\hat{t}_1$ is the estimate of the travel
time in $T_{test}$ on $S_1$.
3. For $i = 2, \ldots, (N-1)$, the following steps are performed:
(a) The a priori estimate of the travel time is calculated using
$\hat{t}^-_{i+1} = a_i \hat{t}^+_i$, where the superscript $-$ denotes the a priori
estimate and the superscript $+$ denotes the a posteriori estimate.
(b) The a priori error variance (denoted by $P^-$) is calculated using
$P^-_{i+1} = a_i P^+_i a_i + Q_i$.
(c) The Kalman gain (denoted by $K$) is calculated using
$K_{i+1} = P^-_{i+1} / (P^-_{i+1} + R_{i+1})$.
(d) The a posteriori travel time estimate and error variance are calculated using,
respectively, $\hat{t}^+_{i+1} = \hat{t}^-_{i+1} + K_{i+1}[z_{i+1} - \hat{t}^-_{i+1}]$ and
$P^+_{i+1} = [I - K_{i+1}] P^-_{i+1}$, where the data measured from $T_{corr}$ are used to
provide the values of $z_{i+1}$ in the equation for $\hat{t}^+_{i+1}$.
Thus, the objective here is to predict the travel times of $T_{test}$ using the travel time
data obtained from $T_{base}$ and $T_{corr}$. When $T_{test}$ is in $S_i$, its travel time
for $S_{i+1}$, which is denoted by $\hat{t}_{i+1}$, is predicted. The KF algorithm works like
a predictor-corrector algorithm. The a posteriori estimate of $t_i$ of $T_{test}$ is used to
obtain the a priori estimate of $t_{i+1}$ (this being the prediction step), and then the
measurement of the travel time of $T_{corr}$ in $S_{i+1}$ (which is denoted by $z_{i+1}$ in
step 3(d) above) is used to obtain the a posteriori estimate of $t_{i+1}$ of $T_{test}$ (this
being the correction step). In the following section, we discuss the modifications made to the
trajectory search algorithm and the base KF algorithm in order to integrate them and to tackle
a few issues concerned with the variance of travel times.
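The predictor-corrector recursion of this section can be sketched for the scalar case as below. Several choices here are assumptions made for illustration: the loop starts at the first segment so that every later segment receives an estimate, $Q_i$ and $R_i$ are held constant as `q` and `r`, and the function name and numeric inputs are hypothetical.

```python
def kalman_predict(t_base, z_corr, t1_hat, p1, q, r):
    """Scalar Kalman filter over segments (space-discretised).

    t_base : travel times of the base trajectory, used to compute a_i
    z_corr : travel times of the correction trajectory, used as measurements
    t1_hat, p1 : initial estimate and error variance on the first segment
    q, r : process and measurement noise variances (assumed constant)
    Returns one (a priori, a posteriori) pair per segment after the first.
    """
    if len(t_base) != len(z_corr):
        raise ValueError("both trajectories must cover the same segments")
    t_post, p_post = t1_hat, p1
    estimates = []
    for i in range(len(t_base) - 1):
        a = t_base[i + 1] / t_base[i]                      # step 1
        t_prior = a * t_post                               # step 3(a)
        p_prior = a * p_post * a + q                       # step 3(b)
        k = p_prior / (p_prior + r)                        # step 3(c)
        t_post = t_prior + k * (z_corr[i + 1] - t_prior)   # step 3(d)
        p_post = (1 - k) * p_prior
        estimates.append((t_prior, t_post))
    return estimates

t_base = [20, 155, 63, 29]   # Trajectory1 of Table 4.1 as the base trajectory
z_corr = [32, 89, 61, 33]    # Trajectory2 of Table 4.1 as the correction trajectory
for prior, post in kalman_predict(t_base, z_corr, t1_hat=25.0, p1=1.0, q=1.0, r=4.0):
    print(f"a priori {prior:7.2f}   a posteriori {post:7.2f}")
```

A larger `r` pulls the a posteriori estimate towards the a priori value, while `r = 0` makes it equal to the measurement from the correction trajectory.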
5.3 Integration of Trajectory Search and Prediction algorithms
As discussed in the previous section, the KF-based algorithm needs only the two best
matched trajectories for travel time prediction. Based on the actual travel times received
in real time from the test vehicle, the trajectory search algorithm finds similar historical
trajectories with travel time patterns matching that of the current one. The task now is to
rank the matched trajectories based on some metric and send the top two to the prediction
algorithm. To accomplish this, for each matched trajectory, its Euclidean distance from the
test trajectory is computed using the equation
$$ED = \sqrt{(t^{T_{test}}_1 - t^{T_{hist}}_1)^2 + (t^{T_{test}}_2 - t^{T_{hist}}_2)^2 + \cdots + (t^{T_{test}}_m - t^{T_{hist}}_m)^2} \qquad (5.3)$$
where $ED$ is the Euclidean distance between the test trajectory and a matched historical
trajectory, $t^{T_{test}}_i$ is the travel time on $S_i$ for the test vehicle, $t^{T_{hist}}_i$
is the travel time on $S_i$ for a matched historical trip and $m$ is the number of segments
crossed by the test vehicle when the request is made. The above Euclidean distance gives a
measure of similarity between two trajectories with respect to their individual segment travel
times. The matched trajectories are now ranked in increasing order of their EDs from the test
trajectory. The top two are sent to the prediction algorithm. As the test vehicle moves from
one segment to the next, with the newly available actual travel time of the test vehicle, the
trajectory search algorithm again finds the best matches from history, ranks them and sends
the top two to the prediction algorithm, which updates the previous predictions with more
accurate ones, thus making the process dynamic in nature.
Figure 5.1: Variation of travel time variance across the segments of 19B route
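The ranking step of Section 5.3 can be sketched as follows. The function names are hypothetical, the matched trajectories are the first five rows of Table 4.1, and the test travel times are invented; the sketch assumes that the closest match serves as the base trajectory and the second closest as the correction trajectory.

```python
import math

def euclidean_distance(test_times, hist_times):
    """Equation 5.3 over the m segments already crossed by the test vehicle."""
    m = len(test_times)
    return math.sqrt(sum((test_times[i] - hist_times[i]) ** 2 for i in range(m)))

def top_two(test_times, matched):
    """Rank matched trajectories by increasing ED and return (base_id, corr_id)."""
    if len(matched) < 2:
        raise ValueError("the KF needs at least two matched trajectories")
    ranked = sorted(matched, key=lambda tid: euclidean_distance(test_times, matched[tid]))
    return ranked[0], ranked[1]

matched = {
    "Trajectory1": [20, 155, 63, 29],
    "Trajectory2": [32, 89, 61, 33],
    "Trajectory3": [15, 262, 55, 37],
    "Trajectory4": [93, 90, 73, 21],
    "Trajectory5": [68, 75, 77, 26],
}
test_times = [25, 95]          # assumed travel times of the test vehicle on S1 and S2 (m = 2)
base_id, corr_id = top_two(test_times, matched)
print(base_id, corr_id)        # Trajectory2 Trajectory5
```

Calling `top_two` again each time a new segment travel time arrives reproduces the dynamic re-ranking described above.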
5.4 Modifications
High variances in travel times during certain periods of the day and on certain segments,
leading to higher prediction errors on selected trips or segments, were the main issue faced
by the existing algorithm. As can be seen in the box plots in Figure 3.3 (Chapter 3), during
the peak hours, besides the median travel time (thick line inside the box), the variance of
travel times also increases (indicated by the increased height of the box). Figure 5.1
shows that the variance of travel times is also high on certain segments of the route. Each
line in the plot is obtained by calculating the variance of travel times on each segment for
the trips that occurred in a two-hour band in the history.
To address the high variance (to some extent), the $Q_i$ and $R_i$ values which represent
7/22/2019 B Tec