b tech project thesis

Upload: rakesh-kumar-behera

Post on 08-Feb-2018


  • 7/22/2019 B Tech Project Thesis

    1/69

    DATA ANALYTICS BASED

    DYNAMIC PASSENGER INFORMATION SYSTEM

    A Project Report

    submitted by

    RAKESH BEHERA

    in partial fulfilment of the requirements

    for the award of the degree of

    BACHELOR OF TECHNOLOGY

    TRANSPORTATION DIVISION

    DEPARTMENT OF CIVIL ENGINEERING

    INDIAN INSTITUTE OF TECHNOLOGY MADRAS

    MAY 2014


    CERTIFICATE

    This is to certify that the project report titled Data Analytics Based Dynamic Passenger

    Information System, submitted by Rakesh Behera, to the Indian Institute of Technology,

    Madras, for the award of the degree of Bachelor of Technology, is a bona fide record of

    the research work done by him under my supervision. The contents of this report, in full or

    in parts, have not been submitted to any other Institute or University for the award of any

    degree or diploma.

    Dr. Lelitha Devi V.

    Project Guide

    Associate Professor

    Dept. of Civil Engineering

    IIT-Madras, 600 036

    Prof. Meher Prasad A.

    Head of the Department

    Professor

    Dept. of Civil Engineering

    IIT-Madras, 600 036

    Place: Chennai

    Date: 19th May 2014


    ACKNOWLEDGEMENTS

    My earnest thanks to Dr. Lelitha Devi, for her support throughout the study. It is through her

    guidance that the project has gained structure and been accomplished in such a short span

    of time. Her foresight and expertise have helped me make the right choices in the project

    and otherwise. I am thoroughly indebted to her for the amount of time she has spent in

    reviewing my analyses and reports. I thank her for her belief in my potential in carrying out

    the tasks involved. I consider it a privilege to have worked under her guidance.

    I also owe my gratitude to Dr. Shankar Ram C. S. for his valuable inputs. His contribu-

    tion could not have been substituted by anyone else. I also thank Dr. J. Murali Krishnan for

    the constant support and encouragement that he has provided me throughout my academic

    life at IITM. I take this opportunity to thank Akhilesh, Krishna, Siddharth and Anil for the

    help offered by them in data acquisition and the development of the online version of the

    framework. I would also like to acknowledge all the other project staff and students at the

    Centre of Excellence in Urban Transportation, IIT Madras.

    Friends have been an integral part of my stay here at IIT Madras. Life at IITM

    cannot be complete without them. I thank all my friends and wing mates for making my

    stay here at IIT Madras a memorable one.

    Finally, I would like to thank my parents and my younger brothers for their enduring

    support and unconditional love, without which this project would not have been possible.


    ABSTRACT

    KEYWORDS: Travel Time Prediction, Historical Trajectory Search, Kalman Fil-

    ter, V-clustering.

    The present study developed a reliable system for real-time bus arrival/travel time predic-

    tion under heterogeneous traffic conditions that exist in India. The study is different from

    (and more challenging than) most of the previous studies which involved homogeneous

    traffic conditions. To accomplish the above goal, a robust framework, namely Historical

    Trajectory and Kalman Filter based Travel/Arrival Time Prediction (HTKFTP), is proposed

    in this study. The proposed framework has two major components: (i) similar trajectory

    search; (ii) travel time prediction using similar trajectories. Through the data analysis

    performed, travel time correlations (between spatially close stretches of road) and other

    temporal patterns in travel times were identified, which were used for the development of

    various schemes for the selection of historical trajectories. The prediction algorithm based

    on Kalman Filter was also improved to account for the high variance in travel times at certain

    locations or during certain times of the day. The proposed schemes were corroborated

    using real-world GPS trajectory data collected from the Metropolitan Transport Corpora-

    tion (MTC) buses in Chennai.


    TABLE OF CONTENTS

    CERTIFICATE i

    ACKNOWLEDGEMENTS ii

    ABSTRACT iii

    LIST OF TABLES vii

    LIST OF FIGURES viii

    ABBREVIATIONS ix

    NOTATION x

    1 INTRODUCTION 1

    1.1 Motivation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    1.3 Research Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.4 Chapter Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

    2 LITERATURE REVIEW 6

    2.1 A Brief History of Traffic Prediction . . . . . . . . . . . . . . . . . . . 6

    2.2 Approaches Exploiting "Similarity" . . . . . . . . . . . . . . . . . . . 9

    2.3 Trajectory Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    3 DATA ANALYSIS 12

    3.1 Raw Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    3.2 Data Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    3.2.1 Data Cleaning. . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    3.2.2 Extracting Trip Data . . . . . . . . . . . . . . . . . . . . . . . 14

    3.2.3 Calculation of Segment-wise Travel Times . . . . . . . . . . . 15


    3.3 Correlation Between Segments . . . . . . . . . . . . . . . . . . . . . . 15

    3.4 Travel Time Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    4 THE FRAMEWORK AND THE CLUSTERING ALGORITHM 23

    4.1 Terms and Definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    4.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    4.3 Overview of the Framework . . . . . . . . . . . . . . . . . . . . . . . 25

    4.4 Trajectory Search based on Passed Segments Scheme . . . . . . . . . . 27

    4.5 The Clustering Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . 29

    4.6 Nearest Neighbour Search in Passed Segments Scheme . . . . . . . . . 30

    4.7 Similarity based on Temporal Features . . . . . . . . . . . . . . . . . . 31

    4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    5 THE PREDICTION ALGORITHM 32

    5.1 Travel Time Prediction using Kalman Filter . . . . . . . . . . . . . . . 32

    5.2 The Base KF Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . 33

    5.3 Integration of Trajectory Search and Prediction algorithms . . . . . . . 35

    5.4 Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

    6 PERFORMANCE EVALUATION 38

    6.1 Measures of Performance . . . . . . . . . . . . . . . . . . . . . . . . . 38

    6.2 Parameter Optimization in Passed Segment Scheme . . . . . . . . . . . 39

    6.2.1 Spatial lag . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

    6.2.2 Minimum Number of Trajectories (MNT) in a Cluster . . . . . 40

    6.3 Evaluation of the PS scheme . . . . . . . . . . . . . . . . . . . . . . . 41

    6.4 Evaluation of the Weekday/Weekend Temporal Feature . . . . . . . . . 41

    6.5 Evaluation of the Temporal Neighbourhood Feature . . . . . . . . . . . 42

    6.6 Evaluation of the base KF Algorithm for Prediction . . . . . . . . . . . 42

    6.7 Evaluation of the Adaptive KF Algorithm . . . . . . . . . . . . . . . . 44

    6.8 Evaluation Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

    7 SUMMARY AND CONCLUSIONS 47

    7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    7.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


    7.3 Scope for Further Research . . . . . . . . . . . . . . . . . . . . . . . . 48

    A PYTHON CODE LISTING FOR CLUSTERING ALGORITHM 49

    A.1 Method for creating clusters from similar trips . . . . . . . . . . . . . . 49

    A.2 Auxiliary method for finding optimum splits in the clustering algorithm 51

    A.3 Method for finding nearest neighbours from clusters. . . . . . . . . . . 51


    LIST OF TABLES

    3.1 A sample of the raw data received from the GPS devices on the buses. . 13

    3.2 A sample of data records after transformation. . . . . . . . . . . . . . . 14

    4.1 An example of segment-wise travel times on historical trajectories. . . . 28

    4.2 An example of partitioned segment-wise travel times after application of

    clustering algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . 29


    LIST OF FIGURES

    3.1 Pearson's correlation coefficients versus the segment distance. . . . . . 16

    3.2 Average correlation coefficient versus the segment distance.. . . . . . . 17

    3.3 Travel time analysis by hours of the day . . . . . . . . . . . . . . . . . 18

    3.4 Comparison between weekday peak and weekday off-peak trips. . . . . 19

    3.5 Correlations between travel times occurring in different hours of the day 20

    3.6 Comparison between weekday and weekend trips. . . . . . . . . . . . . 21

    3.7 Comparison between the weekdays. . . . . . . . . . . . . . . . . . . . 22

    4.1 Overall architecture of the HTKFTP framework . . . . . . . . . . . . . 26

    5.1 Variation of travel time variance across the segments of 19B route . . . 36

    6.1 Optimum values of parameters involved in the clustering algorithm. . . 40

    6.2 Comparison of MAE for individual test trips before and after adding the PS

    scheme to the naive method. . . . . . . . . . . . . . . . . . . . . . . . 42

    6.3 Comparison of MAE for individual test trips before and after adding the

    weekday/weekend feature to the PS scheme. . . . . . . . . . . . . . . . 43

    6.4 Comparison of MAE for individual test trips before and after adding the

    temporal neighbourhood feature. . . . . . . . . . . . . . . . . . . . . . 43

    6.5 Comparison of MAE for individual test trips before and after using the base

    KF algorithm for prediction. . . . . . . . . . . . . . . . . . . . . . . . 44

    6.6 Comparison of MAE for individual test trips before and after using the

    Adaptive KF algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . 45

    6.7 Improvement of the mean MAE (over all the test trips) throughout the evo-

    lution of the method. . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    6.8 Comparison between HTKFTP and the prediction method using static in-

    puts in KF.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


    ABBREVIATIONS

    AI Artificial Intelligence

    ANN Artificial neural networks

    DTW Dynamic Time Warping

    ED Euclidean Distance

    GPS Global Positioning System

    HTD Historical Trajectory Database

    HTKFTP Historical Trajectory and Kalman Filter based Travel/Arrival Time Prediction

    HTTP Historical Trajectory based Travel time Prediction

    KF Kalman Filtering

    k-NN k-Nearest Neighbors

    LCSS Longest Common Subsequence

    MAE Mean Absolute Error

    MAPE Mean Absolute Percentage Error

    MLR Multivariate Linear Regression

    MTC Metropolitan Transport Corporation (Chennai)

    NNS Nearest Neighbour Search

    RBMS Real-time Bus Status Monitoring

    SARIMA Seasonal Autoregressive Integrated Moving Average

    SVR Support Vector Regression

    TTP Travel Time Prediction


    NOTATION

    Correlation coefficient between two variables

    $R_{raw}$ A raw route represented as a sequence of points, $p_i$'s

    $S_i$ A segment of road between two points, $p_i$ and $p_{i+1}$

    $R$ A raw route represented as a sequence of segments, $S_i$'s

    $B_i$ The $i$th bus stop on a route

    $t_i$ Time taken to reach a point $p_i$ on a route, starting from $p_0$

    $T_{raw}$ A raw trajectory represented as a sequence of pairs of the form $(p_i, t_i)$

    $t_i$ Actual time taken to cover a segment $S_i$

    $t_i$ Predicted travel time on $S_i$

    $AB_i$ Actual arrival time of the bus at the bus stop $B_i$

    $AB_{ij}$ The $j$th predicted arrival time at the bus stop $B_i$

    $T$ A trajectory represented as a sequence of $t_i$'s

    $ST_i$ List of historical travel times on segment $S_i$

    $SC_i$ List of clusters or intervals for $S_i$

    $CS_{ii}$ The $i$th cluster for $S_i$

    $T_{curr}$ The current (or incomplete or test) trajectory. Also denoted as $T_{test}$

    $wav_i$ The weighted average variance for a split at the $i$th element of a list.

    $a_i$ Travel time evolution factor from $S_i$ to $S_{i+1}$

    $w_i$ Process disturbance in travel time evolution at $S_i$

    $z_i$ Measured travel time on $S_i$

    $v_i$ Measurement noise associated with $S_i$

    $Q_i$ Variance of the historical $w_i$'s for $S_i$

    $R_i$ Variance of the historical $v_i$'s for $S_i$


    CHAPTER 1

    INTRODUCTION

    1.1 Motivation

    With the ever-increasing number of vehicles on roads in urban areas, traffic congestion has

    become one of the most serious problems facing society, especially commuters.

    In India, the problem is more prominent in the metropolitan cities such as Mumbai, New

    Delhi, Chennai, etc. One of the reasons people are shifting to private transportation is the

    unreliability of the public transportation systems (Bende, 2012). Holeywell (2013) points

    out that travellers care most about getting picked up from their stop in 10 minutes or less

    to be able to make their scheduled connections. It also points out that travellers are not

    so interested in whether their rides are crowded or whether they can find a seat.

    In today's busy society, information regarding the arrival time or travel time of transport

    from one place to another is becoming more and more valuable. With a schedule of predicted

    arrival times at each bus stop available via VMS or as a mobile or web application, people

    can make timely plans for their upcoming activities and business, which will reduce their

    anxiety caused by uncertain delays. Thus, there is a necessity for a system that can inform

    the travellers about the latest travel times of the concerned buses before they make their

    transit plans. This may also attract more passengers to use public transport, which in turn

    can lead to less traffic congestion.

    1.2 Background

    Accurate estimation of travel times of public transportation has been a challenging research

    problem that has remained open for the past thirty years in the transportation research

    community (Abkowitz, 1981; Polus, 1978). A simple prediction approach is to adopt the average

    travel time derived from historical data. However, a constant estimate of the travel

    time for a path does not capture dynamic traffic conditions very well. Thus,

    advanced techniques for travel time estimation were proposed in the early literature (Ghosh

    and Knapp, 1978; Oda, 1990; Nihan and Holmesland, 1980). Even though the specific

    approaches adopted in these studies are different, they share a common idea, i.e., discovering

    certain regular patterns from the historical data collected over time. Some studies fit

    historical data to statistical models such as Gaussian models, Bayesian networks and Markov

    chains in order to facilitate statistical analysis (Polus, 1978; Sumi et al., 1990). Techniques

    based on regression models learn from historical data. They involve building regression

    functions for estimating travel time in terms of various external factors (Polus, 1979;

    Ghosh and Knapp, 1978). A prediction is made by using the known values of those factors under

    the current situation as input. Techniques based on time series models focus on discovering

    internal relationships among historical time-series data in order to identify similar patterns

    and make predictions under the current situation (Oda, 1990; Nihan and Holmesland, 1980).

    However, the performance of the above approaches is highly constrained by the quality and

    quantity, as well as the types, of data available. For example, conventional collection of

    traffic data is typically conducted by surveys or using expensive sensors deployed along the

    roads at specific locations to record arrival times, traffic flow volumes, and other statistics

    of vehicles.

    In recent years, due to the advent of positioning and wireless communication

    technologies, wireless devices equipped with Global Positioning System (GPS) have been widely

    deployed on various private and public vehicles, generating massive amounts of vehicle

    trajectory data which can be used for fleet management and other transportation applications.

    Time-tagged location data, usually represented in the form of trajectories, bring a great po-

    tential for real-time prediction of the vehicle travel times. Among the public transportation

    systems, the travel times of buses, which share the road with other vehicles, are more

    difficult to predict than those of trains and subways, which run on exclusive paths. First, the travel

    condition of a bus may easily get affected by various internal and external factors, including

    accidents, weather, road construction, government policies and even temperature. Second,

    for vehicles in metropolitan areas (such as Chennai), errors often exist in positional-data

    acquisition due to the interference by urban canopies and other sources of errors. Thus, in

    this study, we propose a hybrid prediction framework to estimate the travel time of buses

    by exploiting selected historical trajectory data and an efficient state estimation technique

    capable of making precise estimations by exploiting a series of travel time measurements.


    1.3 Research Overview

    Recently, research on discovering traffic patterns from historical data collected from

    vehicles has received significant attention (Chen et al., 2011; Li and Rose, 2011; Tiesyte

    and Jensen, 2009). These works show that traffic patterns exist in road segments and thus

    could be used to predict the future traffic condition on the same segment and on a few up-

    coming segments. This finding provides the basis for using similar trajectories to predict

    the travel time of an ongoing bus journey. In this study, a new bus travel time prediction

    framework, called Historical Trajectory and Kalman Filter based Travel/Arrival Time

    Prediction (HTKFTP), is proposed for real-time prediction of travel time at upcoming segments (and thus

    the arrival time at bus stops) of an ongoing bus journey. The basic idea behind

    HTKFTP is to use a collection of historical trajectories similar to the current bus journey

    to predict the travel times in future segments of the bus journey. Specifically, the HTKFTP

    framework (i) identifies a set of similar trajectories as the basis for travel time estimation

    instead of relying on only one historical trajectory best matching the on-going bus journey;

    (ii) explores different features (e.g., travel times of passed segments as well as time/day of

    the bus trajectories) to identify the sample set of similar trajectories; (iii) uses the similar

    trajectories as inputs to the Kalman Filter based prediction method.
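The retrieval idea in the paragraph above can be illustrated with a minimal sketch. It assumes each historical trajectory is stored as a list of segment-wise travel times in seconds, measures similarity with a plain Euclidean distance over the passed segments, and substitutes a simple average for the Kalman Filter step. The function names `similar_trajectories` and `predict_next_segment` and all sample values are hypothetical and are not taken from the thesis code, which uses cluster-based search instead.

```python
from math import sqrt


def similar_trajectories(history, current, k=3):
    """Return the k historical trajectories closest to the ongoing trip.

    history: complete trajectories, each a list of segment travel times (s).
    current: travel times observed so far on the passed segments.
    Assumption: Euclidean distance over the passed segments only.
    """
    n = len(current)

    def distance(traj):
        return sqrt(sum((traj[i] - current[i]) ** 2 for i in range(n)))

    return sorted(history, key=distance)[:k]


def predict_next_segment(history, current, k=3):
    """Average the next-segment travel time over the k most similar trips.

    Assumption: a plain mean stands in for the Kalman Filter based predictor.
    """
    neighbours = similar_trajectories(history, current, k)
    nxt = len(current)  # index of the first segment not yet traversed
    return sum(t[nxt] for t in neighbours) / len(neighbours)


# Hypothetical historical trips over four segments.
history = [
    [60.0, 90.0, 120.0, 80.0],
    [62.0, 95.0, 130.0, 85.0],
    [100.0, 150.0, 200.0, 140.0],
    [58.0, 88.0, 118.0, 78.0],
]
current = [61.0, 92.0]  # the ongoing trip has passed two segments

print(predict_next_segment(history, current, k=2))  # prints 125.0
```

Using a set of neighbours rather than the single best match, as the text argues, keeps one unusual historical trip from dominating the estimate.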

    Several issues were faced in the design of the HTKFTP framework. For example, many

    features are associated with the trajectories. Some of these features are categorical while

    the others are numerical. Discriminative features and properly defined similarity functions

    for those features needed to be used in order to identify a sample set of similar trajectories

    effective for travel time prediction. To determine a set of similar trajectories based on travel

    time on passed segments, the V-clustering algorithm, which partitions the whole spectrum

    of travel times on a segment into a number of intervals (or clusters), was considered. To

    determine a set of similar trajectories based on hours/days, exploratory data analysis in-

    volving space-time trajectory plots of the historical trips was carried out. Accordingly, the

    HTKFTP framework is able to retrieve the sample set of similar trajectories efficiently and

    in turn use that sample set to estimate the travel times. To corroborate the proposed ideas

    and evaluate the prediction schemes proposed, an empirical experimentation using real bus

    trajectory data collected in Chennai, India, was conducted. This research work has made a

    number of significant contributions, as summarized below.


    • A new framework, namely HTKFTP, for predicting the travel times over future

      segments of an ongoing bus journey based on historical trajectory data. The framework

      consists of two major components: (i) similar trajectory retrieval; and (ii) travel time

      estimation.

    • A detailed data analysis to investigate the correlation between bus travel times in

      route segments and a number of trajectory features, e.g., passed segment travel time,

      hours, days, etc. Based on our analysis, we select a number of trajectory features to

      identify similar trajectories.

    • A clustering algorithm for passed segment travel times and space-time trajectory

      analysis in order to group similar trajectories together. These similar trajectory

      clusters allow us to efficiently and effectively retrieve a sample set of trajectories similar

      to the ongoing bus trajectory.

    • An efficient state estimation technique based on Kalman Filter, capable of making

      precise estimations by exploiting a series of travel time measurements in an inherent

      feedback mechanism. The base estimation scheme was modified to take into account

      the large variance in the data observed at selected locations/times.

    Through a comprehensive experimental study, using a real data set collected from buses

    in Chennai, India, the proposed ideas were validated. The framework was evaluated in

    terms of prediction accuracy. The experimental results show that the proposed prediction

    scheme significantly outperforms the baseline and state-of-the-art schemes.
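The clustering step described in this section, which partitions the travel times on a segment into intervals, can be sketched as a variance-based split. The sketch below is only an illustration under stated assumptions: it takes the "weighted average variance" from the Notation list to mean the size-weighted mean of the population variances of the two parts, and the stopping rule (`min_size`, `min_gain`) is a hypothetical choice, not the rule used in the thesis. The names `best_split` and `partition` are also hypothetical.

```python
from statistics import pvariance


def best_split(times, min_size=1):
    """Return (index, wav) of the split with the lowest weighted average variance.

    times must be sorted; index i splits it into times[:i] and times[i:].
    Returns None when no split leaves at least min_size items on each side.
    """
    n = len(times)
    best = None
    for i in range(min_size, n - min_size + 1):
        left, right = times[:i], times[i:]
        wav = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        if best is None or wav < best[1]:
            best = (i, wav)
    return best


def partition(times, min_size=2, min_gain=0.5):
    """Recursively split travel times into clusters (intervals).

    Assumed stopping rule: stop when a split would create a cluster smaller
    than min_size, or when it removes less than min_gain of the variance.
    """
    times = sorted(times)
    split = best_split(times, min_size)
    if split is None:
        return [times]
    index, wav = split
    if wav >= (1.0 - min_gain) * pvariance(times):
        return [times]
    return (partition(times[:index], min_size, min_gain)
            + partition(times[index:], min_size, min_gain))


# Hypothetical travel times (s) on one segment: a fast group and a slow group.
times = [120.0, 58.0, 62.0, 118.0, 60.0, 125.0]
print(partition(times))  # prints [[58.0, 60.0, 62.0], [118.0, 120.0, 125.0]]
```

Each resulting interval can then serve as a lookup bucket: a new trip whose travel time on the segment falls in an interval is matched against the historical trips in that same interval.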

    1.4 Chapter Outline

    The remainder of this report is organized as follows:

    • Chapter 2 reviews the literature in the concerned area.

    • Chapter 3 analyses the collected historical trajectory data.

    • Chapter 4 gives an overview of the HTKFTP framework and details the design of the

      similar trajectory selection algorithm.

    • Chapter 5 details the prediction scheme and the real-time prediction system design.

    • Chapter 6 reports a comprehensive experimental study using the collected real data

      set of bus trajectories, queried in real-time.

    • Chapter 7 concludes this work with a summary, followed

      by conclusions and scope for future work.


    CHAPTER 2

    LITERATURE REVIEW

    This chapter reviews the past research that motivated our work on predicting the

    movement of vehicles. We begin by giving a brief history of traffic prediction, and then review the

    major research that has focused specifically on similarity-based prediction of arrival/travel

    times and trajectory patterns.

    2.1 A Brief History of Traffic Prediction

    Research in transportation dates back to the 1930s. With few vehicles on

    roads and under-developed technologies, it was then almost impossible to collect significant

    data about traffic conditions. Thus, studies during that time were mainly about identifying

    certain rules that could be used to guide traffic management and the construction of

    transportation infrastructure. For example, the relations between traffic volumes and the weather

    were reported by Johnson (1930). This work justified the improvement of road surfaces during bad

    weather. As another typical example, Vey and Pope (1935) verified a definite

    relationship between highway lighting and highway accidents. In general, where adequate

    lighting is provided, there is a substantial reduction in night accidents.

    With the development of technologies and increasing number of vehicles on roads, more

    data about traffic conditions could be collected, subsequently causing the emergence of

    research on traffic prediction in the 1950s. However, during this period, the traffic data adopted

    in most cases were vehicle volumes (or flows) because they were easily collected by hand.

    For example, in Glanville (1955), Lighthill and Whitham (1955) and Buckley (1968), to

    obtain the vehicle volume on a road, observers were placed at certain locations to record

    the number of vehicles that passed by. Such an approach was inefficient and made it difficult

    to collect a large amount of data. Therefore, arrival/travel time prediction did not arise

    until the 1970s (Wong and Sussman, 1973; Sussman et al., 1974), when traffic sensors were

    widely adopted enabling researchers to have sufficient data for analysis.


    Estimation of arrival/travel times, especially for buses, started to attract increasing

    attention in the 1980s (Abkowitz, 1981; Polus, 1978; Sumi et al., 1990). With the

    development of society, congestion became increasingly common in cities, which

    created a need to improve the quality of public transportation service. As the most important

    aspect of public transportation service, arrival/travel time prediction became the most

    critical topic in the traffic prediction area. At the early stage of research on this topic, constrained

    by the technologies available, researchers had to work off-line on data collected from traffic

    sensors and surveys. Since the development of GPS devices and wireless technologies, it has been

    possible to collect large volumes of traffic-related data in real time. Therefore, real-time

    arrival/travel time prediction has become a hot topic since these technologies were widely applied

    in public transportation systems. Over the decades, researchers have applied different models

    and methods to real-time arrival/travel time prediction. Zhu et al. (2011)

    developed mathematical models taking into account the travel times on links, dwell times

    at stops, and delays at intersections. The algorithm proposed in Lin and Zeng (2001)

    provides real-time bus arrival information based on the bus location data, the schedule

    information, the difference between scheduled and actual arrival times, and the waiting time at

    time-check stops. Prediction methods based on historical data were also developed in Tiesyte

    and Jensen (2008).

    With the development of Artificial Intelligence, researchers have widely adopted Artificial

    Intelligence methods in real-time arrival/travel time prediction. As a result, travel

    time prediction approaches in the modern literature can be broadly classified as model based

    and data driven. Model based approaches predict travel times using traffic flow models and

    the underlying physical phenomena. For example, Krishnan and Polak (2008) explored

    recurring themes in traffic conditions and used k-Nearest Neighbors (k-NN) for indirectly

    predicting short term travel times using 15-minute aggregate flow data. Esawey and Sayed

    (2011) used a VISSIM¹ micro-simulation model of downtown Vancouver to predict travel

    times using traffic volume and travel time data of nearby segments. Kalman Filtering (KF)

    is one of the most widely adopted methods in travel time prediction in the recent literature.

    Vanajakshi et al. (2009) used a KF based method for predicting segment-wise travel times

    (using travel times of the previous two vehicles) in the heterogeneous traffic conditions prevalent

    in Indian cities such as Chennai. KF takes into account the stochastic properties of the

    process disturbance and the measurement noise. It works well for short-term prediction.

    Other notable works on KF include Xu et al. (2008), Shalaby et al. (2004) and Zhu and

    Wang (2000). Westgate et al. (2013) used a Bayesian model for travel time estimation of

    ambulances using GPS data.

    ¹ VISSIM is a microscopic traffic flow simulation software package.
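To make the role of the process disturbance and the measurement noise concrete, the following is a minimal sketch of one step of a standard scalar Kalman filter. It assumes the model suggested by the Notation list, namely that the travel time evolves as x_{i+1} = a_i x_i + w_i and is measured as z_i = x_i + v_i, with Q and R the variances of w and v. The function name `kf_step` and the numbers are hypothetical; how a, Q and R are obtained from historical trajectories is not shown here.

```python
def kf_step(x_post, p_post, a, q, r, z):
    """One scalar Kalman filter step along a route.

    x_post, p_post: corrected travel time estimate and its error variance
                    for the current segment.
    a: assumed evolution factor from the current to the next segment.
    q: variance of the process disturbance w.
    r: variance of the measurement noise v.
    z: measured travel time on the next segment, once it is available.
    Returns (x_prior, x_new, p_new), where x_prior is the prediction made
    before z is seen and x_new, p_new are the values corrected with z.
    """
    # Time update: propagate the estimate and its uncertainty.
    x_prior = a * x_post
    p_prior = a * a * p_post + q
    # Measurement update: blend prediction and measurement via the gain.
    gain = p_prior / (p_prior + r)
    x_new = x_prior + gain * (z - x_prior)
    p_new = (1.0 - gain) * p_prior
    return x_prior, x_new, p_new


# Hypothetical values: 80 s on the current segment, next segment 25% slower.
print(kf_step(x_post=80.0, p_post=0.0, a=1.25, q=4.0, r=4.0, z=110.0))
# prints (100.0, 105.0, 2.0)
```

With equal disturbance and noise variances the gain is 0.5, so the corrected value lies halfway between the prediction (100 s) and the measurement (110 s). A larger R would pull it toward the prediction, a larger Q toward the measurement.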

    Data driven approaches predict travel time with the use of statistical relationships, which

    are derived from historical data (travel times, speeds, volumes, etc.). The most commonly

    reported data driven approaches in the literature include machine learning techniques, time

    series analysis and historical averaging approaches. In machine learning techniques, the

    prediction model learns some properties from several instances of historical data. For

    example, Patnaik et al. (2004) used a machine learning technique called multivariate linear

    regression for bus arrival time estimation using automatic passenger counter (APC) data.

    Artificial neural networks (ANNs) are another widely used method. Liu et al. (2009)

    used neural networks to indirectly predict travel times using traffic volume and flow data.

    ANNs have the major advantage that they can model complex non-linear relationships. However,

    they are limited by extremely long training times. Other notable works using ANNs include

    van Lint (2006), Zou et al. (2008) and Batool and Khan (2005). Other machine

    learning methods have also become popular in recent years. Real-time prediction using Support

    Vector Regression (SVR) and Support Vector Machine (SVM) has become a hot topic recently.

    For example, Wu et al. (2004) used SVR for travel time prediction using highway traffic

    data. Vanajakshi and Rilett (2007), Vanajakshi and Rilett (2004) and Yu et al. (2006) are

    other instances of the use of SVR. Similar to ANNs, SVR is too expensive in training for

    real-time updates. In a time series analysis approach, temporal patterns are identified in the

    historical data and future values are predicted with the assumption that these patterns hold

    in the near future. For example, Guin (2006) used a time series analysis approach called

    seasonal autoregressive integrated moving average (SARIMA) to predict travel times using

    historical travel time data.

    Though the model based approaches provide valuable insights into the mechanisms of

    traffic flow and queue dynamics, their inherent limitations hinder their application in real-

    time systems. The major disadvantages include high computational complexity, intensive

    model/parameter calibration, requirement for predicting traffic demand/capacity and the

    degree of expertise required for design and maintenance. On the other hand, data driven

    approaches can be deployed quicker and cheaper compared to model-based approaches.


    They can provide scope for prediction when there is a large diversity (or variance) in the

    historical data. In such cases, predicting using physical models which are narrow in scope

    can be expensive. In this study, a data driven approach is chosen for travel time prediction

    exploiting similar historical trajectories, as explained in the following sections.

    2.2 Approaches Exploiting "Similarity"

    Since bus trips repeat on the same route, at more or less the same time, on different days, a
    similarity-based approach is a straightforward way to predict future travel times. A great
    amount of work has been done on identifying similar trajectories or similar time series, in
    both one dimension and multiple dimensions. Yi and Faloutsos (2000) proposed the Lp-norm,
    which gives the Manhattan distance (p = 1) or the Euclidean distance (p = 2), as a measure of
    similarity. The Lp-norm is widely applied but is only applicable to time series of the same
    length. Therefore, other similarity measures have been developed and adopted. Berndt and
    Clifford (1994) introduced Dynamic Time Warping (DTW), which was later adopted in Assent et
    al. (2009) and Vlachos et al. (2006). The concept of edit distance was introduced in
    Levenstein (1966), and the most widely used distance based on edit distance is the Longest
    Common Subsequence (LCSS) distance. Vlachos et al. (2002), Fashandi and Moghaddam (2005) and
    Hermes et al. (2009) applied LCSS as the distance measure to fetch similar trajectories or
    time series. However, these algorithms tend to emphasize the overall similarity of the whole
    trajectory, without considering the similarity of trajectories on individual segments or
    subsets of segments. Additionally, while LCSS and DTW are applicable to trajectory data, they
    are highly sensitive to noise and errors. In this project, a similarity measure for
    trajectories based on the similarity of corresponding individual segments is proposed.

    Recently, prediction methods based on historical trajectory data have also been developed in
    Jensen and Tie (2008), Tiesyte and Jensen (2009) and Tiesyte and Jensen (2008). The authors
    show that the similarity between historical trajectories and the current position data of a
    bus can be exploited to predict bus arrival times at bus stations. This shares the same
    intuition as the Historical Trajectory based Travel time Prediction (HTTP) framework in Lee
    et al. (2012). The present project can be said to be built on top of HTTP, with the inclusion
    of additional features for the selection of similar trajectories and the use of a Kalman
    Filter (KF) for prediction. In Jensen and Tie (2008), the authors developed a system called
    TransDB that searches the historical trajectory database for the most similar trajectory,
    based on the passed segments of the current bus trajectory, in order to make a good
    prediction. The basic idea is that, under the proposed trajectory similarity function, the
    nearest neighbourhood trajectory (NNT) and the trajectory of the current bus ride are
    expected to exhibit similar travelling behaviour (in terms of travel time). Based on this
    assumption, the NNT serves as a good basis for predicting the future travel time of the
    current bus ride without explicitly taking into account various external and internal
    factors. However, Lee et al. (2012) argue that the historical trajectory that is most similar
    to the passed segments of the current bus trajectory may not, on its own, provide the best
    prediction for the ongoing bus ride. Thus, they collect a set of similar trajectories and
    adopt a statistical approach to make predictions. Additionally, they exploit different
    features associated with trajectories and develop different similarity functions to find
    similar trajectories that make significantly more accurate travel time predictions. Our
    approach differs from HTTP in that it is not a statistical approach. From the data analysis
    studies explained in Chapter 4, it was found that, although the statistical approach provides
    satisfactory predictions for a few upcoming segments, the predictions get worse when they are
    made for a time far into the future (future segments far from the current one). It was
    observed that the historical trajectories within a temporal neighbourhood of 30 minutes
    around the ongoing trajectory are more significant in improving the prediction accuracy for
    farther future segments. Additionally, with the error feedback mechanism inherent in the KF
    based prediction algorithm, the accuracy tends to improve from one future segment to the next
    during prediction.

    2.3 Trajectory Patterns

    Patterns of historical trajectories are described in two classes: trend and periodicity. The
    trend represents a general systematic linear or non-linear component that changes over time
    and does not repeat, or at least does not repeat within the time range captured by the data.
    The periodicity represents the component that repeats itself at certain intervals of time. In
    Wu et al. (2003), the authors display the daily periodicity in historical data of travel
    times around the same location. Zhu et al. (2009) also conducted an analysis to verify the
    existence of periodicity of speeds over time on a route segment. Chen et al. (2011) and Li
    and Rose (2011) verify the pattern by measuring the correlation between the traffic on a
    specific route in different time periods. Vanajakshi et al. (2009) analysed travel time
    variation plots in heterogeneous traffic conditions, using GPS trajectory data from buses in
    Chennai, India. From their analysis, they concluded that the travel time patterns were more
    closely related for consecutive vehicles (with a headway 15 min) on the same day. Weekly and
    daily patterns were not as significant. Hence, they used the travel times of the previous two
    vehicles for prediction. Similarly, Kumar and Vanajakshi (2012) used a statistical test to
    check whether the previous trips on the same day, the previous day(s)' same-time trips or the
    previous week(s)' same-day/same-time trips are significant in predicting the travel times of
    an ongoing trip. The authors concluded that the previous two weeks' same-day/same-time trips
    and the previous three trips on the same day were significant and could be included as inputs
    in the prediction model developed using a simple exponential smoothing technique.

    It is clear from the above studies that the travel time behaviour of vehicles moving on fixed
    routes is not random. There exist significant patterns in travel times for trips made around
    the same time of the day. Such patterns support the use of historical data of a certain
    segment to predict the future traffic condition on the same segment. This forms the basis for
    the development of the HTKFTP framework introduced in Chapter 1. The next chapter discusses
    the various kinds of analyses carried out on real-world bus trajectory data to explore
    possible patterns in travel times.


    CHAPTER 3

    DATA ANALYSIS

    Several analyses were carried out to explore the correlations and patterns in the historical
    trajectory data, which comprise segment-wise travel times. Our goal in data analysis is
    two-fold:

    (i) to verify the suitability of using historical trajectories for the prediction of future
    travel times for an ongoing trajectory; and

    (ii) to explore any patterns in travel time data that can be used for the prediction.

    3.1 Raw Data

    The raw GPS data used in this study were collected over a period of 4 months, from January
    2014 to April 2014, from the Metropolitan Transport Corporation (MTC) buses running on one of
    the busiest routes in Chennai, namely 19B, which connects Kelambakkam in the south to
    Saidapet in central Chennai. Each bus is equipped with a GPS device that records the status
    of the bus along with its movement and pushes the status to a central server every 10
    seconds. Each data point consists of the GPS coordinates and the corresponding time-stamp, as
    shown in Table 3.1. Each bus and each route has its own identification. Location details of
    each bus stop on a selected route are collected and stored. Each bus station has its own name
    and GPS coordinates, as well as the IDs of the routes it belongs to, so that all the bus
    stations for each route can be found. For a specific route, a bus station has a sequence
    number among all bus stations belonging to this route. In most cases, there is more than one
    bus travelling on a route. Each bus travels on a fixed route several times a day.

    Over the four months from January to April 2014, a total of 28 buses on the 19B route
    completed 3,686 trajectories running back and forth. The north-bound 19B route, with ID 1101,
    is chosen for analysis. This route has 15 stops, with the origin at the Kelambakkam Bus
    Station and the last stop at Saidapet Bus Depot. It covers a distance of 29.4 kilometres
    (i.e. 147 segments) and the average trip duration from the origin to the destination is about
    4000 seconds. From January to April 2014, there are a total of 2,212 north-bound trajectories
    on this route.

    Table 3.1: A sample of the raw data received from the GPS devices on the buses.

    Timestamp            Longitude   Latitude
    04-Apr-14 09:28:15   80.242317   13.005729
    04-Apr-14 09:28:25   80.242317   13.005729
    04-Apr-14 09:28:35   80.242241   13.005681
    04-Apr-14 09:28:45   80.241928   13.005391
    04-Apr-14 09:28:55   80.241828   13.004879

    3.2 Data Transformation

    From the raw data of time-stamp and latitude/longitude, other useful quantities such as
    distance, cumulative distance, UNIX time, time difference and speed were calculated as
    explained below. The distance (assuming straight line travel) between two consecutive GPS
    locations of a bus was calculated using the haversine formula, as shown in Equation 3.1.

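    \[
    D = R \cos^{-1}(a + b) \tag{3.1}
    \]
    where
    \[
    \begin{aligned}
    R &= \text{radius of Earth} = 6371000\ \text{m (mean)} \\
    a &= \cos\left(\tfrac{\pi}{2} - lat_1\right) \cos\left(\tfrac{\pi}{2} - lat_2\right) \\
    b &= \sin\left(\tfrac{\pi}{2} - lat_1\right) \sin\left(\tfrac{\pi}{2} - lat_2\right) \cos(lon_1 - lon_2)
    \end{aligned}
    \]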
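
    As a minimal sketch, Equation 3.1 could be evaluated as below. Reading a and b in terms of
    colatitudes (pi/2 - lat) gives the spherical law of cosines form of the great-circle
    distance. The function name and the clamping step are illustrative assumptions, not taken
    from the implementation used in this study. For points only a few metres apart this form is
    sensitive to rounding, so the numerically more stable haversine form is often preferred in
    practice.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, as in Equation 3.1


def great_circle_distance(lat1, lon1, lat2, lon2):
    """Distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon1 - lon2)
    # a = cos(pi/2 - lat1) * cos(pi/2 - lat2) = sin(lat1) * sin(lat2)
    a = math.sin(phi1) * math.sin(phi2)
    # b = sin(pi/2 - lat1) * sin(pi/2 - lat2) * cos(lon1 - lon2)
    b = math.cos(phi1) * math.cos(phi2) * math.cos(dlon)
    # Clamp to [-1, 1]: rounding can push a + b slightly outside acos's domain.
    return EARTH_RADIUS_M * math.acos(max(-1.0, min(1.0, a + b)))
```

    Applied to the first two coordinate pairs of Table 3.2, this sketch gives a distance of
    roughly 22 m.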

    Table 3.2 shows sample transformed data. From the calculated distances, the corresponding
    cumulative distance travelled up to each GPS point was also calculated. The timestamp data,
    initially in the format "dd-mm-yyyy HH:MM:SS" (a string), was converted into UNIX time
    format¹. This conversion speeds up several operations on the time-stamps. Using these
    time-stamps, the time difference between each pair of consecutive GPS points was calculated
    (the time-difference column, Δt(s), in Table 3.2). The speed of the bus at a particular point
    was calculated by dividing the corresponding distance by the time difference.

    ¹ The UNIX time form of a time-stamp is the number of seconds (an integer) elapsed since
    00:00:00 UTC, January 01, 1970 until the timestamp under consideration.


    Table 3.2: A sample of data records after transformation.

    UnixTime(s)   Lon(°)      Lat(°)      Δt(s)   Dist(m)     CumDist(m)   Speed(m/s)
    1379390422    80.127693   12.923      10      43.1092     294.1599     4.31092
    1379390432    80.127899   12.922989   10      22.384146   316.544      2.238414
    1379390443    80.128448   12.92292    11      60.106274   376.6503     5.464206
    1379390453    80.128997   12.922869   10      59.87249    436.5228     5.987249
    1379390463    80.129661   12.922829   10      72.165902   508.688      7.21659
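
    The derived columns of Table 3.2 could be produced along the following lines. This is only a
    sketch: the function names are hypothetical, the default format string follows the sample
    timestamps of Table 3.1, the timestamps are treated as UTC (an assumption, since the time
    zone handling is not stated), and month abbreviations are parsed according to the current
    locale.

```python
import calendar
from datetime import datetime


def to_unix(timestamp, fmt="%d-%b-%y %H:%M:%S"):
    """Convert a timestamp string to UNIX time, treating it as UTC (assumption)."""
    return calendar.timegm(datetime.strptime(timestamp, fmt).timetuple())


def derive_columns(unix_times, step_dists):
    """Return (unix_time, dt, dist, cum_dist, speed) for every point after the first.

    step_dists[k] is the distance in metres from point k-1 to point k.
    """
    rows, cum_dist = [], 0.0
    for k in range(1, len(unix_times)):
        dt = unix_times[k] - unix_times[k - 1]
        cum_dist += step_dists[k]
        speed = step_dists[k] / dt if dt > 0 else None  # undefined when dt == 0
        rows.append((unix_times[k], dt, step_dists[k], cum_dist, speed))
    return rows
```

    Here the cumulative distance restarts from zero at the first point passed in, which matches
    the per-trip update of cumulative distance described in Section 3.2.2.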

    3.2.1 Data Cleaning

    There were several stray records in the raw data. These could be detected using the distance

    and time difference values. Some records had distance more than 1000 metres in a 10

    second interval which is impossible since the corresponding speed becomes more than 360

    km/h. This may be because of errors in (or misplacement of) the longitude and latitude

    values. A higher value of time difference implies the absence of several GPS logs. The

    distance in these cases is also inaccurate (since we assumed straight line travel and the bus

    might have undergone several changes in direction in a long time). Such data were detected

    in an automated way and were not considered for analysis.
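
    A possible form of this automated check is sketched below. The speed limit of 100 m/s
    corresponds to the 1000 m in 10 s mentioned above; the limit on the time gap is not stated in
    this report, so the 60 s used here is purely a hypothetical placeholder, as are the function
    names.

```python
def is_plausible(dt_s, dist_m, max_speed_mps=100.0, max_gap_s=60.0):
    """Return True if a record looks usable.

    max_speed_mps=100 corresponds to 1000 m in 10 s (360 km/h);
    max_gap_s is a hypothetical limit on the gap left by missing GPS logs.
    """
    if dt_s <= 0 or dt_s > max_gap_s:
        return False
    return dist_m / dt_s <= max_speed_mps


def clean(records):
    """Keep only the plausible (dt_s, dist_m) records."""
    return [r for r in records if is_plausible(r[0], r[1])]
```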

    3.2.2 Extracting Trip Data

    The daily data files for each device included multiple trips made by that bus on that day.
    The first task was to extract the data trip-wise and store each trip in a separate CSV file.
    Each such trip file consisted of about 600 records, with the first record corresponding to
    the departure of the bus from the origin bus station and the last one corresponding to its
    arrival at the destination bus station. Each bus made 3 - 4 trips per day. Each trip file was
    named in the following format: "IMEI_date_start time_direction.csv", where IMEI and date are
    the IMEI number of the device and the date of the data, start time is the timestamp of the
    first record of the file (i.e. departure time) and direction indicates whether it is a
    north-bound or a south-bound trip.

    After this extraction, the cumulative distance was updated for each trip separately. The

    cumulative distance and the UNIX time were used for plotting the space-time trajectories

    (cumulative distance vs. cumulative time) for further analysis as discussed in Section 3.4.


    3.2.3 Calculation of Segment-wise Travel Times

    For the calculation of travel time, the study routes were discretised into smaller segments
    of length 200 meters each. These segments had fixed end points which were maintained
    throughout the analysis for calculating the historical travel times. For a particular route,
    these segmental travel times were stored in a grid layout where each column represented a
    segment and each row represented a trip. Thus, a column consisted of the travel times on a
    particular segment for all the trips over four months and a row consisted of the travel times
    on all the segments of the route for a particular trip.
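
    One way to obtain one row of this grid from a trip's space-time trajectory is sketched below.
    The report does not state how the crossing time at each 200 m boundary was obtained, so
    linear interpolation between the neighbouring GPS points is an assumption here, and the
    function name is hypothetical.

```python
def segment_travel_times(cum_dist, times, seg_len=200.0):
    """Travel time on each full segment of length seg_len.

    cum_dist: non-decreasing cumulative distances (m) of the GPS points.
    times:    matching UNIX times (s).
    Boundary crossing times are linearly interpolated (an assumption).
    """
    crossings = []
    boundary = 0.0
    k = 0
    while boundary <= cum_dist[-1]:
        # Advance to the GPS interval [k, k+1] that contains the boundary.
        while k + 1 < len(cum_dist) and cum_dist[k + 1] < boundary:
            k += 1
        if k + 1 < len(cum_dist) and cum_dist[k + 1] > cum_dist[k]:
            frac = (boundary - cum_dist[k]) / (cum_dist[k + 1] - cum_dist[k])
            crossings.append(times[k] + frac * (times[k + 1] - times[k]))
        else:
            crossings.append(times[k])
        boundary += seg_len
    # Differences between consecutive crossing times are the segment travel times.
    return [t1 - t0 for t0, t1 in zip(crossings, crossings[1:])]
```

    Stacking the lists returned for all trips of a route, one per row, gives the grid described
    above.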

    3.3 Correlation Between Segments

    Analysis was carried out to find correlations between the segments, in order to check whether
    the choice of passed-segment travel times as a similarity measure for finding historical
    trajectories is suitable for the prediction of future segment travel times. If a correlation
    in terms of travel times exists between segments, previous segments can be taken as related
    to later segments along the route. Given a current trajectory and a historical trajectory
    that is similar to it in terms of passed segments, their future travel times are then also
    similar with a high probability. We use Pearson's correlation to measure the correlation
    between segments. The Pearson Product Moment Correlation (Pearson's correlation for short) is
    widely used to measure the linear association between two variables. The value of the
    Pearson's correlation coefficient always falls between -1 and 1. Positive values mean
    positive correlations and negative values mean negative correlations. The farther the value
    is from 0, the stronger the correlation. Given two variables $X$ and $Y$, with means
    $\bar{X}$ and $\bar{Y}$ and standard deviations $\sigma_X$ and $\sigma_Y$, the correlation
    $\rho$ between them is computed as
    \[
    \rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\,\sigma_X \sigma_Y} \tag{3.2}
    \]
    where $n$ is the number of elements in $X$ and $Y$. The farther two segments are from each
    other, the weaker will be the influence of one on the other.
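
    Equation 3.2 can be written out directly, as in the sketch below. The two inputs would be
    two columns of the travel-time grid (the historical travel times of two segments); the
    function name is illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, as in Equation 3.2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (divisor n - 1), matching the (n - 1) in Eq. 3.2.
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    cov_sum = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov_sum / ((n - 1) * sd_x * sd_y)
```

    The sketch assumes both sequences have at least two elements and are not constant, since
    otherwise a standard deviation of zero makes the coefficient undefined.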

    Figure 3.1 can be used to read off the Pearson's correlation for any two segments as well as
    its trend with distance. The Y-axis values are the Pearson's correlation coefficients between
    two travel time arrays (historical travel times corresponding to two segments) and the X-axis
    shows the number of segments between them, which is termed the Segment-Distance. For example,
    given the Pearson's correlation between segment 20 and segment 25, a corresponding point is
    drawn on the figure with the X-value being 5 (which is 25 minus 20). However, such a figure
    does not offer a clear illustration of the change in Pearson's correlation because there are
    too many points for each X-axis value. To solve this problem, we plotted Figure 3.2, which
    shows the average value of all points for each X-value.

    As shown in Figure 3.1, some Pearson's correlation exists between almost any two arbitrary
    segments. However, the correlation does not appear to be high for most pairs of segments.
    Specifically, when two segments are near each other, the Pearson's correlation is noticeably
    higher than for other pairs. Therefore, a segment is more related to nearby segments than to
    farther ones. Figure 3.2 shows an apparent declining curve from 1 along the X-axis. Based on
    this, it can be concluded that the segments closest to the one being analysed are the most
    correlated with it and can be used as input for prediction.
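
    The averaging behind Figure 3.2 can be sketched as follows, assuming the pairwise Pearson's
    correlations have already been collected in a square matrix. The matrix name corr and the
    function name are hypothetical.

```python
from collections import defaultdict


def average_by_segment_distance(corr):
    """Mean correlation for each segment distance |i - j| >= 1.

    corr is a square matrix (list of lists) where corr[i][j] is the
    Pearson correlation between the travel times of segments i and j.
    """
    buckets = defaultdict(list)
    n = len(corr)
    for i in range(n):
        for j in range(i + 1, n):
            buckets[j - i].append(corr[i][j])
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}
```

    Each key of the returned dictionary is a Segment-Distance (an X-value of Figure 3.2) and each
    value is the mean of all points of Figure 3.1 at that X-value.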

    Figure 3.1: Pearsons correlation coefficients versus the segment distance.

    3.4 Travel Time Patterns

    The second goal of data analysis was to explore any patterns in the data that could be used
    for the prediction. Intuitively, the travel time of a segment should be related not only to
    those of nearby segments, but also to other segment-specific or traffic-related parameters.
    For example, in a city area, the traffic conditions are usually worst during the peak hours
    in the morning and evening. Therefore, we can associate the travel times with a temporal
    feature. Similarly, the travel times on the same segment may differ between weekdays and
    weekends: on weekdays, the travel time may be higher than on weekends.

    Figure 3.2: Average correlation coefficient versus the segment distance.

    The present study analyses the two most common of those patterns, namely the day-wise pattern
    and the time-of-day pattern. During peak hours in the morning and evening, congestion occurs
    with a high probability. Therefore, the travel time of a segment may be high during peak
    hours and low in off-peak hours. To visualize the travel time variation within a day,
    within-day travel times are grouped into 14 time periods² of 1 hour each. Figure 3.3 shows
    the variations in travel times along a day for two typical segments, namely Segment 28 and
    Segment 100. For each of these segments, travel times are assigned to the 14 bins according
    to the hour in which they occurred. The Y-axis represents the travel time in seconds. For
    each box plot, the thick line in the middle of the box is the median. The upper and lower
    edges of the box are the 75th and 25th percentiles of the data, respectively. Data regarded
    as outliers are shown as bubbles (outside the upper and lower fences³).

    It can be seen that the travel times in the morning from 8 am to 10 am and in the evening
    from 5 pm to 7 pm are relatively higher than at other times. It can be expected that travel
    times on a segment in peak hours are more similar to those in other peak hours, and travel
    times in off-peak hours are also likely to be similar to each other. As a general rule,
    travel times which occurred around the same time of the day are more similar to each other.
    This forms the basis for the temporal neighbourhood scheme introduced in Chapter 4. According
    to the scheme, historical trajectories which occurred within a fixed temporal neighbourhood
    (of 30 minutes or 1 hour) of the test trajectory are more reliable for prediction than those
    outside the neighbourhood.

    (a) Variation of travel times on Segment 28 across the hours of the day
    (b) Variation of travel times on Segment 100 across the hours of the day
    Figure 3.3: Travel time analysis by hours of the day

    ² The usual working hours of the MTC buses.
    ³ The upper fence (end point of dotted line) is calculated as Median + 1.5(IQR) and the lower
    fence as Median - 1.5(IQR), where IQR is the inter-quartile range, i.e., the difference
    between the 75th and 25th percentile values of the data.
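
    The temporal neighbourhood idea can be illustrated with the short sketch below, which keeps
    the historical trips whose start time of day lies within a given radius of that of the test
    trip. Interpreting the neighbourhood as plus or minus the radius (30 minutes by default) is
    an assumption, and all names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def in_temporal_neighbourhood(test_start, hist_start, radius_s=1800):
    """True if two trips start within radius_s of each other by time of day.

    Both arguments are UNIX times; only their time-of-day parts are compared.
    """
    diff = abs(test_start % SECONDS_PER_DAY - hist_start % SECONDS_PER_DAY)
    diff = min(diff, SECONDS_PER_DAY - diff)  # wrap around midnight
    return diff <= radius_s


def select_neighbours(test_start, historical_starts, radius_s=1800):
    """Historical start times (UNIX) inside the neighbourhood of the test trip."""
    return [h for h in historical_starts
            if in_temporal_neighbourhood(test_start, h, radius_s)]
```

    Because only differences in time of day are compared, the result does not depend on the time
    zone in which the UNIX times were recorded.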


    Figure 3.4: Comparison between weekday peak and weekday off-peak trips.

    Figure 3.4 shows the space-time trajectories⁴ for all the 2,212 trajectories. The blue
    trajectories happened in the peak hours whereas the green ones happened in the off-peak hours
    during the weekdays. Clearly, the peak hour trajectories have more variance than those in
    off-peak hours.

    Figure 3.5 shows a heat-map that represents the correlation matrix which was obtained by
    binning all the historical travel times on Segment 28 into 14 bins (corresponding to the 14
    working hours in a day) and calculating the Pearson's correlation coefficients among them.
    The diagonal squares are all white (correlation = 1) since these represent the correlation of
    one bin with itself. It is clear from the heat-map that the squares closer to the diagonal
    are whiter than those away from the diagonal. This means that the historical travel times
    which occurred temporally closer (within a radius of 1-2 hours) to each other are more
    correlated to each other. This conclusion forms the basis of the temporal neighbourhood
    feature for the selection of similar historical trajectories, as discussed in Section 4.7.

    ⁴ A plot between the cumulative distance and the cumulative time taken to cover that distance.

    Figure 3.5: Correlations between travel times occurring in different hours of the day

    Figure 3.6: Comparison between weekday and weekend trips.

    Travel times are related not only to the hour in which they occur, but also to the day on
    which they occur. To verify the relation between travel times and the day, we classified the
    days into 2 classes, namely weekday and weekend. As Figure 3.6 indicates, travel times on
    weekdays have higher variance than those on weekends. Thus, the assumption of taking
    weekday/weekend as a discriminative feature for trajectory selection is also valid. A similar
    analysis across different days of the week is shown in Figure 3.7. It can be seen that the
    days are not distinctly different from each other and hence they were not analysed
    separately.

    From the above analyses, it can be concluded that several patterns exist in the travel

    times of buses moving on the same route. In the present case, the weekday/weekend pat-

    tern and the intra-day hourly pattern are the most significant. Based on these patterns, two

    schemes based on the temporal features of the trajectories are proposed in Chapter 4. From

    the correlation analysis, it was concluded that the correlation between closely spaced seg-

    ments is significant. This forms the basis for the passed segments scheme proposed in the

    next chapter.


    Figure 3.7: Comparison between the weekdays.


    CHAPTER 4

    THE FRAMEWORK AND THE CLUSTERING

    ALGORITHM

    Through the data analysis presented in Chapter 3, we observed the correlations between the
    segment travel times and the various trajectory features. Cluster analysis was adopted for
    the identification of the most correlated trips and is discussed in this chapter along with
    the other schemes based on temporal features. Using the identified trips as input, a novel
    travel time prediction framework, called Historical Trajectory and Kalman Filter based
    Arrival/Travel Time Prediction (HTKFTP), is developed based on a large collection of
    historical bus trajectories. This chapter focuses on the historical trajectory selection
    part, whereas the prediction algorithm is discussed in detail in Chapter 5. The section below
    defines the terminology used in this framework.

    4.1 Terms and Definitions

    Since the buses are travelling on fixed routes, the geometrical routes in a two-dimensional

    space can be represented in a one-dimensional space, where the position of each point on

    the route is the distance from the start of the route. A route can be considered as consisting

    of points on it and in a classical way, we choose a series of points to represent a route. In

    our case, each point on the route is at a distance from the origin which is a multiple of

    200 meters (along the route), so that the entire route is split into segments of 200 meters

    length. This choice, of having smaller segments to represent the route, was made to closely

    capture the pattern of segment-wise travel times along the route for a particular journey.

    The various terms used in this study, are defined below.

    1. A raw route $R_{raw}$ is represented as a sequence of points, $p_0, p_1, \ldots, p_n$.
    Each point $p_i$ stands for the starting point of the $i$th segment and its value denotes the
    total distance along the route from the starting point of the route to the end of the
    $(i-1)$th segment. Thus, $p_i < p_{i+1}$. $n$ is the total number of points on the route
    including the origin and destination.

    2. A segment $S_i$ is a part of a route between two adjacent points $p_i$ and $p_{i+1}$.

    3. A route $R$ is represented by a sequence of segments, $S_0, S_1, \ldots, S_{n-1}$.
    The value of $S_i$ denotes $p_{i+1} - p_i$. Our goal is to predict the arrival times at the
    bus stops. Each bus stop has a latitude and longitude which lies on the route.

    4. A route is also represented by a sequence of bus stops, $B_0, B_1, \ldots, B_m$.
    The value of $B_i$ denotes $S_0 + \ldots + S_{l-1} + d$ if the bus stop $B_i$ lies on the
    segment $S_l$. Here, $d$ is the distance along the route from $p_l$ (end point of segment
    $S_{l-1}$) to the location of bus stop $B_i$. The trajectory data of a bus journey consists
    of a series of time-stamped locations of the bus on the route.

    5. A raw trajectory $T_{raw}$ is represented as a sequence $(p_0, t_0), \ldots, (p_n, t_n)$.
    $p_i \in R$ and $t_i$ denotes the travel time from $p_0$ to $p_i$.

    6. A trajectory $T$ is a sequence $t_0, \ldots, t_N$.
    $t_i$ denotes the travel time on $S_i$ and $N$ ($= n - 1$) is the number of segments on the
    route. During a complete trajectory, a bus generates travel times on all the segments on the
    route. Therefore, given $M$ historical trajectories, there are $M$ travel times for each
    segment.

    7. For a segment $S_i$, there is a corresponding sequence of travel times,
    $t^{S_i}_0, \ldots, t^{S_i}_{M-1}$.
    $t^{S_i}_j$ is the travel time of a bus on this segment in the $(j+1)$th trajectory and $M$
    is the number of historical trajectories. This sequence is denoted as $ST_i$ for $S_i$.

    For a particular route, all the historical trajectory data can be stored in a table format in

    which the columns represent the attributes of the trips such as start time from the origin,

    date of trip and the travel times on each segment whereas the rows (or records) represent

    the individual trajectories. As discussed later in this chapter, the historical travel times on

    a particular segment are clustered into smaller groups so as to minimize the within-cluster

    variance for each group.

    8. Given a sequence of historical travel times $ST_i$ for a segment $S_i$, it can be split
    into a sequence of intervals (or clusters) $SC_i := C^{S_i}_0, \ldots, C^{S_i}_{K-1}$.
    $K$ is the number of travel time clusters for $S_i$. Note that $K$ is a random variable which
    depends on the variance of the historical travel times on the segment. The process from
    $ST_i$ to $SC_i$ is explained later in this chapter.
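
    The definitions above map onto simple data structures, as in the following sketch. It assumes
    that each bus stop has already been projected onto the route as a distance from the origin,
    and that the historical travel times are held in a grid with one row per trajectory. All
    function names are hypothetical.

```python
from bisect import bisect_right


def make_points(route_length_m, seg_len=200.0):
    """Points p_0, p_1, ... at multiples of seg_len along the route."""
    count = int(route_length_m // seg_len)
    return [k * seg_len for k in range(count + 1)]


def locate_stop(points, stop_pos_m):
    """Return (l, d): the stop lies on segment S_l, d metres past p_l."""
    l = min(bisect_right(points, stop_pos_m) - 1, len(points) - 2)
    return l, stop_pos_m - points[l]


def segment_column(grid, i):
    """ST_i: travel times on segment S_i across all historical trajectories.

    grid[j][i] is the travel time of trajectory j on segment i.
    """
    return [row[i] for row in grid]
```

    For the 29.4 km study route, make_points yields 148 points and hence 147 segments of 200
    meters each.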

    4.2 Problem Formulation

    Consider a bus route $R := S_0, \ldots, S_{N-1}$ with $N$ segments. For a bus travelling on
    segment $S_i$, its current (incomplete) trajectory $T_{curr}$ can be represented as a
    sequence of travel times on the passed segments, i.e.
    $T_{curr} := t^{curr}_0, \ldots, t^{curr}_i$ ($0 \le i < N$). Let $d$ be the distance of the
    bus from $p_{i+1}$ (end point of $S_i$) along the route. Suppose the bus stop $B_j$ at which
    the bus arrival time is to be predicted lies on $S_l$ ($l > i$) and let $d'$ be the distance
    of $B_j$ from $p_l$ (start point of $S_l$).

    Given $M$ historical trajectories and $T_{curr}$, we aim to develop an effective framework to
    predict the travel times $t_i, \ldots, t_l$ on the segments $S_i, \ldots, S_l$. The arrival
    time of the bus at $B_j$ is given by
    \[
    A_{B_j} = T + \frac{d}{S_i}\, t_i + t_{i+1} + \cdots + t_{l-1} + \frac{d'}{S_l}\, t_l \tag{4.1}
    \]
    where $T$ is the current time and $A_{B_j}$ is the arrival time at $B_j$.
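
    Equation 4.1 amounts to scaling the predicted travel times of the first and last segments by
    the fraction of each segment still to be covered, and adding the full travel times of the
    segments in between. A minimal sketch, with hypothetical argument names and with the distance
    of the stop from the start of its segment passed as d_stop, is:

```python
def arrival_time(now, d, d_stop, seg_len, t_pred, i, l):
    """Arrival time at a stop on segment S_l for a bus currently on S_i (l > i).

    now:     current time T (s)
    d:       distance from the bus to the end of S_i (m)
    d_stop:  distance of the stop from the start of S_l (m)
    seg_len: sequence of segment lengths S_0, ..., S_{N-1} (m)
    t_pred:  sequence of predicted travel times, indexed by segment (s)
    """
    remaining_current = d / seg_len[i] * t_pred[i]  # rest of the current segment
    full_segments = sum(t_pred[i + 1:l])            # segments i+1, ..., l-1
    partial_last = d_stop / seg_len[l] * t_pred[l]  # part of S_l up to the stop
    return now + remaining_current + full_segments + partial_last
```

    For instance, with 200 m segments, predicted times of 40 s, 50 s and 60 s on segments 1 to 3,
    a bus 100 m from the end of segment 1 and a stop 50 m into segment 3, the predicted arrival
    is 20 + 50 + 15 = 85 s after the current time.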

    4.3 Overview of the Framework

    In this section, we first provide an overview of the proposed HTKFTP system framework and
    then discuss the details of the cluster analysis (in the passed segments scheme) carried out
    for pattern identification. Figure 4.1 shows the system design of the HTKFTP framework. As
    illustrated, the proposed HTKFTP system (i.e., a location based service) continuously
    collects bus trajectory data from GPS-equipped buses, which report the latest bus status
    including time-stamped geographical coordinates of the bus and instantaneous speed. The
    HTKFTP server is responsible for receiving and storing the trajectory data, monitoring the
    incomplete trajectories of ongoing buses, and predicting bus travel times on the routes in
    response to (i) passenger enquiries and (ii) real-time updates of bus arrival times at bus
    stops. As shown in Figure 4.1, the HTKFTP server consists of three modules: a) Real-time Bus
    Status Monitoring (RBSM) module; b) Travel Time Prediction (TTP) module; and c) Nearest
    Neighbour Search (NNS) module.

    Figure 4.1: Overall architecture of the HTKFTP framework

    The RBSM module is responsible for communicating with the buses to receive bus status
    information and GPS data updates of the ongoing trajectories. Once an update from a bus $b$
    reaches the server, RBSM catches the status (such as current bus coordinate and new time
    stamp) of $b$, extracts features associated with the developing trajectory $T_b$, and stores
    the information as part of $T_b$ in the historical trajectory repository.

    The TTP module is responsible for predicting the arrival times of buses at bus stops,

    which can be reduced to a problem of predicting the travel times of buses on their remain-

    ing route segments. As mentioned, the TTP module can be invoked to make predictions by

    (i) a passenger enquiry; or (ii) the real-time updates of bus arrival information at stops. The

    former arrives on demand and the latter happens periodically. In this paper, for simplicity,

    we focus on predicting the travel time of a bus, given its current location, on remaining seg-

    ments of its journey on the bus route. Moreover, instead of constantly making predictions,


    we assume that TTP is invoked each time RBSM receives an update indicating that the bus has

    crossed a segment (the update includes the GPS data of the bus location); RBSM then passes the

    required input parameters for prediction to TTP. The idea behind the TTP module is to use a few best

    matches for the ongoing trajectory as inputs to a Kalman Filter, which provides a robust and

    efficient means of predicting the travel times on the future segments.

    As illustrated in the figure, TTP relies on the NNS module to search for similar trajectories

    effectively and efficiently. As there are different ways to identify the sample set of

    similar trajectories, different notions of similarity can be explored to ensure the effectiveness

    of TTP. On the other hand, with a massive amount of historical data, it is infeasible to

    compare the trajectory of the current bus journey exhaustively against all the

    historical trajectories in the database. To ensure search efficiency, we create indices of

    trajectories and related patterns in the NNS module to avoid retrieving irrelevant trajectories

    that are not helpful for travel time estimation. In other words, we fetch only a

    relatively small set of candidate trajectories and return them to TTP.

    HTKFTP is introduced above as a general framework to support travel time prediction.

    The remaining task is to devise prediction schemes based on similar trajectories, which

    first invoke the NNS module to retrieve a sample set of trajectories and then use this set to

    estimate travel times in the TTP module. In our data analysis, we observed a

    travel time correlation between two segments and travel time patterns corresponding to

    temporal features such as the hour and the day. Following these observations,

    we introduce two schemes based on passed segments and temporal features. As their names

    suggest, these two schemes use the passed segments (PS) and temporal features (TF) of an

    on-going bus journey, respectively, to identify similar trajectories for prediction.

    4.4 Trajectory Search based on Passed Segments Scheme

    In PS scheme, the prediction is done by finding the historical trajectories "similar" to the

    current one in terms of the travel times on the segments already crossed by the moving bus.

    Thus a similarity measure has to be chosen. As mentioned

    before, conventional algorithms for measuring the similarity of time series, such as the Lp-norm,

    Dynamic Time Warping (DTW) and Longest Common Subsequence (LCSS), are not appropriate

    in this project because they are highly sensitive to errors and outliers

    in the data. As a result, a slight variation in the collected data might result in dramatic

    mismatches between the current trajectory and historical trajectories. In addition, these

    algorithms only evaluate the overall similarity of the whole trajectory, rather than the similarity

    of trajectories with respect to each segment. To address these problems, we

    propose a new similarity measure that takes into account the similarity between two trajectories

    on each segment. Given two trajectories, $\langle t_0, \ldots, t_n \rangle$ and $\langle t'_0, \ldots, t'_n \rangle$, we

    compare each pair of travel times $t_i$ and $t'_i$. If the difference between each pair is less

    than a threshold specific to that segment, the two trajectories are considered "similar". This

    method improves on the conventional distance measures in that, for two "similar"

    trajectories, not only the whole trajectories but also the corresponding segments must be similar.
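
    The segment-wise comparison described above can be sketched as follows. This is only an
    illustration: the function name is_similar, the ongoing trip and the per-segment threshold
    values are assumptions made for the example, while the two historical rows are taken from
    Table 4.1.

```python
from typing import Sequence


def is_similar(current: Sequence[float],
               historical: Sequence[float],
               thresholds: Sequence[float]) -> bool:
    """Return True if every passed segment differs by less than its threshold."""
    passed = len(current)  # segments already crossed by the moving bus
    if len(historical) < passed or len(thresholds) < passed:
        raise ValueError("historical trajectory and thresholds must cover all passed segments")
    return all(abs(current[i] - historical[i]) < thresholds[i] for i in range(passed))


# Hypothetical ongoing trip that has crossed three segments, with assumed thresholds.
current = [22, 150, 60]
thresholds = [10, 20, 5]
trajectory1 = [20, 155, 63, 29]  # first row of Table 4.1
trajectory2 = [32, 89, 61, 33]   # second row of Table 4.1

print(is_similar(current, trajectory1, thresholds))  # True
print(is_similar(current, trajectory2, thresholds))  # False
```

    Trajectory 2 is rejected because its difference on the first segment equals the assumed
    threshold instead of being strictly below it, even though its third segment matches closely.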

    However, this method is limited by the low efficiency that is caused by searching for

    similar travel times based on each segment, especially when the number of historical tra-

    jectories is large. To overcome this, we can allocate travel times into clusters and match the

    current travel time to the cluster averages to find the appropriate cluster.

    To better illustrate this problem, we provide the following example. For a specific route,

    given a number of historical trajectories on this route, we can create a table with attributes

    (columns) corresponding to the travel times of each segment of the route and a record

    corresponding to each historical trajectory (Table 4.1).

    Table 4.1: An example of segment-wise travel times on historical trajectories.

    Trajectory ID    Segment 0    Segment 1    Segment 2    Segment 3
    Trajectory 1     20           155          63           29
    Trajectory 2     32           89           61           33
    Trajectory 3     15           262          55           37
    Trajectory 4     93           90           73           21
    Trajectory 5     68           75           77           26
    ...              ...          ...          ...          ...
    Trajectory M     $t_0$        $t_1$        $t_2$        $t_3$

    By using the clustering algorithm discussed next, we partition each column

    into several non-overlapping ranges. Each range contains at least one value and each value

    falls in exactly one range. Table 4.1 can thus be transformed into a table such as Table 4.2,

    where the number of ranges for each segment is not necessarily equal.


    Table 4.2: An example of partitioned segment-wise travel times after application of the clustering algorithm.

    Segment 0    Segment 1     Segment 2    Segment 3
    15; 20       75; 89; 90    55           21
    32           155; 180      61; 63       26; 29
    68           262           69; 73       33; 36; 37
    82; 93       -             77           -

    4.5 The Clustering Algorithm

    In this section we consider an algorithm used to partition each sequence of travel time

    values, $ST_i$, into a sequence of clusters, $SC_i$. Since $ST_i$ is a sequence of numerical values,

    it can be represented in a one-dimensional space, and splitting such a sequence amounts to

    allocating a set of one-dimensional data into clusters. Lee et al. (2012) compared two

    clustering algorithms, namely K-means and V-clustering. As they pointed out in their work,

    the K-means algorithm has two limitations. Firstly, the initial cluster centroids are chosen

    randomly and different choices may cause different clustering results. Secondly, there is no

    generally accepted way to determine the value of K, so it is hard to choose

    a good value. Using experiments with real-world data, they also found that V-clustering

    performs better than K-means (in the passed segment scheme). So, in

    this study, we concentrate only on the V-clustering algorithm.

    This V-clustering algorithm was introduced by Yuan et al. (2010) to allocate a sorted

    list of one-dimensional data into clusters. In this algorithm, a list of values is first sorted.

    Then it is split into clusters in an iterative manner. At each iteration, the list is split into two

    parts and the weighted average variance (WAV) is calculated for the resulting child lists.

    An optimum split is found out that minimizes the WAV of the resulting child lists. The

    WAV for a split at the $i$th element of the list is defined in Equation 4.2.

    $$\mathrm{WAV}_i = \frac{|L_1^i|}{|L|}\,\mathrm{Var}(L_1^i) + \frac{|L_2^i|}{|L|}\,\mathrm{Var}(L_2^i) \qquad (4.2)$$

    where $|L_1^i|$ and $|L_2^i|$ are the cardinalities of the resulting child lists for the $i$th split and

    $\mathrm{Var}(L_1^i)$ and $\mathrm{Var}(L_2^i)$ are their respective variances. The list is recursively partitioned so

    that the running time of the clustering algorithm for a segment with $M$ historical travel

    times becomes $O(\log M)$. Hence, the running time for the entire trajectory database is


    $O(N \log M)$ (which is fast), where $N$ is the total number of segments on the route. The

    iteration is stopped when each cluster is left with a minimum number of travel times (or

    minimum number of trajectories, MNT) which is a tunable parameter (i.e. its value is

    decided to strike a balance between minimizing the errors in prediction and maximizing the

    computational speed). Each cluster for a segment is associated with a cluster average, i.e.,

    the average of all the travel times in it. The selection of the values of various parameters of

    the clustering algorithm is made after the experiments with the real-world data as discussed

    in Chapter 6.
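
    A minimal sketch of the recursive splitting described in this section is given below. It
    rests on one interpretation of the stopping rule, namely that a list is split only if both
    child lists would keep at least MNT values; this reading, the function names wav and
    v_clustering, and the choice of the population variance for Var are assumptions made for
    illustration, not details confirmed by the text. The input is the Segment 1 column of the
    five rows shown in Table 4.1.

```python
from statistics import pvariance
from typing import List, Sequence


def wav(left: Sequence[float], right: Sequence[float]) -> float:
    """Weighted average variance of two child lists (Equation 4.2)."""
    total = len(left) + len(right)
    return len(left) / total * pvariance(left) + len(right) / total * pvariance(right)


def v_clustering(values: Sequence[float], mnt: int = 2) -> List[List[float]]:
    """Split travel times into clusters that each keep at least `mnt` values."""
    if mnt < 1:
        raise ValueError("mnt must be at least 1")

    def split(chunk: List[float]) -> List[List[float]]:
        # Assumed stopping rule: a chunk that cannot give two children
        # of size >= mnt becomes a final cluster.
        if len(chunk) < 2 * mnt:
            return [chunk]
        # Choose the split position that minimises the WAV of the two children.
        best = min(range(mnt, len(chunk) - mnt + 1),
                   key=lambda i: wav(chunk[:i], chunk[i:]))
        return split(chunk[:best]) + split(chunk[best:])

    ordered = sorted(values)  # the list is sorted once before splitting
    return split(ordered) if ordered else []


segment1_times = [155, 89, 262, 90, 75]  # Segment 1 column of Table 4.1
clusters = v_clustering(segment1_times, mnt=2)
averages = [sum(c) / len(c) for c in clusters]  # cluster averages used for matching
print(clusters)  # [[75, 89, 90], [155, 262]]
```

    With only five values and MNT = 2 the result is coarser than the ranges in Table 4.2, which
    reflect the full set of M trajectories.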

    4.6 Nearest Neighbour Search in Passed Segments Scheme

    Given a current trajectory $\langle t_0, t_1, t_2 \rangle$ with the passed segments $S_0$, $S_1$ and $S_2$, suppose

    $t_2$ falls in a certain cluster for $S_2$; we take that cluster as the match to $t_2$. All trajectories

    whose travel times on $S_2$ fall in the matching cluster are marked. Matching is usually done

    by finding the cluster for the particular segment whose cluster average is closest to $t_2$

    (which is the current trajectory's actual travel time on $S_2$). The same operation is applied

    to $S_1$ and $S_0$, and we can then find the trajectories whose travel times on the three passed

    segments fall in all the matched clusters. This method, known as Segment filtering, was introduced

    by Lee et al. (2012). Since, for each of the three passed segments, the historical trajectories'

    travel times are similar to those of the current trajectory, they can be considered "similar" to the

    current trajectory and can be used for prediction. However, the Segment filtering method

    has a limitation when the number of historical trajectories is small (i.e.


    searched to find the match.
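
    The Segment filtering step can be illustrated with the short sketch below. It assumes that
    the clusters of each segment are available as lists of travel times (here the first three
    columns of Table 4.2) and that a trajectory belongs to a cluster when its travel time lies
    within that cluster's range. The names closest_cluster and segment_filter and the ongoing
    trip are hypothetical.

```python
from typing import Dict, Sequence, Set


def closest_cluster(clusters: Sequence[Sequence[float]], travel_time: float) -> int:
    """Index of the cluster whose average is closest to `travel_time`."""
    averages = [sum(c) / len(c) for c in clusters]
    return min(range(len(clusters)), key=lambda k: abs(averages[k] - travel_time))


def segment_filter(history: Dict[str, Sequence[float]],
                   clusters_per_segment: Sequence[Sequence[Sequence[float]]],
                   current: Sequence[float]) -> Set[str]:
    """IDs of trajectories that fall in the matched cluster on every passed segment."""
    matched = set(history)
    for seg, travel_time in enumerate(current):
        clusters = clusters_per_segment[seg]
        k = closest_cluster(clusters, travel_time)
        low, high = min(clusters[k]), max(clusters[k])  # ranges do not overlap
        matched &= {tid for tid, times in history.items() if low <= times[seg] <= high}
    return matched


history = {  # the five rows shown in Table 4.1
    "Trajectory1": [20, 155, 63, 29],
    "Trajectory2": [32, 89, 61, 33],
    "Trajectory3": [15, 262, 55, 37],
    "Trajectory4": [93, 90, 73, 21],
    "Trajectory5": [68, 75, 77, 26],
}
clusters_per_segment = [  # first three columns of Table 4.2
    [[15, 20], [32], [68], [82, 93]],
    [[75, 89, 90], [155, 180], [262]],
    [[55], [61, 63], [69, 73], [77]],
]
current = [30, 85, 62]  # hypothetical travel times on the three passed segments

print(segment_filter(history, clusters_per_segment, current))  # {'Trajectory2'}
```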

    4.7 Similarity based on Temporal Features

    Besides the passed segment (PS) scheme, we also propose a scheme that uses features inside

    the historical data that are directly related to the travel time. Using similar trajectories found

    from the PS scheme, we can provide satisfactory predictions. However, this method cannot

    guarantee accuracy under all the circumstances. For example, when unusual events happen

    on a future segment, it is hard to make a reliable prediction from historical data because the

    events might never have happened in the history (limited by the amount of historical data

    collected). Fortunately, resorting to features related to traffic information on the current

    segment, we can make predictions of travel times by first selecting trajectories by matching

    the temporal features and then using the PS scheme. For example, the time when a bus

    enters a segment is important because the traffic changes with the time of day. During

    peak hours in the morning and evening, congestion occurs with

    high probability. Also, intuitively, the segment-wise travel times on two trajectories close

    together temporally may bear a high correlation with each other. As verified by Figure 3.6

    in Chapter 3, the weekday and weekend trips have different variances in their space-time

    trajectories. Hence, in the final hybrid scheme, in order to make predictions for an ongoing

    trip, the day on which it is occurring is first used to select weekday or weekend trajectories.

    On this set of trajectories the TF (temporal neighbourhood) and PS schemes are applied in

    sequence to find the final set of similar trajectories.
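
    The first two stages of the hybrid scheme (day type, then temporal neighbourhood) might be
    sketched as follows. The Trip record, the definition of the weekend as Saturday and Sunday,
    and the 30-minute window are assumptions made for illustration; the text does not specify
    them. The PS scheme would then be applied to the trips returned here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Sequence


@dataclass
class Trip:
    trip_id: str
    entry_time: datetime           # time at which the bus entered the current segment
    travel_times: Sequence[float]  # segment-wise travel times


def is_weekend(moment: datetime) -> bool:
    return moment.weekday() >= 5   # Saturday = 5, Sunday = 6 (assumed day types)


def minutes_of_day(moment: datetime) -> int:
    return moment.hour * 60 + moment.minute


def temporal_candidates(history: Sequence[Trip], now: datetime,
                        window_min: int = 30) -> List[Trip]:
    """Keep trips of the same day type whose time of day is within the window."""
    same_day_type = [t for t in history if is_weekend(t.entry_time) == is_weekend(now)]
    # Wrap-around at midnight is ignored in this sketch.
    return [t for t in same_day_type
            if abs(minutes_of_day(t.entry_time) - minutes_of_day(now)) <= window_min]


history = [
    Trip("a", datetime(2014, 3, 3, 8, 50), [20, 155, 63]),   # Monday morning
    Trip("b", datetime(2014, 3, 4, 17, 30), [32, 89, 61]),   # Tuesday evening
    Trip("c", datetime(2014, 3, 8, 9, 0), [15, 262, 55]),    # Saturday morning
]
now = datetime(2014, 3, 5, 9, 5)                             # Wednesday morning

print([t.trip_id for t in temporal_candidates(history, now)])  # ['a']
```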

    4.8 Summary

    The final refined set of similar trajectories that results from applying the hybrid

    scheme introduced above is used to predict travel times on the upcoming segments

    of the ongoing trajectory. Experiments supporting our claim that the hybrid scheme is

    more effective for prediction than the individual schemes were carried out with real-world data,

    as discussed in Chapter 6 on performance evaluation. The prediction algorithm based on the

    Kalman Filter and the modifications made to the base Kalman algorithm are explained in

    detail in the next chapter.


    CHAPTER 5

    THE PREDICTION ALGORITHM

    In Chapter 4, we explored the various schemes to search for similar historical trajectories

    that are effective for travel time prediction (TTP). The problem now is to use the travel

    times of the identified similar trajectories to predict those of the current trajectory. The simplest

    way to predict the travel time on an upcoming segment of the current trip is to use the mean

    (or median) of all the travel times from the identified similar trajectories on the corresponding

    upcoming segment. Another way is to give weights to the individual trajectories before

    calculating the mean. The weight given to a similar trajectory can be the inverse square of

    its Euclidean distance¹ from the ongoing trajectory, as explained in Larose (2005). However,

    in this study, we focus on a robust, short-term prediction technique based on the Kalman

    Filter (KF), which can take into account the associated variability to a certain extent. Before

    moving on, we review some previous work involving the Kalman Filter for travel time prediction,

    including work attempted under the heterogeneous traffic conditions prevalent in

    India.
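
    For comparison with the KF-based approach, the weighted-mean alternative mentioned above
    can be sketched as follows. The function name, the numbers and the small constant eps
    (added so that a historical trajectory at zero distance does not cause a division by zero)
    are assumptions made for this illustration.

```python
from typing import Sequence


def inverse_square_weighted_mean(travel_times: Sequence[float],
                                 distances: Sequence[float],
                                 eps: float = 1e-9) -> float:
    """Mean of travel times weighted by the inverse square of each distance."""
    if not travel_times or len(travel_times) != len(distances):
        raise ValueError("need one distance per travel time and at least one trajectory")
    # eps is an assumed safeguard for a distance of exactly zero.
    weights = [1.0 / (d * d + eps) for d in distances]
    return sum(w * t for w, t in zip(weights, travel_times)) / sum(weights)


# Hypothetical travel times of two similar trajectories on the upcoming segment,
# and their Euclidean distances from the ongoing trajectory.
upcoming_times = [63.0, 55.0]
distances = [5.0, 10.0]

prediction = inverse_square_weighted_mean(upcoming_times, distances)
print(round(prediction, 3))  # 61.4
```

    The closer trajectory receives four times the weight of the farther one, so the prediction
    lies nearer to 63 than the plain mean of 59 would.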

    5.1 Travel Time Prediction using Kalman Filter

    The first introduction of the Kalman Filter dates back to 1960, when Kalman (1960) published

    his famous paper describing a recursive solution to the discrete-data linear filtering problem.

    The Kalman filter is a set of mathematical equations that provides an efficient computational

    (recursive) means to estimate the state of a process, in a way that minimizes the mean of

    the squared error. As mentioned in Welch and Bishop (2006), the filter is very powerful in

    several aspects: it supports estimations of past, present, and even future states, and it can

    do so even when the precise nature of the modelled system is unknown.

    ¹ Square root of the sum of squares of the differences between the travel times on corresponding segments of the historical

    and the current trajectory (up to the segments crossed in the current trip).

    In the literature of travel time prediction, Chein and Kuchipudi (2002), Liu et al. (2006),

    Nanthawichit et al. (2003), Chen and Chein (2001) and Yang (2005) are some of the earliest

    to introduce KF. Nanthawichit et al. (2003) and Yang (2005) explored the possibility of

    using GPS probe vehicle data in KF for travel time prediction. Vanajakshi et al. (2009)

    is one of the earliest attempts that used KF with GPS probe vehicle data for short-term

    travel time prediction under heterogeneous traffic conditions such as those prevalent in In-

    dia. From their travel time variation plots (across the route), the authors concluded that

    the travel time patterns along the route were more related for consecutive vehicles (with a

    headway15 min) on the same day. Weekly and daily patterns were not as significant as

    the above one. Hence, they used the travel times of the previous two vehicles for predicting

    the travel time of the test vehicle (the ongoing trip). However, when the headways between

    the consecutive vehicles are more (1 hour), the accuracy of the approach decreases (This

    is a serious issue when the previous vehicles passed during an off-peak hour and the test

    vehicle passes in a peak hour, or vice versa). Since the inputs to KF in this method are fixed

    most of the time, there is no means to restore the accuracy once one of the above-mentioned

    issues creeps in. Hence, there was a need to modify the method so that it uses dynamic inputs

    for prediction that reflect the traffic conditions prevailing at the moment. This is

    where the similar trajectory search discussed in Chapter 4 can be helpful. Based on the

    latest actual travel times of the test vehicle, the trajectory search algorithm finds the historical

    trajectories which occurred under traffic conditions similar to the current ones. In

    Chapter 6, we show using real-world data that the dynamic input method outperforms the

    static input method. In the following section, we discuss the base KF algorithm as described

    in Vanajakshi et al. (2009). In the subsequent sections, we discuss the changes made

    to both the base KF algorithm and the trajectory search algorithm to effectively integrate

    them for travel time prediction.

    5.2 The Base KF Algorithm

    It is assumed that the evolution of travel time between the various segments is governed by

    $$t_{i+1} = a_i t_i + w_i \qquad (5.1)$$

    where $t_i$ is the travel time taken for covering $S_i$ (the $i$th subsection), $a_i$ is a parameter that

    relates the travel time taken in $S_i$ to the travel time taken in $S_{i+1}$, and $w_i$ is the process

    disturbance associated with $S_i$. The measurement process was assumed to be governed by

    $$z_i = t_i + v_i \qquad (5.2)$$

    where $z_i$ is the measured time of travel in $S_i$ and $v_i$ is the measurement noise. It was further

    assumed that $w_i$ and $v_i$ are zero-mean white Gaussian noise signals, with $Q_i$ and $R_i$ being

    their corresponding variances.

    The prediction algorithm requires as input at least two trajectories in the form of

    segment-wise travel times. The trajectory that is more similar to the current one is called the

    base trajectory (denoted by $T_{base}$) and the other one is called the correction trajectory (denoted

    by $T_{corr}$). The data obtained from $T_{base}$ were used to obtain the value of $a_i$ for each subsection.

    The data from $T_{corr}$ were used in the prediction algorithm to obtain the estimate of the

    travel time of the test (or ongoing) trajectory (denoted by $T_{test}$). The steps

    involved in the algorithm are as follows:

    1. The travel time data from $T_{base}$ were used to obtain the value of $a_i$ through

    $a_i = t^{T_{base}}_{i+1} / t^{T_{base}}_i$, $i = 1, \ldots, (N-1)$, where $t^{T_{base}}_i$ is the travel time taken in $T_{base}$ to

    cover $S_i$.

    2. The discretisation is carried out over space rather than over time (as is done in traditional

    applications of the KF). Let $t^{T_{test}}_i$ denote the travel time taken in $T_{test}$ to

    cover $S_i$. It is assumed that $E[t^{T_{test}}_1] = \hat{t}_1$ and $E[(t^{T_{test}}_1 - \hat{t}_1)^2] = P_1$, where

    $\hat{t}_1$ is the estimate of the travel time in $T_{test}$ on $S_1$.

    3. For $i = 2, \ldots, (N-1)$, the following steps are performed:

    (a) The a priori estimate of the travel time is calculated using $\hat{t}^-_{i+1} = a_i \hat{t}^+_i$,

    where the superscript $-$ denotes the a priori estimate and the superscript $+$

    denotes the a posteriori estimate.

    (b) The a priori error variance (denoted by $P^-$) was calculated using $P^-_{i+1} = a_i P^+_i a_i + Q_i$.

    (c) The Kalman gain (denoted by $K$) was calculated using $K_{i+1} = P^-_{i+1} / (P^-_{i+1} + R_{i+1})$.

    (d) The a posteriori travel time estimate and error variance were calculated using,

    respectively, $\hat{t}^+_{i+1} = \hat{t}^-_{i+1} + K_{i+1}[z_{i+1} - \hat{t}^-_{i+1}]$ and $P^+_{i+1} = [I - K_{i+1}] P^-_{i+1}$,

    where the data measured from $T_{corr}$ were used for providing the values of $z_{i+1}$

    in the equation to calculate $\hat{t}^+_{i+1}$.

    Thus, the objective here is to predict the travel times of $T_{test}$ using the travel time data

    obtained from $T_{base}$ and $T_{corr}$. When $T_{test}$ is in $S_i$, its travel time for $S_{i+1}$, which is denoted

    by $t_{i+1}$, is predicted. The KF algorithm works like a predictor-corrector algorithm.

    The a posteriori estimate of $t_i$ of $T_{test}$ is used to obtain the a priori estimate of $t_{i+1}$

    (this being the prediction step) and then the measurement of the travel time of $T_{corr}$ in $S_{i+1}$

    (which is denoted by $z_{i+1}$ in the equations in step 3(d) above) is used to obtain the a posteriori

    estimate of $t_{i+1}$ of $T_{test}$ (this being the correction step). In the following section, we discuss

    the modifications made to the trajectory search algorithm and the base KF algorithm

    in order to integrate them and to tackle a few issues concerned with the variance of travel

    times.
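
    The predictor-corrector recursion of this section can be sketched in a few lines. The sketch
    makes several simplifying assumptions that are not stated in the text: lists are 0-indexed,
    the recursion starts from the initial estimate on the first segment, and the variances Q and
    R are constant across segments. The function name base_kf and all numbers are illustrative.

```python
from typing import List, Sequence, Tuple


def base_kf(base: Sequence[float], corr: Sequence[float],
            t1_hat: float, p1: float, q: float, r: float) -> List[Tuple[float, float]]:
    """Return (a priori, a posteriori) travel time estimates for segments 2..N."""
    if len(base) != len(corr):
        raise ValueError("base and correction trajectories must have the same length")
    t_post, p_post = t1_hat, p1            # initial estimate and error variance (step 2)
    estimates = []
    for i in range(len(base) - 1):
        a = base[i + 1] / base[i]          # step 1: ratio taken from the base trajectory
        t_prior = a * t_post               # step 3(a): a priori estimate
        p_prior = a * p_post * a + q       # step 3(b): a priori error variance
        # step 3(c): Kalman gain; assumes p_prior + r > 0
        gain = p_prior / (p_prior + r)
        # step 3(d): correction with the measurement from the correction trajectory
        t_post = t_prior + gain * (corr[i + 1] - t_prior)
        p_post = (1.0 - gain) * p_prior
        estimates.append((t_prior, t_post))
    return estimates


# Hypothetical three-segment example.
estimates = base_kf(base=[10.0, 20.0, 10.0], corr=[12.0, 22.0, 12.0],
                    t1_hat=12.0, p1=1.0, q=1.0, r=1.0)
for prior, posterior in estimates:
    print(round(prior, 3), round(posterior, 3))
```

    Setting r to zero makes the gain equal to one, so the a posteriori estimates simply follow
    the correction trajectory, which is a convenient sanity check on the recursion.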

    5.3 Integration of Trajectory Search and Prediction algorithms

    As we discussed in the previous section, the KF based algorithm needs only two best

    matched trajectories for travel time prediction. Based on the actual travel times received

    in real-time from the test vehicle, the trajectory search algorithm finds similar historical

    trajectories with travel time patterns matching that of the current one. The task now is to

    rank the matched trajectories based on some metric and send the top two to the prediction

    algorithm. To accomplish this, for each matched trajectory, its Euclidean distance from the

    test trajectory is computed using the equation

    $$ED = \sqrt{(t^{T_{test}}_1 - t^{T_{hist}}_1)^2 + (t^{T_{test}}_2 - t^{T_{hist}}_2)^2 + \cdots + (t^{T_{test}}_m - t^{T_{hist}}_m)^2} \qquad (5.3)$$

    where $ED$ is the Euclidean distance between the test trajectory and a matched historical

    trajectory, $t^{T_{test}}_i$ is the travel time on $S_i$ for the test vehicle, $t^{T_{hist}}_i$ is the travel time on

    $S_i$ for a matched historical trip, and $m$ is the number of segments crossed by the test vehicle

    when the request is made. The above Euclidean distance gives the measure of similarity


    Figure 5.1: Variation of travel time variance across the segments of 19B route

    between two trajectories with respect to their individual segment travel times. The matched

    trajectories are now ranked according to the increasing values of their EDs from the test

    trajectory. The top two are sent to the prediction algorithm. As the test vehicle moves from

    one segment to the next one, with the newly available actual travel time of test vehicle, the

    trajectory search algorithm again finds the best matches from history, ranks them and sends

    the top two to the prediction algorithm, which updates the previous predictions with more

    accurate ones, thus making the process dynamic in nature.
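
    The ranking step can be sketched as below: the matched trajectories are ordered by their
    Euclidean distance (Equation 5.3) from the test trajectory, and the two closest ones are
    returned as the base and correction trajectories. The function names and the test trip are
    assumptions; the historical rows are taken from Table 4.1.

```python
import math
from typing import Dict, Sequence, Tuple


def euclidean_distance(test: Sequence[float], hist: Sequence[float]) -> float:
    """Distance over the m segments already crossed by the test vehicle."""
    m = len(test)
    return math.sqrt(sum((test[i] - hist[i]) ** 2 for i in range(m)))


def top_two(test: Sequence[float],
            matched: Dict[str, Sequence[float]]) -> Tuple[str, str]:
    """IDs of the base and correction trajectories (closest and second closest)."""
    if len(matched) < 2:
        raise ValueError("the prediction algorithm needs at least two matched trajectories")
    ranked = sorted(matched, key=lambda tid: euclidean_distance(test, matched[tid]))
    return ranked[0], ranked[1]


matched = {  # rows of Table 4.1, assumed here to be the matched set
    "Trajectory1": [20, 155, 63, 29],
    "Trajectory2": [32, 89, 61, 33],
    "Trajectory3": [15, 262, 55, 37],
}
test_trip = [22, 150, 60]  # hypothetical travel times on the three crossed segments

t_base, t_corr = top_two(test_trip, matched)
print(t_base, t_corr)  # Trajectory1 Trajectory2
```

    Calling top_two again after each newly crossed segment, with the extended test trip,
    reproduces the dynamic re-ranking described above.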

    5.4 Modifications

    High variance in travel times during certain periods of the day and on certain segments,

    leading to higher prediction errors on some trips or segments, was the main issue faced

    by the existing algorithm. As can be seen in the box plots in Figure 3.3 (Chapter 3), during

    the peak hours, besides the median travel time (thick line inside the box), the variance of

    travel times also increases (indicated by the increased height of the box). Figure 5.1

    shows that the variance of travel times is also high in certain segments on the route. Each

    line in the plot is obtained by calculating the variance of travel times at each segment for

    the trips that occurred in a two-hour band in the history.

    To address the high variance (to some extent), the $Q_i$ and $R_i$ values which represent
