high-performance analyygtics on large- scale gps taxi trip ...hgong/gps/zhang.pdf · – taxi trip...
TRANSCRIPT
![Page 1: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/1.jpg)
High-Performance Analytics on Large-g y gScale GPS Taxi Trip Records in NYC
Jianting Zhang
Department of Computer ScienceThe City College of New YorkThe City College of New York
![Page 2: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/2.jpg)
OutlineOutline
•Background and Motivation•Background and Motivation
•Parallel Taxi data management on GPUsParallel Taxi data management on GPUs
•Efficient shortest path computation and applications
![Page 3: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/3.jpg)
Background and Motivation
Taxicabs•13,000 Medallion taxi cabs•License priced at $600, 000 in 2007•Car services and taxi services
Taxi trip records•~170 million trips (300 million
are separate
3
p (passengers) in 2009•1/5 of that of subway riders and 1/3 of that of bus riders in NYC
![Page 4: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/4.jpg)
Background and MotivationBackground and MotivationOver all distributions of trip distance, time, speed and fare (2009)
Count-Distance Distribution
15000000
20000000
nt
Count-Time Distribution
1500000020000000
nt
0
5000000
10000000
<= 0
.0
( 0.
8, 1
.0]
( 1.
8, 2
.0]
( 2.
8, 3
.0]
( 3.
8, 4
.0]
( 4.
8, 5
.0]
( 5.
8, 6
.0]
( 6.
8, 7
.0]
( 7.
8, 8
.0]
( 8.
8, 9
.0]
( 9.
8, 1
0.0]
( 10.
8, 1
1.0]
( 11.
8, 1
2.0]
( 12.
8, 1
3.0]
( 13.
8, 1
4.0]
( 14.
8, 1
5.0]
( 15.
8, 1
6.0]
( 16.
8, 1
7.0]
( 17.
8, 1
8.0]
( 18.
8, 1
9.0]
( 19.
8, 2
0.0]
Cou
n
05000000
10000000
<= 0
.0
( 2.
0, 3
.0]
( 5.
0, 6
.0]
( 8.
0, 9
.0]
11.
0, 1
2.0]
14.
0, 1
5.0]
17.
0, 1
8.0]
20.
0, 2
1.0]
23.
0, 2
4.0]
26.
0, 2
7.0]
29.
0, 3
0.0]
32.
0, 3
3.0]
35.
0, 3
6.0]
38.
0, 3
9.0]
41.
0, 4
2.0]
44.
0, 4
5.0]
47.
0, 4
8.0]
> 50
.0
Cou
n
( ( ( ( ( ( ( ( ( (Trip Distance (mile)
( ( ( ( ( ( ( ( ( ( ( ( (
TripTime (Minute)
Count-Speed Distribution
15000000
20000000
Count-Fare Distribution
2500000030000000
0
5000000
10000000
15000000
= 0
.0,
2.0]
, 4.
0],
6.0]
, 8.
0] 1
0.0]
12.
0] 1
4.0]
16.
0] 1
8.0]
20.
0] 2
2.0]
24.
0] 2
6.0]
28.
0] 3
0.0]
32.
0] 3
4.0]
36.
0] 3
8.0]
40.
0] 4
2.0]
44.
0] 4
6.0]
48.
0] 5
0.0]
Cou
nt
05000000
10000000150000002000000025000000
= 0
.0 2
.0]
4.0
] 6
.0]
8.0
]10
.0]
12.0
]14
.0]
16.0
]18
.0]
20.0
]22
.0]
24.0
]26
.0]
28.0
]30
.0]
32.0
]34
.0]
36.0
]38
.0]
40.0
]42
.0]
44.0
]46
.0]
48.0
]50
.0]
Coun
t
<(
1.0
( 3.
0(
5.0
( 7.
0(
9.0,
( 11.
0,( 1
3.0,
( 15.
0,( 1
7.0,
( 19.
0,( 2
1.0,
( 23.
0,( 2
5.0,
( 27.
0,( 2
9.0,
( 31.
0,( 3
3.0,
( 35.
0,( 3
7.0,
( 39.
0,( 4
1.0,
( 43.
0,( 4
5.0,
( 47.
0,( 4
9.0,
Speed (MPH)
<=(
1.0,
( 3.
0,(
5.0,
( 7.
0,(
9.0,
( 1
1.0,
( 1
3.0,
( 1
5.0,
( 1
7.0,
( 1
9.0,
( 2
1.0,
( 2
3.0,
( 2
5.0,
( 2
7.0,
( 2
9.0,
( 3
1.0,
( 3
3.0,
( 3
5.0,
( 3
7.0,
( 3
9.0,
( 4
1.0,
( 4
3.0,
( 4
5.0,
( 4
7.0,
( 4
9.0,
Fare ($)
![Page 5: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/5.jpg)
Background and MotivationOther types of Origin-Destination (OD) Data
Social network activities
Cellular phone calls
5
![Page 6: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/6.jpg)
Background and Motivation
• How to manage OD data? – Geographical Information System (GIS)Geographical Information System (GIS)– Spatial Databases (SDB)– Moving Object Databases (MOD)Moving Object Databases (MOD)
• How good are they? P tt d f ll t f d t – Pretty good for small amount of data
– But, rather poor for large-scale data
![Page 7: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/7.jpg)
Background and MotivationBackground and Motivation• Example 1:
– Loading 170 million taxi pickup locations into PostgreSQL– UPDATE t SET PUGeo = ST_SetSRID(ST_Point("PULong","PuLat"),4326);
105 8 hours!– 105.8 hours!
• Example 2: – Finding the nearest tax blocks for 170 million taxi pickup locationsFinding the nearest tax blocks for 170 million taxi pickup locations
using open source libspatiaindex+GDAL
– 30.5 hours!
I do not have time to wait... C d b tt ?Can we do better?
![Page 8: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/8.jpg)
Background and MotivationMulticore CPUs
Cloud computing+MapReduce+HadoopCloud computing MapReduce Hadoop
GPGPU Computing: From Fermi to Kepler
![Page 9: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/9.jpg)
http://www.amazon.com/EVGA-SuperClocked-Dual-Link-Graphics-06G-P4-2791-KR/dp/B00BL8BX7O
Nvidia GTX Titan GPU sold at Amazon
4.5 Teraflops of single precision and 1.3 Teraflops of double precision
![Page 10: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/10.jpg)
ASCI Red: 1997 First 1 Teraflops (sustained) system with 9298 Intel
•Feb. 2013 •7 1 billion transistors (551mm²)(sustained) system with 9298 Intel
Pentium II Xeon processors (in 72 Cabinets)$$$?
7.1 billion transistors (551mm ) •2,688 processors •Max bandwidth 288.4 GB/s•PCI-E peripheral device$$$?
Space?Power?
p p•250 W (17.98 GFLOPS/W -SP)• Suggested retail price: $999
Wh t d t d i d i th t iWhat can we do today using a device that is more powerful than ASC I read 16 years ago?
![Page 11: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/11.jpg)
Background and MotivationBackground and Motivation
• The goal is to design a data management system to• The goal is to design a data management system to efficiently manage large-scale OD/trajectory data on massively data parallel GPUs y p
• With the help of new data models, data structures and algorithmsg
• To cut the runtimes from hours to seconds on a single commodity GPU device
• And support interactive queries and visual explorations
![Page 12: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/12.jpg)
OutlineOutline
•Background and Motivation•Background and Motivation
•Parallel Taxi data management on GPUsParallel Taxi data management on GPUs
•Efficient shortest path computation and applications
![Page 13: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/13.jpg)
O-D data management on GPUsO D data management on GPUsDay
M th
Physical Data Layout
Month
Year Raw data
Compression, aggregation and indexing
Spatial Joins and Shortest Path ComputationSpatial Joins and Shortest Path Computation
(Zhang, Gong, Kamga and Gruenwald, 2012)
![Page 14: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/14.jpg)
O-D data management on GPUsO D data management on GPUsTrip_Pickup_LocationTrip_Dropoff_Location
time_between_servicedistance between service
Start_Zip_CodeEnd_Zip_Code
78 9
Trip Pickup DateTimeStart Lon
d s a ce_be wee _se v ce
start_xt t 3Trip_Pickup_DateTime
Trip_Dropoff_DateTimeStart_LonStart_LatEnd_LonEnd_Lat
Passenger Count
start_yend_xend_y (local
j ti )2
3
5
Medallion#Shift#Trip#
Passenger_CountFare_AmtTolls_AmtTip_AmtTrip_Time
Trip_Distance
projection)
1 4
5
6
Payment_TypeSurchargevendor_name
d t l d d 1110
Total_AmtRate_Code
date_loadedstore_and_forward
11
(Zhang, Gong, Kamga and Gruenwald, 2012)
![Page 15: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/15.jpg)
O-D data management on GPUsO D data management on GPUsYear
City
Month Day of the Year
Week of the Year
y
Borough
Community T
Top level grid
DayHour
Day of the Week
DistrictPolice Precinct
Census Tract
Census
Tax Lot
TaxLevel k
grid Peak/Block
Street Segment
Tax Block
Level 0 grid
grid
15/30-minutes
Peak/off-peak
Pickup/drop-off locations Pickup/drop-off timestamps
NYC taxi trip records Auxiliary data (weather, events…)
(Zhang, You and Gruenwald, 2012)
![Page 16: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/16.jpg)
O-D data management on GPUs(Zhang, Gong, Kamga and Gruenwald, 2012)
![Page 17: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/17.jpg)
O D data management on GPUsO-D data management on GPUs
P2P-TP2N‐D P2P-D
(Zhang, You and Gruenwald, 2012)
![Page 18: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/18.jpg)
O-D data management on GPUsO D data management on GPUsSingle-Level Grid-File based Spatial Filtering
Nested-Loop based Refinement
Vertices (polygon/
Points
•Perfect coalesced (polygon/polyline)
memory accesses •Utilizing GPU floating point power
(Zhang, You and Gruenwald, 2012)
![Page 19: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/19.jpg)
O-D data management on GPUsData• Data– Taxi trip records: 300 million in two years (2008-2010),
~170 million in 2009 (~150 million in Manhattan) – NYC DCPLION street network data: 147,011 street
segments – NYC Census 2000 blocks: 38,794– NYC MapPluto Tax blocks: 735,488 in four boroughs
(excluding SI) and 43,252 in Manhattan• Hardware
– Dell T5400 Dual Quadcore CPUs with 16 GB memory– Nvidia Quadro 6000 with 448 cores and 6 GB memory
![Page 20: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/20.jpg)
O-D data management on GPUs
Top: grid size =256*256resolution=128 feetresolution=128 feet Right: grid size =8192*8192resolution=4 feet
Spatial AggregationSpatial Aggregation
9,424 /326=30X (8192*8192)
Temporal Aggregation
1709/198=8.6X (minute)
1598 /165 = 9.7X (hour)(Zhang, You and Gruenwald, 2012)
![Page 21: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/21.jpg)
O-D data management on GPUs
735 488 t147,011 street segments
38,794 census blocks (470941
i )
735,488 tax blocks (4,698,986 points)
P2P-TP2N‐D P2P-Dpoints)
P2N-D P2P-T P2P-D
15 2 h 30 5 hCPU time
Algorithmic improvement: 3.7X- 15.2 h 30.5 h
10.9 s 11.2 s 33.1 s
CPU time
GPU TimeUsing main-memory: 37.4X
Parallel Acceleration:- 4,900X 3,200X Speedup
(Zhang, You and Gruenwald, 2012), (Zhang and You 2012a) Zhang and You 2012b)
Parallel Acceleration: 24.3X
![Page 22: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/22.jpg)
OutlineOutline
•Background and Motivation•Background and Motivation
•Parallel Taxi data management on GPUsParallel Taxi data management on GPUs
•Efficient shortest path computation and applications
•Mapping Betweenness Centralities
•Outlier Detection
![Page 23: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/23.jpg)
SP computation and applicationsp pp-overview
• Shortest path computation• Shortest path computation– Dijkstra and A*
New generation algorithms– New generation algorithms– Contraction Hierarchy (CH) based
•Open source implementations of CH:
M N•MoNav
•OSRM
•Much faster than•Much faster than ArcGIS NA module
![Page 24: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/24.jpg)
SP computation and applications
• Network Centrality (Brandes, 2008)
st vC )()(Node
Vtvs st
stB vC
)()(based
Edge st eC )()(Edge based
Vtvs st
stB eC
)()(
C b il d i d ft h t t th t d•Can be easily derived after shortest paths are computed
•Mapping node/edge between centrality can reveal theMapping node/edge between centrality can reveal the connection strengths among different parts of cities
![Page 25: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/25.jpg)
SP computation and applicationsSP computation and applicationsMapping Betweenness Centralities
•166 million trips, 25 million unique
•Shortest path computation completes in less than 2 hours (5,952 seconds) on a single CPU core (2 0 GHZ)on a single CPU core (2.0 GHZ)•4200 pairs computation per second •3 orders of magnitude faster than gArcGIS NA
Mapping of Computed Shortest Paths Overlaid with NYC Community Districts Map
![Page 26: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/26.jpg)
SP computation and applicationsMapping Betweenness Centralities (All hours)
![Page 27: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/27.jpg)
SP computation and applicationsMapping Betweenness Centralities (bi-hourly)pp g ( y)
00H 02H 04H 06H 08H 10H
12H 14H 16H18H
20H 22H
Legend:
![Page 28: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/28.jpg)
SP computation and applicationsThe data is not as clean as we had thought…outlier detectiong
Trip_Pickup_LocationTrip_Dropoff_Location
time_between_servicedistance_between_service
Start_Zip_CodeEnd_Zip_Code
78 9
Trip Pickup DateTimeStart Lon
_ _
start_xstart y 3Trip_Pickup_DateTime
Trip_Dropoff_DateTimeStart_LonStart_LatEnd_LonEnd_Lat
Passenger Count
start_yend_xend_y (local projection)2
5
Medallion#Shift#Trip#
Passenger_CountFare_AmtTolls_AmtTip_AmtTrip_Time
Trip_Distance1 46
Payment_TypeSurchargeT l A
vendor_namedate loaded 11
10Meshed up on purpose due to
Total_AmtRate_Code
date_loadedstore_and_forward
p pprivacy concerns
![Page 29: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/29.jpg)
SP computation and applicationsOutlier Detection
•Existing approaches for outlier detection for urban computingExisting approaches for outlier detection for urban computing
•Thresholding: e.g. 200m < dist < 30km
•Locating in unusual ranges of distributions•Locating in unusual ranges of distributions
•Spatial analysis: within a region or a land use type
•Matching trajectory with road segments – treat unmatched ones as outliers
S t h i i l t GPS t hil l•Some techniques require complete GPS traces while we only have O-D locations
•Large scale shortest path computing has not been used for outlier•Large-scale shortest path computing has not been used for outlier detection
![Page 30: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/30.jpg)
SP computation and applicationsSP computation and applicationsThe data is not as clean as we had thought…outlier detection
I dditi• In addition:– Some of the data fields are empty– Pickup and drop-off locations can be p p
in Hudson River – The recorded trip distance/duration
can be unreasonable– ...
O tli d t ti f d t l i d dOutlier detections for data cleaning are needed
![Page 31: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/31.jpg)
SP computation and applicationsRaw Taxi trip data
Type I outlier Match pickup/drop-off point locations to street
segments within Distance D0
(spatial analysis)
Successful?Aggregate unique
Assign pickup/drop-off nodes by picking closer ones
Compute shortest path Type II outlier
gg g q(sid,tid) pairs
CD >D1 AND CD>W*RD?
(network analysis)
CD: Compute shortest distanceRD: Recorded trip distanceUpdate centrality measurements
![Page 32: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/32.jpg)
SP computation and applicationsSP computation and applications-Discussion on outlier detection
• The approach is approximate in nature– Taxi drivers do not always follow shortest path
i ll f h i d h il d– Especially for short trips and heavily congested areas – But we only care about aggregated centrality
measurements and the errors have a chance to be cancelled out by each other
• Increasing D0 will reduce # of type I outliners, but the locations might be mismatched with segmentsthe locations might be mismatched with segments
• Reducing D1 and/or W will increase # of type II outliers but may generate false positives.outliers but may generate false positives.
![Page 33: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/33.jpg)
SP computation and applicationsp pp-Results of outlier detection
Examples of Detected Type II Outliers
•D 200 feet D 3 miles W 2•D0=200 feet, D1=3 miles, W=2 •~2.5 millions (1.5%) type I outliers•~ 18,000 type II outliers
![Page 34: High-Performance Analyygtics on Large- Scale GPS Taxi Trip ...hgong/GPS/Zhang.pdf · – Taxi trip records: 300 million in two years (2008-2010), ~170 million in 2009 (~150 million](https://reader033.vdocuments.net/reader033/viewer/2022042908/5f37bfcbbda2ae00832493a3/html5/thumbnails/34.jpg)
Ongoing Research @ GeoTECIOngoing Research @ GeoTECI•CudaGIS - a general purposed GIS on GPUs
•More GIS modules in addition to indexing/query processing
•Trajectory data management on GPUsj y g•Online moving point location updating
•Segmentation/simplification/compression
M t hi ith d t k•Matching with road networks
•Aggregation and warehousing
•Indexing and query processing similarity join
•Data mining (moving cluster, convoy, swarm...)