![Page 1: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/1.jpg)
Analyzing NYC Transit Data: Taxis, Ubers, and Citi Bikes
Todd Schneider
April 8, 2016
![Page 2: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/2.jpg)
Where to find me
toddwschneider.com
github.com/toddwschneider
@todd_schneider
toddsnyder
![Page 3: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/3.jpg)
Things I’ll talk about
• Taxi, Uber, and Citi Bike data
• Medium data analysis tools and tips
• Where does R fit in?
![Page 4: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/4.jpg)
Taxi and Uber Data
http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
![Page 5: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/5.jpg)
Citi Bike Data
http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/
![Page 6: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/6.jpg)
NYC Taxi and Uber Data
• Taxi & Limousine Commission released public, trip-level data for over 1.1 billion taxi rides from 2009 through 2015
• Some public Uber data available as well, thanks to a FOIL request by FiveThirtyEight
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
![Page 7: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/7.jpg)
Citi Bike Data
• Citi Bike releases monthly data for every individual ride
• Data includes timestamps and locations, plus rider’s subscriber status, gender, and age
https://www.citibikenyc.com/system-data
![Page 8: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/8.jpg)
Generic Analysis Overview
1. Get raw data
2. Write code to process raw data into something more useful
3. Analyze data
4. Write about what you found out
![Page 9: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/9.jpg)
Analysis Tools
• PostgreSQL
• PostGIS
• R
• Command line
• JavaScript
https://github.com/toddwschneider/nyc-taxi-data
https://github.com/toddwschneider/nyc-citibike-data
![Page 10: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/10.jpg)
Raw data processing goals
• Load flat files of varying file formats into a unified, persistent PostgreSQL database that we can use to answer questions about the data
• Do some one-time calculations to augment the raw data
• We want to answer neighborhood-based questions, so we’ll map latitude/longitude coordinates to NYC census tracts
![Page 11: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/11.jpg)
Processing raw data: The reality
• Often messy, raw data can require massaging
• Not fun, takes a while, but is essential
• Specifically: we have to plan ahead a bit, anticipate usage patterns and the questions we’re going to ask, then decide on a schema
![Page 12: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/12.jpg)
Raw Data
![Page 13: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/13.jpg)
Specific issues encountered with raw taxi data
• Some files contain empty lines and unquoted carriage returns 😐
• Raw data files have different formats even within the same cab type 😕
• Some files contain extra columns in every row 😠
• Some files contain extra columns in only some rows 😡
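A minimal sketch of how issues like these could be handled before loading, written in Python for illustration. The function name `clean_rows` and the rule of discarding short rows are assumptions, not the approach used in the linked repository, and it assumes stray carriage returns are noise rather than line terminators.

```python
import csv
import io


def clean_rows(raw_text, expected_columns):
    """Yield rows with exactly expected_columns fields from messy CSV text."""
    # Assumption: stray carriage returns are noise, so normalize Windows
    # line endings and drop any lone "\r" left inside a line.
    normalized = raw_text.replace("\r\n", "\n").replace("\r", "")
    for row in csv.reader(io.StringIO(normalized)):
        if not any(field.strip() for field in row):
            continue  # skip empty lines
        if len(row) < expected_columns:
            continue  # too few fields to trust; a real pipeline might log these
        yield row[:expected_columns]  # drop extra trailing columns


sample = "a,b,c\r\n\r\n1,2,3,extra\n4,5\r,6\n"
cleaned = list(clean_rows(sample, expected_columns=3))
```

Truncating to a fixed column count handles both "extra columns in every row" and "extra columns in only some rows" with the same rule.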
![Page 14: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/14.jpg)
How do we load a bunch of files into a database?
• One at a time!
• A Bash script loops through each raw data file; for each file it executes code to process the data and insert records into a database table
https://github.com/toddwschneider/nyc-taxi-data/blob/master/import_trip_data.sh
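The one-file-at-a-time loop can be sketched in Python as follows. This is only an illustration of the pattern: it uses the standard library's `sqlite3` as a stand-in for PostgreSQL, and the function name `load_files`, the three-column `trips` table, and the CSV header names are all assumptions.

```python
import csv
import sqlite3


def load_files(paths, conn):
    """Load each CSV file, one at a time, into a single trips table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips "
        "(pickup_datetime TEXT, pickup_longitude REAL, pickup_latitude REAL)"
    )
    total = 0
    for path in sorted(paths):
        with open(path, newline="") as f:
            rows = [
                (
                    r["pickup_datetime"],
                    float(r["pickup_longitude"]),
                    float(r["pickup_latitude"]),
                )
                for r in csv.DictReader(f)
            ]
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
        conn.commit()  # commit after every file, so each file is saved as a unit
        total += len(rows)
    return total
```

Committing per file means a failure partway through only requires re-running the file that broke.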
![Page 15: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/15.jpg)
How do we map latitude and longitude to census tracts?
• PostGIS!
• Geographic information system (GIS) for PostgreSQL
• Can do calculations of the form, “is a point inside a polygon?”
• Every pickup/drop off is a point, NYC’s census tracts are polygons
![Page 16: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/16.jpg)
NYC Census Tracts
• 2,166 tracts
• 196 neighborhood tabulation areas (NTAs)
![Page 17: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/17.jpg)
Shapefiles
• Shapefile format describes geometries like points, lines, polygons
• Many shapefiles publicly available, e.g. NYC provides a shapefile that contains definitions for all census tracts and NTAs
• PostGIS includes functionality to import shapefiles
![Page 18: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/18.jpg)
Shapefile Example
![Page 19: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/19.jpg)
PostGIS: ST_Within()
• ST_Within(geom A, geom B) function returns true if and only if A is entirely within B
• A = pickup or drop off point
• B = NYC census tract polygon
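To make the "is a point inside a polygon?" question concrete, here is a small ray-casting sketch in Python. It is not how PostGIS implements ST_Within() (which also handles boundaries, holes, and multi-polygons); the function name and the toy coordinates are assumptions for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside polygon, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's y coordinate can be crossed
        # by a horizontal ray cast from the point toward +x.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


# A toy L-shaped "tract" (hypothetical coordinates, not a real census tract).
tract = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
pickup_inside = point_in_polygon(1, 3, tract)
pickup_outside = point_in_polygon(3, 3, tract)
```

The work grows with the number of polygon edges, which is why running this test against every tract for every trip gets expensive.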
![Page 20: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/20.jpg)
Spatial Indexes
• Problem: determining whether a point is inside an arbitrary polygon is computationally intensive and slow
• PostGIS spatial indexes to the rescue!
![Page 21: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/21.jpg)
Spatial indexes in a nutshell
[Figure: a census tract polygon drawn inside its rectangular bounding box]
![Page 22: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/22.jpg)
Spatial Indexes
• Determining whether a point is inside a rectangle is easy!
• Spatial indexes store rectangular bounding boxes for polygons, then when determining if a point is inside a polygon, calculate in 2 steps:
1. Is the point inside the polygon’s bounding box?
2. If so, is the point inside the polygon itself?
• Most of the time the cheap first check will be false, so we can skip the expensive second step
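The two-step check can be sketched in Python as below. A real spatial index also arranges the boxes in a tree so that most boxes are never examined at all; this sketch only shows the cheap-test-then-expensive-test idea. All names and the toy tracts are assumptions.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(x, y, box):
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def point_in_polygon(x, y, polygon):
    """Ray-casting test; cost grows with the number of edges."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def find_tract(x, y, tracts, boxes):
    """Return (tract_id or None, number of expensive polygon tests run)."""
    exact_tests = 0
    for tract_id, polygon in tracts.items():
        if not in_box(x, y, boxes[tract_id]):
            continue  # step 1 failed: skip the expensive test entirely
        exact_tests += 1
        if point_in_polygon(x, y, polygon):  # step 2
            return tract_id, exact_tests
    return None, exact_tests


# Hypothetical toy tracts: a unit square and a right triangle.
tracts = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(2, 0), (4, 0), (2, 2)],
}
boxes = {tract_id: bounding_box(p) for tract_id, p in tracts.items()}  # computed once
hit = find_tract(2.5, 0.5, tracts, boxes)
box_only = find_tract(3.5, 1.5, tracts, boxes)
miss = find_tract(10, 10, tracts, boxes)
```

The point (3.5, 1.5) shows why step 2 is still needed: it falls inside tract B's bounding box but outside the triangle itself.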
![Page 23: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/23.jpg)
Putting it all together
• Download NYC census tracts shapefile, import into database, create spatial index
• Download raw taxi/Uber/Citi Bike data files and loop through them, one file at a time
• For each file: fix data issues, load into database, calculate census tracts with ST_Within()
• Wait 3 days and voila!
![Page 24: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/24.jpg)
Analysis, a.k.a. “the fun part”
• Ask fun and interesting questions
• Try to answer them
• Rinse and repeat
![Page 25: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/25.jpg)
Taxi maps
• Question: what does a map of every taxi pickup and drop off look like?
• Each trip has a pickup and drop off location, so plot a bunch of dots at those locations
• Made entirely in R using ggplot2
![Page 26: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/26.jpg)
Taxi maps
![Page 27: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/27.jpg)
Taxi maps preprocess
• Problem: R can’t fit 1.1 billion rows in memory
• Solution: preprocess the data by rounding lat/long to 4 decimal places (~10 meters), then count the number of trips at each aggregated point
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L194-L215
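The rounding-and-counting idea, sketched in Python rather than the SQL used in the talk. The function name and sample coordinates are assumptions.

```python
from collections import Counter


def aggregate_points(points, decimals=4):
    """Count trips per (lat, lon) after rounding to `decimals` places."""
    counts = Counter()
    for lat, lon in points:
        counts[(round(lat, decimals), round(lon, decimals))] += 1
    return counts


# Hypothetical pickups: the first two land on the same rounded point.
pickups = [
    (40.75001, -73.99002),
    (40.75004, -73.99001),
    (40.7512, -73.9900),
]
pickup_counts = aggregate_points(pickups)
```

The plotting code then needs only one row per distinct rounded point, with the count available to control dot brightness or size.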
![Page 28: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/28.jpg)
Render maps in R
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R
![Page 29: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/29.jpg)
Data reliability
Every other comment on reddit:
![Page 30: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/30.jpg)
Citi Bike Animation
• Map the position of every Citi Bike over the course of a single day
• Google Maps Directions API for cycling directions
• Leaflet.js for mapping
• Torque.js by CartoDB for animation
![Page 31: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/31.jpg)
Citi Bike Assumptions
• Google Maps cycling directions have a strong bias for dedicated bike lanes on 1st, 2nd, 8th, and 9th Avenues
• Not necessarily the routes riders actually take!
![Page 32: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/32.jpg)
Modeling the relationship between the weather and Citi Bike ridership
![Page 33: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/33.jpg)
Modeling the relationship between the weather and Citi Bike ridership
• Daily ridership data from Citi Bike
• Daily weather data from National Climatic Data Center: temperature, precipitation, snow depth
• Devise and calibrate model in R
![Page 34: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/34.jpg)
Modeling the relationship between the weather and Citi Bike ridership
![Page 35: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/35.jpg)
Model specification
![Page 36: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/36.jpg)
Calibration in R
• Uses the nlsLM() function from the minpack.lm package, which applies the Levenberg–Marquardt algorithm to minimize nonlinear squared error
https://gist.github.com/toddwschneider/bac3350f84b2ff99969d
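For readers curious what Levenberg–Marquardt does under the hood, here is a bare-bones pure-Python sketch. The model specification slide is not reproduced in this transcript, so the curve fitted here, y = a * exp(b * x), is a hypothetical toy model and not the ridership model from the talk. The function name, starting values, and damping schedule are likewise assumptions; in practice nlsLM() does all of this for you.

```python
import math


def fit_exponential(xs, ys, a=1.0, b=0.0, iterations=200):
    """Fit y = a * exp(b * x) by least squares with a basic LM loop."""

    def sse(a, b):
        return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))

    lam = 1e-3  # damping: small acts like Gauss-Newton, large like gradient descent
    current = sse(a, b)
    for _ in range(iterations):
        # Accumulate J^T J and J^T r, where r = y - f(x) and J holds df/da, df/db.
        jtj_aa = jtj_ab = jtj_bb = jtr_a = jtr_b = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = y - a * e
            da, db = e, a * x * e
            jtj_aa += da * da
            jtj_ab += da * db
            jtj_bb += db * db
            jtr_a += da * r
            jtr_b += db * r
        # Solve the damped 2x2 system (J^T J + lam * diag(J^T J)) step = J^T r.
        m_aa = jtj_aa * (1 + lam)
        m_bb = jtj_bb * (1 + lam)
        det = m_aa * m_bb - jtj_ab * jtj_ab
        if det == 0:
            break
        step_a = (m_bb * jtr_a - jtj_ab * jtr_b) / det
        step_b = (m_aa * jtr_b - jtj_ab * jtr_a) / det
        if abs(step_a) + abs(step_b) < 1e-12:
            break  # converged: the proposed step is negligible
        try:
            candidate = sse(a + step_a, b + step_b)
        except OverflowError:
            candidate = math.inf  # wildly bad step; treat as rejected
        if candidate < current:
            a, b, current = a + step_a, b + step_b, candidate
            lam /= 10  # step helped: trust the quadratic model more
        else:
            lam *= 10  # step hurt: damp harder and retry
    return a, b


# Noiseless synthetic data generated from a = 2, b = 0.3.
xs = list(range(10))
ys = [2.0 * math.exp(0.3 * x) for x in xs]
a_hat, b_hat = fit_exponential(xs, ys)
```

The accept-or-reject rule on each step is what lets the method recover from a poor starting guess where plain Gauss-Newton would overshoot.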
![Page 37: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/37.jpg)
Model Results
![Page 38: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/38.jpg)
Airport traffic
• Question: how long will my taxi take to get to the airport?
• LGA, JFK, and EWR are each their own census tracts
• Get all trips that dropped off in one of those tracts
• Calculate travel times from neighborhoods to airports
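A minimal sketch of that calculation in Python. The dictionary keys, neighborhood names, and the choice of the median as the summary statistic are assumptions made for illustration; the talk does the equivalent work in the database.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_minutes_to_airports(trips, airport_tracts):
    """Median trip duration in minutes, keyed by (pickup neighborhood, airport)."""
    durations = defaultdict(list)
    for trip in trips:
        if trip["dropoff_tract"] not in airport_tracts:
            continue  # keep only trips that ended at an airport
        elapsed = trip["dropoff_time"] - trip["pickup_time"]
        key = (trip["pickup_neighborhood"], trip["dropoff_tract"])
        durations[key].append(elapsed.total_seconds() / 60)
    return {key: median(values) for key, values in durations.items()}


# Hypothetical trips for illustration only.
trips = [
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "JFK",
     "pickup_time": datetime(2015, 6, 1, 8, 0), "dropoff_time": datetime(2015, 6, 1, 8, 50)},
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "JFK",
     "pickup_time": datetime(2015, 6, 2, 8, 0), "dropoff_time": datetime(2015, 6, 2, 9, 10)},
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "LGA",
     "pickup_time": datetime(2015, 6, 1, 9, 0), "dropoff_time": datetime(2015, 6, 1, 9, 30)},
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "SoHo",
     "pickup_time": datetime(2015, 6, 1, 9, 0), "dropoff_time": datetime(2015, 6, 1, 9, 15)},
]
airport_medians = median_minutes_to_airports(trips, {"LGA", "JFK", "EWR"})
```

Adding the pickup hour to the grouping key would turn this into a time-of-day breakdown.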
![Page 39: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/39.jpg)
Airport traffic
![Page 40: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/40.jpg)
More fun stuff in the full posts
• On the realism of Die Hard 3
• Relationship between age, gender, and cycling speed
• Neighborhoods with most nightlife
• East Hampton privacy concerns
• What time do investment bankers arrive at work?
![Page 41: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/41.jpg)
“Medium data” analysis tips
![Page 42: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/42.jpg)
What is “medium data”?
No clear answer, but my rough thinking:
• Tiny: fits in spreadsheet
• Small: doesn’t fit in spreadsheet, but fits in RAM
• Medium: too big for RAM, but fits on local hard disk
• Big: too big for local disk, has to be distributed across many nodes
![Page 43: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/43.jpg)
Use the right tool for the job
My personal toolkit (yours may vary!):
• PostgreSQL for storing and aggregating data. Geospatial calculations with PostGIS extension
• R for modeling and plotting
• Command line tools for looping through files, loading data, text processing on input data with sed, awk, etc.
• Ruby for making API calls, scraping websites, running web servers, and sometimes using local Rails apps to organize relational data
• JavaScript for interactivity on the web
![Page 44: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/44.jpg)
R + PostgreSQL
• The R ↔ Postgres link is invaluable! Use R and Postgres for the things they’re respectively best at
• Postgres: persisting data in tables, rote number crunching
• R: calibrating models, plotting
• RPostgreSQL package allows querying Postgres from within R
![Page 45: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/45.jpg)
Tip: pre-aggregate
• Think about how you’re going to access the data, and consider creating intermediate aggregated tables which can be used as building blocks for later analysis
• Example: number of taxi trips grouped by pickup census tract and date/time truncated to the hour
• Resulting table is only 30 million rows, easier to work with than full trips table, and can still answer lots of interesting questions
![Page 46: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/46.jpg)
Pre-aggregating example
CREATE TABLE hourly_pickups AS
SELECT
  date_trunc('hour', pickup_datetime) AS pickup_hour,
  cab_type_id,
  pickup_nyct2010_gid,
  COUNT(*)
FROM trips
WHERE pickup_nyct2010_gid IS NOT NULL
GROUP BY pickup_hour, cab_type_id, pickup_nyct2010_gid;
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L30-L38
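For readers less familiar with SQL, the same grouping logic can be sketched in plain Python. This only mirrors what the query computes on a handful of made-up rows; the function name and sample values are assumptions, and at real scale the database is the right place for this work.

```python
from collections import Counter
from datetime import datetime


def hourly_pickups(trips):
    """Count trips per (pickup hour, cab type, pickup census tract)."""
    counts = Counter()
    for pickup_datetime, cab_type_id, tract_gid in trips:
        if tract_gid is None:
            continue  # mirrors WHERE pickup_nyct2010_gid IS NOT NULL
        # Equivalent of date_trunc('hour', pickup_datetime)
        pickup_hour = pickup_datetime.replace(minute=0, second=0, microsecond=0)
        counts[(pickup_hour, cab_type_id, tract_gid)] += 1
    return counts


# Hypothetical rows: (pickup_datetime, cab_type_id, pickup_nyct2010_gid)
sample_trips = [
    (datetime(2015, 1, 1, 8, 5), 1, 101),
    (datetime(2015, 1, 1, 8, 55), 1, 101),
    (datetime(2015, 1, 1, 9, 0), 1, 101),
    (datetime(2015, 1, 1, 8, 30), 2, 101),
    (datetime(2015, 1, 1, 8, 30), 1, None),
]
hourly = hourly_pickups(sample_trips)
```

Each key in the result corresponds to one row of the aggregated table, which is why the table is so much smaller than the full trips table.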
![Page 47: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/47.jpg)
How to get people to read your work
• It has to be interesting. If you’re not excited, probably nobody else is either
• Most people are distracted, and they read things in “fast scroll” mode. Optimize for them
• The questions you ask are more important than the methods you use to answer them
![Page 48: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/48.jpg)
Specific tips
• Write in short paragraphs with straightforward language
• Use plenty of section headers
• Good ratio of pictures to text
• Avoid the dreaded “wall of text”
![Page 49: Analyzing NYC Transit Data](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f046b11a28ab975b8b4639/html5/thumbnails/49.jpg)
Above all…
• Have fun!
• Keep an inquisitive mind. Observe stuff happening around you, ask questions about it, try to answer those questions