data philly meetup - big (geo) data

Big (Geo) Data Science Robert Cheetham [email protected] @rcheetham

Upload: azavea

Post on 12-Jan-2015


DESCRIPTION

Data Philly Meetup for 2/19/2013 on geospatial data science with crime data and applications of GeoTrellis to solve challenges related to large data sets.

TRANSCRIPT

Page 1: Data Philly Meetup - Big (Geo) Data

Big (Geo) Data Science

Robert Cheetham [email protected]

@rcheetham

Page 2: Data Philly Meetup - Big (Geo) Data

Web/Mobile

Geospatial

UI/UX Design

High Performance Computing

R&D

Page 3: Data Philly Meetup - Big (Geo) Data

B Corporation

• Projects w/ Social Value

• Summer of Maps

• Pro Bono Program

• Donate share of profits

Research-Driven

• 10% Research Program

• Academic Collaborations

• Open Source

Page 4: Data Philly Meetup - Big (Geo) Data

Spatial Temporal Forecasting

with Philadelphia Crime Data

Page 5: Data Philly Meetup - Big (Geo) Data

How Phila PD uses Maps

Customized Map Products

Weekly CompStat Meetings

Web Crime Analysis

Page 6: Data Philly Meetup - Big (Geo) Data

Complainant

CAD

Verizon

911

911 Operator

Radio

Dispatcher

Police Officer

District

48 Desk

INCT

Daily download

& Geocoding Routines

Incident Report

Completed by Officer District X

District Y

District Z

Maps distributed

Through Intranet,

Printing, CompStat

INCT & PARS – main database sources

over 5,000 incidents daily, over 2 million annually

PARS

Page 7: Data Philly Meetup - Big (Geo) Data

The Context

1,500,000 people

7,000 police

1,000 civilian employees

2,000,000 new incidents / year

3 crime analysts

Page 8: Data Philly Meetup - Big (Geo) Data

What we did

• Weekly CompStat
• Lots of maps
• Automation of map creation
• Web-based systems

Page 9: Data Philly Meetup - Big (Geo) Data

… but what if we could…

Accelerate the cycle

Proactively notify

Automate the process

Page 10: Data Philly Meetup - Big (Geo) Data

Prototype

ArcView

VB & MapObjects

MS SQL Server

Crime Incidents

Database

Shapefiles

and

GRIDs

Process Documentation

.ini

file

Page 11: Data Philly Meetup - Big (Geo) Data
Page 12: Data Philly Meetup - Big (Geo) Data

… but there was a problem …

Page 13: Data Philly Meetup - Big (Geo) Data

…it was crap …

Page 14: Data Philly Meetup - Big (Geo) Data

… sort of.

Page 15: Data Philly Meetup - Big (Geo) Data

We needed ….

1. Better Statistics

2. Notification

3. Simplicity

Page 16: Data Philly Meetup - Big (Geo) Data
Page 17: Data Philly Meetup - Big (Geo) Data

Crime Analysis: What has happened?
– Mapping (spatial / temporal densities)
– Trending
– Intelligence Dashboard

Early Warning: What is out of the ordinary?
– Statistical & Threshold-based Hunches (data mining)
– Alerting

Risk Forecasting: What is likely to happen next?
– Near Repeat Pattern
– Load Forecasting

Page 18: Data Philly Meetup - Big (Geo) Data

Crime Analysis
– Mapping (spatial / temporal densities)
– Trending
– Intelligence Dashboard

Early Warning
– Statistical & Threshold-based Hunches (data mining)
– Alerting

Risk Forecasting
– Near Repeat Pattern
– Load Forecasting

Page 19: Data Philly Meetup - Big (Geo) Data

Crime Analysis

Page 20: Data Philly Meetup - Big (Geo) Data

Intelligence Dashboard

Page 21: Data Philly Meetup - Big (Geo) Data

Crime Analysis

Page 22: Data Philly Meetup - Big (Geo) Data

Early Warning

Page 23: Data Philly Meetup - Big (Geo) Data

Early Warning

• Geographic Early Warning System
  – A system to alert staff of an unusual situation in a particular location
  – Ingests data sets to automatically “cook on” and only involves staff when a statistically unusual situation is found

[Diagram: Operational Databases, Geostatistical Engine, HunchLab Database, Alerting System]

Page 24: Data Philly Meetup - Big (Geo) Data

Early Warning

Page 25: Data Philly Meetup - Big (Geo) Data

What is a Hunch?

• A proposed hypothesis, saved into the system, and continually tested for validity

• Incident Attribute Requirements
  – Location (x, y)
  – Time (timestamp)
  – Classification
• Hunch Attributes
  – Location (area)
  – Time (recent / historic periods)
  – Classification
• Analyses
  – Statistical Hunch
  – Threshold Hunch

Page 26: Data Philly Meetup - Big (Geo) Data

Hunch Parameters: Location

• Address & Radius
• Precinct/County/Country
• Custom Drawn Area
• Mass Hunch

Page 27: Data Philly Meetup - Big (Geo) Data

Hunch Parameters: Time

• Statistical Hunch
  – Recent Past
  – Historic Past

Page 28: Data Philly Meetup - Big (Geo) Data

Hunch Parameters: Classification

• Category
• Time of Day
• Narrative
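The slides do not say how a hunch is evaluated, so the following is only a minimal sketch under stated assumptions. It filters incidents by the three hunch parameters (location, time, classification), then applies a threshold hunch (a fixed count) and a statistical hunch (recent count compared with the rate implied by the historic period). The Poisson tail test, the `alpha` cut-off and all function names are illustrative assumptions, not HunchLab's actual method.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    x: float
    y: float
    when: datetime
    category: str


def count_matching(incidents, center, radius, start, end, category):
    """Count incidents of one category inside a circle and a time window."""
    cx, cy = center
    return sum(
        1
        for i in incidents
        if i.category == category
        and start <= i.when < end
        and math.hypot(i.x - cx, i.y - cy) <= radius
    )


def threshold_hunch(recent_count, threshold):
    """Fire when the recent count reaches a fixed, user-chosen threshold."""
    return recent_count >= threshold


def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    term = math.exp(-mean)  # P(X = 0)
    cdf = term
    for n in range(1, k):
        term *= mean / n  # P(X = n) from P(X = n - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def statistical_hunch(recent_count, historic_count, recent_days,
                      historic_days, alpha=0.05):
    """Fire when the recent count is unlikely under the historic rate.

    Assumption: counts are treated as Poisson with a constant daily rate.
    """
    expected = historic_count * recent_days / historic_days
    return poisson_upper_tail(recent_count, expected) < alpha
```

With 120 incidents in the historic year, about 2.3 are expected in a 7-day window, so `statistical_hunch(12, 120, 7, 365)` fires while `statistical_hunch(2, 120, 7, 365)` does not.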

Page 29: Data Philly Meetup - Big (Geo) Data

Hunch Helper

Page 30: Data Philly Meetup - Big (Geo) Data

Email Alert

Page 31: Data Philly Meetup - Big (Geo) Data

Hunch Details

Page 32: Data Philly Meetup - Big (Geo) Data

Risk Forecasting

Page 33: Data Philly Meetup - Big (Geo) Data

Predictive Analytics?

• Prediction vs. Forecasting

Page 34: Data Philly Meetup - Big (Geo) Data

Near Repeat Pattern Analysis

Page 35: Data Philly Meetup - Big (Geo) Data

Contagious Crime?

• Near repeat pattern analysis
• “If one burglary occurs, how does the risk change nearby?”

Page 36: Data Philly Meetup - Big (Geo) Data

What Do We Mean By Near Repeat?

• Repeat victimization
  – Incident at the same location at a later time (likely related)
• Near repeat victimization
  – Incident at a nearby location at a later time (likely related)
• Incident A (place, time) --> Incident B (place, time)

Page 37: Data Philly Meetup - Big (Geo) Data

Near Repeat Pattern Analysis

• The goal:
  – Quantify short term risk due to near-repeat victimization
    • “If one burglary occurs, how does the risk of burglary for the neighbors change?”
• What we know:
  – Incident A (place, time) --> Incident B (place, time)
    • Distance between A and B
    • Timeframe between A and B
• What we need to know:
  – What distances/timeframes are not simply random?

Page 38: Data Philly Meetup - Big (Geo) Data

Near Repeat Pattern Analysis

• The process
  – Observe the pattern in historic data
  – Simulate the pattern in randomized historic data
  – Compare the observed pattern to the simulated patterns
  – Apply the non-random pattern to new incidents
• An example
  – 180 days of burglaries in Division 6 of Philadelphia
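The observe / simulate / compare steps can be sketched as a permutation test. The slides do not name the statistic, so this sketch assumes a Knox-style count of incident pairs that are close in both space and time, and it randomizes the historic data by shuffling the incident times across the fixed locations. The function names, the single distance/time band and the default of 199 runs are all illustrative assumptions.

```python
import random
from itertools import combinations
from math import hypot


def close_pairs(points, times, max_dist, max_days):
    """Observe: count incident pairs that are near in both space and time."""
    count = 0
    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        near_in_space = hypot(x1 - x2, y1 - y2) <= max_dist
        near_in_time = abs(times[i] - times[j]) <= max_days
        if near_in_space and near_in_time:
            count += 1
    return count


def near_repeat_test(points, times, max_dist, max_days, runs=199, seed=0):
    """Return (observed pairs, mean simulated pairs, pseudo p-value)."""
    rng = random.Random(seed)
    observed = close_pairs(points, times, max_dist, max_days)
    shuffled = list(times)
    at_least_as_extreme = 0
    simulated_total = 0
    for _ in range(runs):
        # Simulate: keep the locations, randomize which time goes where.
        rng.shuffle(shuffled)
        simulated = close_pairs(points, shuffled, max_dist, max_days)
        simulated_total += simulated
        if simulated >= observed:
            at_least_as_extreme += 1
    # Compare: share of runs (plus the observed one) at least as extreme.
    p_value = (at_least_as_extreme + 1) / (runs + 1)
    return observed, simulated_total / runs, p_value
```

A small p-value suggests that the chosen distance and timeframe hold more close pairs than chance would produce; the ratio of observed to simulated pairs gives a rough sense of how much the risk is elevated in that band.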

Page 39: Data Philly Meetup - Big (Geo) Data

Near Repeat Pattern Analysis

Page 40: Data Philly Meetup - Big (Geo) Data

Near Repeat Pattern Analysis

Page 41: Data Philly Meetup - Big (Geo) Data

Near Repeat Pattern Analysis

Page 42: Data Philly Meetup - Big (Geo) Data

Near Repeat Pattern Analysis

Page 43: Data Philly Meetup - Big (Geo) Data

Near Repeat Pattern Analysis

• How can you test your own data?
  – Near Repeat Calculator
    • http://www.temple.edu/cj/misc/nr/
• Papers
  – Near-Repeat Patterns in Philadelphia Shootings (2008)
    • One city block & two weeks after one shooting
      – 33% increase in likelihood of a second event

Jerry Ratcliffe

Temple University

Page 44: Data Philly Meetup - Big (Geo) Data

Contagious Crime?

Page 45: Data Philly Meetup - Big (Geo) Data

Workload Forecasting

Page 46: Data Philly Meetup - Big (Geo) Data

Improving CompStat

• Workload forecasting
• “Given the time of year, day of week, time of day and general trend, what counts of crimes should I expect?”

Page 47: Data Philly Meetup - Big (Geo) Data

What Do We Mean By Load Forecasting?

• Workload forecasting
• Generating aggregate crime counts for a future timeframe using cyclical time series analysis

Measure cyclical patterns

Identify non-cyclical trend

Forecast expected count

+

bit.ly/gorrcrimeforecastingpaper

Page 48: Data Philly Meetup - Big (Geo) Data

Load Forecasting

• Measure cyclical patterns
• Take historic incidents (for example: last five years)
• Generate multiplicative seasonal indices
  – For each time cycle:
    » time of year
    » day of week
    » time of day
  – Count incidents within each time unit (for example: Monday)
  – Calculate average per time unit if incidents were evenly distributed
  – Divide counts within each time unit by the calculated average to generate multiplicative indices
    » Index ~ 1 means at the average
    » Index > 1 means above average
    » Index < 1 means below average
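The index calculation above can be sketched in a few lines. This is a minimal illustration for one time cycle (day of week); the function name and the input shape (one unit label per historic incident) are assumptions. It also assumes every unit occurs equally often in the historic window, which is roughly true for weekdays over several years.

```python
from collections import Counter


def seasonal_indices(incident_units, all_units):
    """Multiplicative index per time unit (for example, per weekday).

    incident_units: the time unit of each historic incident.
    all_units: every unit in the cycle, so empty units get an index of 0.
    """
    counts = Counter(incident_units)
    # Count each unit would have if incidents were evenly distributed.
    average = len(incident_units) / len(all_units)
    return {unit: counts.get(unit, 0) / average for unit in all_units}


weekdays = ["Mon", "Tue", "Wed"]  # shortened cycle, for illustration only
history = ["Mon"] * 30 + ["Tue"] * 10 + ["Wed"] * 20
indices = seasonal_indices(history, weekdays)
# Mon is above average (1.5), Tue below (0.5), Wed at the average (1.0)
```

The same function would be run once per cycle (time of year, day of week, time of day) to obtain one set of indices for each.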

Page 49: Data Philly Meetup - Big (Geo) Data

Load Forecasting

Page 50: Data Philly Meetup - Big (Geo) Data

Load Forecasting

Page 51: Data Philly Meetup - Big (Geo) Data

Load Forecasting

Page 52: Data Philly Meetup - Big (Geo) Data

Load Forecasting

Page 53: Data Philly Meetup - Big (Geo) Data

Load Forecasting

• Identify non-cyclical trend
• Take recent daily counts (for example: last year daily counts)
• Remove cyclical trends by dividing by indices
• Run a trending function on the new counts
  – Simple average
    » Last X Days
  – Smoothing function
    » Exponential smoothing
    » Holt’s linear exponential smoothing
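The trending functions listed above can be sketched as follows. The smoothing constants (`alpha`, `beta`) and the way the first level and trend are initialized are assumptions chosen for illustration; the slides do not specify them.

```python
def deseasonalize(counts, indices):
    """Remove the cyclical pattern: divide each count by its seasonal index."""
    return [count / index for count, index in zip(counts, indices)]


def simple_average(values, last_n):
    """Average of the last X days."""
    window = values[-last_n:]
    return sum(window) / len(window)


def exponential_smoothing(values, alpha=0.3):
    """Single exponential smoothing; returns the final smoothed level."""
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def holt_linear(values, alpha=0.3, beta=0.1):
    """Holt's linear exponential smoothing; returns (level, trend).

    Needs at least two values. The start-up (first value as level, first
    difference as trend) is one common choice, assumed here.
    """
    level = values[0]
    trend = values[1] - values[0]
    for value in values[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend
```

The simple average and single exponential smoothing produce only a level, which is why they give a flat projection; Holt's method also tracks a slope.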

Page 54: Data Philly Meetup - Big (Geo) Data

Load Forecasting

• Forecast expected count
• Project trend into future timeframe
  – Always flat
    » Simple average
    » Exponential smoothing
  – Linear trend
    » Holt’s linear exponential smoothing
• Multiply by seasonal indices to reseasonalize the data
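A minimal sketch of the projection step, assuming a deseasonalized level and trend are already available (a trend of 0 gives the "always flat" case). The function name is hypothetical, and clamping negative projections to zero is an added assumption, since a count cannot be negative.

```python
def forecast_counts(level, trend, future_indices):
    """Project the trend forward, then reseasonalize each future period.

    level, trend: deseasonalized level and per-period slope.
    future_indices: seasonal index of each future period, in order.
    """
    forecasts = []
    for step, index in enumerate(future_indices, start=1):
        projected = max(0.0, level + trend * step)  # counts cannot go below 0
        forecasts.append(projected * index)
    return forecasts


flat = forecast_counts(100, 0, [1.2, 0.8])    # flat projection
rising = forecast_counts(100, 2, [1.0, 0.5])  # linear trend of +2 per period
```

Where several cycles are in play (time of year, day of week, time of day), one plausible approach is to multiply the indices of each cycle together to get the index for a future period; that combination rule is an assumption here, not something stated in the slides.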

Page 55: Data Philly Meetup - Big (Geo) Data

Load Forecasting

Measure cyclical patterns

Identify non-cyclical trend

Forecast expected count

+

bit.ly/gorrcrimeforecastingpaper

Page 56: Data Philly Meetup - Big (Geo) Data

Improving CompStat

Page 57: Data Philly Meetup - Big (Geo) Data

How Do We Know It’s Accurate?

• Testing
• Generated forecasting techniques (examples)
  – Commonly Used
    » Average of last 30 days
    » Average of last 365 days
    » Last year’s count for the same time period
  – Advanced Combinations
    » Different cyclical indices (example: day of year vs. month of year)
    » Different levels of geographic aggregation for indices
    » Different trending functions
• Scoring methodologies (examples)
  – Mean absolute percent error (with some enhancements)
  – Mean percent error
  – Mean squared error
• Run thousands of forecasts through testing framework
• Choose the right technique in the right situation
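The three scoring measures named above have standard textbook definitions, sketched here. The "enhancements" to mean absolute percent error are not described in the slides; skipping periods with an actual count of zero (where a percent error is undefined) is one assumption made for this sketch, as is the sign convention (actual minus forecast).

```python
def mean_absolute_percent_error(actual, forecast):
    """MAPE in percent, skipping periods whose actual count is zero."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def mean_percent_error(actual, forecast):
    """MPE in percent; the sign shows bias (positive = forecasts too low)."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum((a - f) / a for a, f in pairs) / len(pairs)


def mean_squared_error(actual, forecast):
    """MSE; penalizes large misses more heavily than small ones."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


actual = [100, 200]
forecast = [110, 180]
scores = (
    mean_absolute_percent_error(actual, forecast),
    mean_percent_error(actual, forecast),
    mean_squared_error(actual, forecast),
)
```

In this example the two errors cancel in the mean percent error but not in the absolute version, which is why a testing framework would typically look at more than one score.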

Page 58: Data Philly Meetup - Big (Geo) Data

Ongoing Research

Page 59: Data Philly Meetup - Big (Geo) Data

Research Topics

• Risk Forecasting
  – Load forecasting enhancements
    • Weather and special events
  – Combining short and long term risk forecasts (Temple)
    • Socioeconomic changes in neighborhoods
  – Risk Terrain Modeling (Rutgers)
    • Context of crime at the microplace

Page 60: Data Philly Meetup - Big (Geo) Data

Research Topics

Page 61: Data Philly Meetup - Big (Geo) Data

Research Topics

• Risk Forecasting
  – Offender Management
    • Prioritize offenders based upon statistical models using past behaviors
• Evaluation
  – Automate Randomized Controlled Trials

Page 62: Data Philly Meetup - Big (Geo) Data

Data Processing for Big (Geo) Data

Page 63: Data Philly Meetup - Big (Geo) Data

A Story

Page 64: Data Philly Meetup - Big (Geo) Data

Robert’s Rules of Housing

Close to Center City: somewhat important

Walk to Grocery Store: vital

Nearby Restaurants: very important

Library: nice to have

Near a Park: somewhat important

Biking / walking distance from our work: very important

Biking distance to fencing: somewhat important

Page 65: Data Philly Meetup - Big (Geo) Data

Child Care

Local School Rankings

Farmer's Market

Car Share

Public Transit

Your factors might include…

Page 66: Data Philly Meetup - Big (Geo) Data

We stand on the shoulders of giants

Page 67: Data Philly Meetup - Big (Geo) Data

Not a new idea … Design with Nature

Page 68: Data Philly Meetup - Big (Geo) Data

Not a new Idea … Dana Tomlin

Page 69: Data Philly Meetup - Big (Geo) Data

Desktop GIS

Page 70: Data Philly Meetup - Big (Geo) Data

[Figure: four factor layers weighted × 5, × 2, × 3 and × 1, added together into a single result layer]

Weighted Overlay
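A weighted overlay multiplies each factor layer by its weight and adds the results cell by cell. The sketch below uses small nested lists as rasters and the weights shown on the slide (5, 2, 3, 1); the layer values and the function name are made up for illustration, and all layers are assumed to share the same extent and cell size.

```python
def weighted_overlay(layers, weights):
    """Cell-by-cell weighted sum of equally sized raster layers."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [
            sum(weight * layer[r][c] for layer, weight in zip(layers, weights))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Four hypothetical 2 x 2 suitability layers (1 = suitable, 0 = not).
grocery = [[1, 0], [1, 1]]
park = [[0, 0], [1, 0]]
transit = [[1, 1], [0, 0]]
library = [[0, 1], [1, 1]]

score = weighted_overlay([grocery, park, transit, library], [5, 2, 3, 1])
```

Each cell of `score` is the total suitability of that location; changing a weight only requires recomputing the sum, which is what makes the approach iterative.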

Page 71: Data Philly Meetup - Big (Geo) Data

Geography-driven Decisions

Iterative

Individual

Web [and Mobile]

Growing data sets

Summary

Page 72: Data Philly Meetup - Big (Geo) Data

Web Challenges

Page 73: Data Philly Meetup - Big (Geo) Data

Web is different from the Desktop

Lots of simultaneous users

Stateless environment

HTML+JS+CSS

Users are less skilled

Users are less patient

Page 74: Data Philly Meetup - Big (Geo) Data

But wait … there’s a problem

10 – 60 second calculation time

Multiple simultaneous users …

… that are impatient

Page 75: Data Philly Meetup - Big (Geo) Data

Data Challenges

Page 76: Data Philly Meetup - Big (Geo) Data

Big Data – Social Media

Page 77: Data Philly Meetup - Big (Geo) Data

Big Data – Science

Page 78: Data Philly Meetup - Big (Geo) Data

Big Data – Citizen Science

Page 79: Data Philly Meetup - Big (Geo) Data

Big Data – Cities

Page 80: Data Philly Meetup - Big (Geo) Data

Early Prototype

Page 81: Data Philly Meetup - Big (Geo) Data
Page 82: Data Philly Meetup - Big (Geo) Data

Specific Optimization Goals

New Raster File Structure

Distributed processing

Binary messaging protocol

Page 83: Data Philly Meetup - Big (Geo) Data

Optimization: File Format

Limit data type and range

1D arrays are fast to read/write

Tiled

Pyramids

Azavea Raster Grid (ARG)
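The ideas on this slide (a narrow data type, a flat 1D array, tiles, pyramids) can be illustrated with the standard library alone. This is a sketch of the general technique, not the ARG specification: the 8-bit type code, the row-major layout and every function name are assumptions.

```python
from array import array


def flatten(grid, typecode="b"):
    """Store a 2D grid as one row-major 1D array of a narrow numeric type.

    Type code "b" is a signed 8-bit integer; a value outside -128..127
    raises OverflowError, which enforces the limited range.
    """
    cols = len(grid[0])
    return array(typecode, (value for row in grid for value in row)), cols


def cell(flat, cols, row, col):
    """Look up one cell with simple index arithmetic."""
    return flat[row * cols + col]


def tile(flat, cols, size, tile_row, tile_col):
    """Copy out one square tile so it can be read or processed on its own."""
    rows = len(flat) // cols
    r0, c0 = tile_row * size, tile_col * size
    out = array(flat.typecode)
    for r in range(r0, min(r0 + size, rows)):
        out.extend(flat[r * cols + c0 : r * cols + min(c0 + size, cols)])
    return out


def pyramid_level(flat, cols):
    """Half-resolution overview: keep every second cell in each direction."""
    rows = len(flat) // cols
    kept = (
        flat[r * cols + c]
        for r in range(0, rows, 2)
        for c in range(0, cols, 2)
    )
    return array(flat.typecode, kept), (cols + 1) // 2


grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
flat, cols = flatten(grid)
```

Sampling every second cell is the simplest way to build an overview; averaging or taking the most common value in each block are alternatives, depending on what the data represents.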

Page 84: Data Philly Meetup - Big (Geo) Data

Optimization: Distributed Processing

Parallelizable: Local Ops and Focal Ops

Support multiple
– Threads
– Cores
– CPUs
– Machines

Considered
– Hadoop
– Amazon Map Reduce
– Beowulf
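Local and focal operations parallelize well because each output cell depends only on the same cell (local) or on a small neighborhood (focal) of the input. The sketch below shows a local addition and a 3 x 3 focal sum split into row bands across workers. It is an illustration only: the names are hypothetical, and Python threads are used for brevity even though they do not speed up CPU-bound work the way separate cores or machines would.

```python
from concurrent.futures import ThreadPoolExecutor


def local_add(a, b):
    """Local op: each output cell depends only on the same cell of the inputs."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def focal_sum_rows(grid, row_start, row_end):
    """Focal op: 3 x 3 neighborhood sum for rows row_start..row_end - 1.

    Cells beyond the grid edge are ignored.
    """
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(row_start, row_end):
        out_row = []
        for c in range(cols):
            total = 0
            for rr in range(max(0, r - 1), min(rows, r + 2)):
                for cc in range(max(0, c - 1), min(cols, c + 2)):
                    total += grid[rr][cc]
            out_row.append(total)
        out.append(out_row)
    return out


def focal_sum_parallel(grid, workers=2):
    """Split the grid into row bands and compute each band in its own worker."""
    rows = len(grid)
    band = -(-rows // workers)  # ceiling division
    ranges = [(start, min(start + band, rows)) for start in range(0, rows, band)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda rng: focal_sum_rows(grid, *rng), ranges))
    return [row for part in parts for row in part]
```

Every worker reads the whole input grid but writes only its own band, so no locking is needed; in a multi-machine setting each band would instead be shipped with a one-row border on each side.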

Page 85: Data Philly Meetup - Big (Geo) Data

Success!!

Reduced from 10-60 seconds to

<500 milliseconds

Page 86: Data Philly Meetup - Big (Geo) Data

Optimizing one process sub-optimizes others

Complex to configure and maintain

Limited to one operation

No interpolation

No mixing
– cell sizes
– extents
– projections
– etc.

Page 87: Data Philly Meetup - Big (Geo) Data
Page 88: Data Philly Meetup - Big (Geo) Data

Broader set of functionality

Both raster and vector

Scala + Akka

Open source

Page 89: Data Philly Meetup - Big (Geo) Data

Faster is Different

Page 90: Data Philly Meetup - Big (Geo) Data
Page 91: Data Philly Meetup - Big (Geo) Data
Page 92: Data Philly Meetup - Big (Geo) Data
Page 93: Data Philly Meetup - Big (Geo) Data
Page 94: Data Philly Meetup - Big (Geo) Data
Page 95: Data Philly Meetup - Big (Geo) Data
Page 96: Data Philly Meetup - Big (Geo) Data
Page 97: Data Philly Meetup - Big (Geo) Data
Page 98: Data Philly Meetup - Big (Geo) Data
Page 99: Data Philly Meetup - Big (Geo) Data
Page 100: Data Philly Meetup - Big (Geo) Data
Page 101: Data Philly Meetup - Big (Geo) Data

Regional/State: 84 ms

National: 84 ms

Large Country: 115 ms

Continental: 271 ms

Planet: 1.2 – 2.0 s

Page 102: Data Philly Meetup - Big (Geo) Data

Ongoing R&D

Page 103: Data Philly Meetup - Big (Geo) Data

GPUs

Page 104: Data Philly Meetup - Big (Geo) Data
Page 105: Data Philly Meetup - Big (Geo) Data

GPU Results

Re-wrote a few Map Algebra operations:
– Local
– Neighborhood
– Zonal
– Viewshed
– etc.

15 – 120x

Large grids

Large kernels

Page 106: Data Philly Meetup - Big (Geo) Data

Vector

Neighborhood/Focal

Spatial Statistics

Integration

New Spatial Operations

Page 107: Data Philly Meetup - Big (Geo) Data

Urban Forest Ecosystem Modeling

Page 108: Data Philly Meetup - Big (Geo) Data

Crime Analysis, Early Warning and Forecasting

Page 109: Data Philly Meetup - Big (Geo) Data

GDAL

GeoServer

PostGIS

R

GeoDa

Open Source Geoprocessing

Page 110: Data Philly Meetup - Big (Geo) Data

Many Thanks!

© Photo used with permission from Alphafish, via Flickr.com

Page 111: Data Philly Meetup - Big (Geo) Data

Big (Geo) Data Science

[We are hiring]

Robert Cheetham [email protected]

@rcheetham