pycon 2012: python for data lovers: explore it, analyze it, map it

Python for Open Data Lovers: Explore It, Analyze It, Map It Jackie Kazil @jackiekazil Dana Bauer @geography76 Saturday, March 10, 2012

Upload: jackie-kazil

Post on 06-May-2015


DESCRIPTION

Slides from Pycon 2012. Speakers: Jacqueline Kazil , Dana Bauer More info: https://us.pycon.org/2012/schedule/presentation/426/ | Video: http://pyvideo.org/video/676/python-for-data-lovers-explore-it-analyze-it-m

TRANSCRIPT

Page 1: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Python for Open Data Lovers: Explore It, Analyze It, Map It

Jackie Kazil, @jackiekazil

Dana Bauer, @geography76

Saturday, March 10, 2012

Page 2: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 3: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 4: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 5: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 6: PyCon 2012: Python for data lovers: explore it, analyze it, map it

• open data everywhere

• a data swiss army knife

• finding network patterns

• finding spatial patterns

• which stories to pursue? moving beyond data analysis

Where are we going?


Page 7: PyCon 2012: Python for data lovers: explore it, analyze it, map it

• Data.gov

• OpenDataPhilly

• DC Data Catalog

• DataSF

• Chicago Data Portal

• NYC Open Data

• London Datastore


Page 8: PyCon 2012: Python for data lovers: explore it, analyze it, map it

• assembly member expenses

• bicycle lanes

• city purchase orders

• dialysis centers

• elevation data

• filming locations

• Google Transit Feed Specification (GTFS)

• historical photos

• influenza rates

• judicial districts

• Key Stage 2 test results by free school meal eligibility

• land cover

• monthly calls to Human Services Agency switchboard operators

• neighborhood health clinics

• Oyster ticket stop locations

• political districts

• quality of life indicators

• restaurant inspections

• sewer lines

• traffic counts

• utility excavation and paving five-year plan

• violent crime incidents

• ward offices

• youth centers

• zoning

**real-time parking availability and pricing**

Page 9: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 10: PyCon 2012: Python for data lovers: explore it, analyze it, map it

http://bit.ly/DCdatafail

Page 11: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 12: PyCon 2012: Python for data lovers: explore it, analyze it, map it

• What are DC agencies spending money on?

• How much are they spending?

• What are the relationships between businesses and agencies?

• Where are these businesses located?

Page 13: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 14: PyCon 2012: Python for data lovers: explore it, analyze it, map it

swiss army knife

• csvkit: http://csvkit.readthedocs.org/

• a set of Python utilities for working with csv

• meant to replace csv module

• pip install csvkit (no issues!)


Page 15: PyCon 2012: Python for data lovers: explore it, analyze it, map it

$ csvcut -n purchase2011_cleaned.csv
  1: PO_NUMBER
  2: AGENCY_NAME
  3: NIGP_DESCRIPTION
  4: PO_TOTAL_AMOUNT
  5: ORDER_DATE
  6: SUPPLIER
  7: SUPPLIER_FULL_ADDRESS
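For readers who want the same column listing from inside a script, here is a minimal sketch using only Python's standard csv module. It is not how csvkit is implemented. The function name column_names and the in-memory SAMPLE are assumptions made for illustration; the header and the one data row are copied from the slides.

```python
import csv
import io

# Header plus one data row copied from the slides. A real run would open
# purchase2011_cleaned.csv instead of this in-memory sample.
SAMPLE = io.StringIO(
    "PO_NUMBER,AGENCY_NAME,NIGP_DESCRIPTION,PO_TOTAL_AMOUNT,"
    "ORDER_DATE,SUPPLIER,SUPPLIER_FULL_ADDRESS\n"
    "PO352244,STATE SUPERINTENDENT OF EDUCATION (OSSE),"
    "EDUCATION AND TRAINING CONSULTING 38,408644.73,01/04/2011,"
    'MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"\n'
)


def column_names(csv_file):
    """Return (1-based index, name) pairs for the header row."""
    header = next(csv.reader(csv_file))
    return list(enumerate(header, start=1))


columns = column_names(SAMPLE)
for index, name in columns:
    print(f"{index:3d}: {name}")
```

The 1-based numbering mirrors the column numbers that the later csvcut -c and csvgrep -c commands refer to.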

Page 16: PyCon 2012: Python for data lovers: explore it, analyze it, map it

$ csvcut -c 2,6 purchase2011_cleaned.csv | csvstat
  1. AGENCY_NAME
        <type 'unicode'>
        Nulls: False
        Unique values: 85
        5 most frequent values:
                DISTRICT OF COLUMBIA PUBLIC SCHOOLS:        2410
                STATE SUPERINTENDENT OF EDUCATION (OSSE):   1340
                DEPARTMENT OF HEALTH:                       895
                OFFICE OF CHIEF TECHNOLOGY OFFICER:         786
                OFF PUBLIC ED FACILITIES MODERNIZATION:     722
        Max length: 40
  2. SUPPLIER
        <type 'unicode'>
        Nulls: False
        Unique values: 4357
        5 most frequent values:
                OST, INC.:                        841
                DELL COMPUTER CORP.:              366
                AMERICAN EXPRESS COMPANY:         282
                MVS, INC.:                        176
                CAPITAL SERVICES AND SUPPLIES:    167
        Max length: 52

Row count: 16075
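The per-column summary above (unique values, most frequent values, maximum length) can be approximated in plain Python with collections.Counter. This is a rough sketch, not csvstat's implementation. The function name column_stats and the dictionary keys are assumptions; the four sample rows are abridged from the purchase orders shown on the slides.

```python
import csv
import io
from collections import Counter

# Agency and supplier columns of four purchase orders shown on the slides.
SAMPLE = io.StringIO(
    "AGENCY_NAME,SUPPLIER\n"
    "STATE SUPERINTENDENT OF EDUCATION (OSSE),MAYA ANGELOU PCS\n"
    "STATE SUPERINTENDENT OF EDUCATION (OSSE),MAYA ANGELOU PCS\n"
    "PUBLIC CHARTER SCHOOLS,MAYA ANGELOU PCS\n"
    "STATE SUPERINTENDENT OF EDUCATION (OSSE),MAYA ANGELOU PCS\n"
)


def column_stats(csv_file, column, top=5):
    """Summarize one text column: unique count, top values, max length."""
    values = [row[column] for row in csv.DictReader(csv_file)]
    counts = Counter(values)
    return {
        "rows": len(values),
        "unique": len(counts),
        "most_common": counts.most_common(top),
        "max_length": max(len(value) for value in values),
    }


stats = column_stats(SAMPLE, "AGENCY_NAME")
for value, count in stats["most_common"]:
    print(f"{value}: {count}")
```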

Page 17: PyCon 2012: Python for data lovers: explore it, analyze it, map it

$ csvgrep -c 6 -r ^MAYA purchase2011_cleaned.csv

PO_NUMBER,AGENCY_NAME,NIGP_DESCRIPTION,PO_TOTAL_AMOUNT,ORDER_DATE,SUPPLIER,SUPPLIER_FULL_ADDRESS
PO352244,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,408644.73,01/04/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO352652,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,111679.16,01/07/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO352920,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2205630.13,01/11/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO355150,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,391092.49,02/07/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO356426,STATE SUPERINTENDENT OF EDUCATION (OSSE),FINANCIAL SERVICES (NOT OTHERWISE CLASSIFIED) 49,999891,02/23/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO356632,STATE SUPERINTENDENT OF EDUCATION (OSSE),PROFESSIONAL SERVICES (NOT OTHERWISE CLASSIFIED) 58,187200,02/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO359961,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,1753238,04/12/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO360284,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,110729.88,04/14/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO361203,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,92617.32,04/28/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO351462-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATIONAL RESEARCH SERVICES 19,152229.95,05/05/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO364208,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,118825.51,06/09/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO366839,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2767027,07/12/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO365094-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,98092.35,08/15/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO370948,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,45736.58,08/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO361027-V5,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,29424.86,09/06/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO374132,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,9000,09/28/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO377919,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,491663.6,10/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO381219,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,120188.81,11/29/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO383965,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,294690.57,12/22/2011,MAYA ANGELOU PCS,"1436 U STREET, NW SUITE 203, WASHINGTON, DC, 20009"
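The same kind of regular-expression filter on one column can be sketched with the standard csv and re modules. This is an illustration under assumptions, not csvgrep's code: the function name grep_rows is hypothetical, and the sample keeps only three of the seven columns, with values taken from orders shown on the slides.

```python
import csv
import io
import re

# Three columns of four purchase orders that appear on the slides.
SAMPLE = io.StringIO(
    "AGENCY_NAME,PO_TOTAL_AMOUNT,SUPPLIER\n"
    "STATE SUPERINTENDENT OF EDUCATION (OSSE),408644.73,MAYA ANGELOU PCS\n"
    "PUBLIC CHARTER SCHOOLS,14206937.0,FRIENDSHIP PCS\n"
    "PUBLIC CHARTER SCHOOLS,2205630.13,MAYA ANGELOU PCS\n"
    "DEPARTMENT OF HEALTH,18000000.0,DC PRIMARY CARE ASSOCIATION\n"
)


def grep_rows(csv_file, column, pattern):
    """Keep the rows whose value in `column` matches the regex `pattern`."""
    regex = re.compile(pattern)
    return [row for row in csv.DictReader(csv_file) if regex.search(row[column])]


matches = grep_rows(SAMPLE, "SUPPLIER", r"^MAYA")
for row in matches:
    print(row["AGENCY_NAME"], row["PO_TOTAL_AMOUNT"], sep=", ")
```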

Page 18: PyCon 2012: Python for data lovers: explore it, analyze it, map it

$ csvcut -c 4,2,6,5 purchase2011_cleaned.csv | csvsort -r | head -n 20 | csvlook
----------------------------------------------------------------------------------------------------------
| PO_TOTAL_AMOUNT | AGENCY_NAME                            | SUPPLIER                       | ORDER_DATE |
----------------------------------------------------------------------------------------------------------
| 154133337.02    | DEPARTMENT OF TRANSPORTATION           | SKANSKA-FACCHINA JV            | 2011-11-10 |
| 62677473.88     | DEPARTMENT OF REAL ESTATE SERVICES     | EEC OF DC INC-FORRESTER CONSTR | 2011-09-22 |
| 31809425.48     | DEPARTMENT OF HEALTH                   | DEFENSE LOGISTIC AGENCY        | 2011-09-08 |
| 23600580.0      | DEPARTMENT OF CORRECTIONS              | UNITY HEALTH CARE, INC.        | 2011-10-24 |
| 23538552.0      | DEPARTMENT OF REAL ESTATE SERVICES     | EEC-FORRESTER ANACOSTIA        | 2011-11-08 |
| 22375314.45     | DEPARTMENT OF CORRECTIONS              | CORRECTIONS CORPORATION OF     | 2011-05-25 |
| 21450000.04     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-08-18 |
| 20813348.99     | DEPARTMENT OF REAL ESTATE SERVICES     | THE JOHN AKRIDGE CO            | 2011-06-28 |
| 20622000.0      | DEPARTMENT OF TRANSPORTATION           | W M SCHLOSSER CO INC           | 2011-08-29 |
| 19824914.0      | DEPARTMENT OF CORRECTIONS              | CORRECTIONS CORPORATION OF     | 2011-10-24 |
| 18300956.56     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-11-29 |
| 18104339.98     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-05-17 |
| 18000000.0      | DEPARTMENT OF HEALTH                   | DC PRIMARY CARE ASSOCIATION    | 2011-03-10 |
| 17000000.0      | DEPARTMENT OF HEALTH                   | CHILDRENS NATIONAL MEDICAL CTR | 2011-11-25 |
| 16850000.0      | DEPUTY MAYOR FOR ECONOMIC DEVELOPMENT  | 2 M STREET REDEVELOPMENT LLC   | 2011-09-29 |
| 16333257.33     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-06-02 |
| 14206937.0      | PUBLIC CHARTER SCHOOLS                 | FRIENDSHIP PCS                 | 2011-07-12 |
| 13862557.44     | MUNICIPAL FACILITIES: NON-CAPITAL      | US SECURITY ASSOCIATES, INC.   | 2011-10-07 |
| 13800000.0      | DISTRICT DEPARTMENT OF THE ENVIRONMENT | VERMONT ENERGY INVESTMENT CORP | 2011-10-04 |
----------------------------------------------------------------------------------------------------------
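A "largest orders first" ranking like the one above can also be sketched in a few lines of standard-library Python. The function name largest_orders is an assumption for illustration, and the four sample rows are taken from the table on this slide, deliberately shuffled.

```python
import csv
import io

# Four rows from the table on this slide, in shuffled order.
SAMPLE = io.StringIO(
    "PO_TOTAL_AMOUNT,AGENCY_NAME,SUPPLIER\n"
    "18000000.0,DEPARTMENT OF HEALTH,DC PRIMARY CARE ASSOCIATION\n"
    "154133337.02,DEPARTMENT OF TRANSPORTATION,SKANSKA-FACCHINA JV\n"
    "14206937.0,PUBLIC CHARTER SCHOOLS,FRIENDSHIP PCS\n"
    "62677473.88,DEPARTMENT OF REAL ESTATE SERVICES,EEC OF DC INC-FORRESTER CONSTR\n"
)


def largest_orders(csv_file, amount_column, limit):
    """Return the `limit` rows with the largest numeric amount."""
    rows = list(csv.DictReader(csv_file))
    # Convert to float so the sort is numeric, not alphabetical.
    rows.sort(key=lambda row: float(row[amount_column]), reverse=True)
    return rows[:limit]


top = largest_orders(SAMPLE, "PO_TOTAL_AMOUNT", 2)
for row in top:
    print(row["PO_TOTAL_AMOUNT"], row["AGENCY_NAME"], row["SUPPLIER"], sep=" | ")
```

The float conversion matters: sorted as text, "18000000.0" would rank above "154133337.02".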

Page 19: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Social Network Analysis

“Social network analysis is focused on uncovering the patterning of people's interaction.” - http://www.insna.org/sna/what.html

Page 20: PyCon 2012: Python for data lovers: explore it, analyze it, map it

99th House

President: Reagan
House majority: Democrats
Years: 1985, 1986

Page 21: PyCon 2012: Python for data lovers: explore it, analyze it, map it

107th House

President: Bush
House majority: Republicans
Years: 2001, 2002

Page 22: PyCon 2012: Python for data lovers: explore it, analyze it, map it

108th House

President: Bush
House majority: Republicans
Years: 2003, 2004

Page 23: PyCon 2012: Python for data lovers: explore it, analyze it, map it

109th House

President: Bush
House majority: Republicans
Years: 2005, 2006

Page 24: PyCon 2012: Python for data lovers: explore it, analyze it, map it

110th House

President: Bush
House majority: Democrats
Years: 2007, 2008

Page 25: PyCon 2012: Python for data lovers: explore it, analyze it, map it

111th House

President: Obama
House majority: Democrats
Years: 2009, 2010

Page 26: PyCon 2012: Python for data lovers: explore it, analyze it, map it

CSV to network

import networkx as nx

G = nx.Graph()
node_edgelist = []

# grab edges: one (n, e) pair per row; how n and e are read from the
# row is not shown on the slide
for row in csv_file:
    node_edgelist.append((n, e))

# create edges: add_edge_or_weight is a helper that is not shown on the slide
for f in node_edgelist:
    for t in node_edgelist:
        if t != f:
            add_edge_or_weight(G, f[0], t[0])
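The slide leaves out how rows become weighted edges, so the following is only one plausible reading, sketched under loud assumptions: that each row yields a (supplier, agency) pair, and that two suppliers are linked once for every agency they share. The function name shared_agency_edges and the labels S1 to S3 and A1 to A2 are hypothetical placeholders, not values from the dataset.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (supplier, agency) pairs; placeholders, not real data.
ROWS = [
    ("S1", "A1"),
    ("S2", "A1"),
    ("S3", "A1"),
    ("S1", "A2"),
    ("S2", "A2"),
]


def shared_agency_edges(rows):
    """Count, for every pair of suppliers, how many agencies they share."""
    suppliers_by_agency = defaultdict(set)
    for supplier, agency in rows:
        suppliers_by_agency[agency].add(supplier)

    weights = Counter()
    for suppliers in suppliers_by_agency.values():
        # sorted() gives each undirected pair one canonical key.
        for a, b in combinations(sorted(suppliers), 2):
            weights[(a, b)] += 1
    return weights


weights = shared_agency_edges(ROWS)
for (a, b), weight in sorted(weights.items()):
    print(a, b, weight)
```

The resulting counts could then be loaded into networkx with G.add_weighted_edges_from((a, b, w) for (a, b), w in weights.items()). Grouping by agency first avoids comparing every row with every other row.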

Page 27: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Centrality Analysis (networkx)

Degree: nx.degree(G)
Number of connections; more connections = more important

Closeness centrality: nx.closeness_centrality(G)
Distance to all other nodes; closer = more important

Betweenness centrality: nx.betweenness_centrality(G)
Share of shortest paths that pass through a node; a measure of control over information flow

PageRank: nx.pagerank(G)
A node gains importance from the importance of the nodes around it
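To see how these measures differ, here is a small sketch on a toy graph. The five-node graph and its labels A to E are made up for illustration and are unrelated to the purchase-order data. It assumes networkx is installed. PageRank is left out only to keep the example dependency-light.

```python
import networkx as nx

# Toy graph: A is a hub for B, C and D; E hangs off D.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])

degree = dict(G.degree())                   # number of connections per node
closeness = nx.closeness_centrality(G)      # higher = closer to everyone else
betweenness = nx.betweenness_centrality(G)  # higher = on more shortest paths

for node in sorted(G.nodes()):
    print(node, degree[node], round(closeness[node], 3), round(betweenness[node], 3))

most_between = max(betweenness, key=betweenness.get)
most_close = max(closeness, key=closeness.get)
```

On this toy graph every measure picks A, but D still outranks the leaf nodes on betweenness because every path to E runs through it.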

Page 28: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Centrality Analysis (networkx)


Page 29: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Centrality Analysis (networkx)

Digi Docs Inc Document Managers (Dallas)
“Offers software that generates loan documents for electronic delivery.”

Iron Mountain (Mountain View)
“Iron Mountain provides information management services that help organizations lower the costs, risks and inefficiencies of managing their physical and digital data.”

MVS, Inc. (Washington, DC)
“MVS Consulting is an 8(a) STARS II, HUBZone, LSDBE, CBE, and MBE IT Solutions company that provides IT solutions to Federal, State and Local Government Agencies.”

MDM OFFICE SYSTEMS INC (Washington, DC)
“Standard Office Supply - Office Supplies, Furniture Dealer, Educational Products, Breakroom Supplies, Imaging Supplies, and Coffee Services”

Capital Services and Supplies (Washington, DC)
“CSSI is an office solutions firm located in Washington, DC since 1980. CSSI’s goods and services are available to commercial, government, and educational institutions throughout the continental United States.”

Page 30: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Centrality Analysis (networkx)

Not included in previous slide...

United States Postal Service & Dell Computer Corp

Page 31: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Visualize the network

import matplotlib.pyplot as plot

pos = nx.spring_layout(G, iterations=100)
plot.figure(1, figsize=(15, 15))
plot.axis('off')

nx.draw_networkx_nodes(G, pos, node_size=100, alpha=1, node_color='g')
nx.draw_networkx_edges(G, pos, alpha=0.2)
plot.savefig('graph.png')

Page 32: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Visualize the network

Page 33: PyCon 2012: Python for data lovers: explore it, analyze it, map it

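Trimming nodes

# wrapped in a function so that `return` is valid; the name is illustrative
def trim_nodes(G, degree):
    g2 = G.copy()
    d = dict(nx.degree(g2))
    # iterate over a snapshot, since nodes are removed inside the loop
    for n in list(g2.nodes()):
        if d[n] <= degree:
            g2.remove_node(n)
    return g2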

Page 34: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Degree Distribution

d = dict(nx.degree(G))
plot.figure(1, figsize=(15, 10))
h = plot.hist(list(d.values()), 100)
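When only the counts are needed and not a plot, networkx can return the degree distribution directly. A minimal sketch on a made-up five-node graph (the labels are illustrative and not from the dataset), assuming networkx is installed:

```python
import networkx as nx

# Toy graph: A is a hub for B, C and D; E hangs off D.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])

# histogram[k] is the number of nodes whose degree is exactly k.
histogram = nx.degree_histogram(G)

for degree_value, node_count in enumerate(histogram):
    print(f"degree {degree_value}: {node_count} node(s)")
```

A long tail in these counts, with many low-degree nodes and a few hubs, is what motivates trimming low-degree nodes before drawing.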

Page 35: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Degree Distribution

Page 36: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Degree Distribution

Page 37: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Trimmed nodes

Page 38: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Adding labels

Page 39: PyCon 2012: Python for data lovers: explore it, analyze it, map it

nx.draw_networkx_labels(g3, pos, alpha=1)
nx.draw_networkx_edges(g3, pos, alpha=0.05)

Page 40: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Maps to maps


Page 41: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Spatial is special

• spatial data = attributes, location, time

• mappable!

• spatial data must be referenced in space

• Tobler’s First Law of Geography


Page 42: PyCon 2012: Python for data lovers: explore it, analyze it, map it

• large data sets → a smaller amount of meaningful information

• exploratory (ESDA)

• spatial statistics

• mathematical modeling and prediction of spatial processes

Spatial analysis


Page 43: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Techniques

• point pattern analysis -- hot spots, k density, nearest neighbor

• spatial interpolation -- kriging

• spatial regression -- ordinary least squares, geographically weighted regression
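As a taste of point pattern analysis, the nearest-neighbor idea can be sketched in plain Python as the ratio of the observed mean nearest-neighbor distance to the distance expected for a random pattern of the same density (often called the Clark-Evans ratio). This is a simplified illustration with no edge correction. The function name, the sample points and the study-area size are all assumptions, and math.dist requires Python 3.8 or later.

```python
import math


def nearest_neighbor_index(points, area):
    """Observed mean nearest-neighbor distance / expected under randomness.

    Values below 1 suggest clustering, values near 1 a random pattern,
    and values above 1 a dispersed pattern.
    """
    n = len(points)
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if i != j)
        for i, p in enumerate(points)
    ]
    observed = sum(nearest) / n
    expected = 0.5 / math.sqrt(n / area)  # 1 / (2 * sqrt(density))
    return observed / expected


# Hypothetical points inside a 2 x 2 study area (area = 4.0).
dispersed = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
clustered = [(0.0, 0.0), (0.0, 0.1), (1.0, 1.0), (1.0, 1.1)]

r_dispersed = nearest_neighbor_index(dispersed, area=4.0)
r_clustered = nearest_neighbor_index(clustered, area=4.0)
print(r_dispersed, r_clustered)
```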


Page 44: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 45: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 46: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 47: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 48: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 49: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 50: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 51: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Page 52: PyCon 2012: Python for data lovers: explore it, analyze it, map it

PySAL

• GeoDa Center at ASU

• Python library for spatial analysis, with modules for exploratory spatial data analysis, spatial econometrics, and location modeling

• http://code.google.com/p/pysal/

• requires NumPy, SciPy
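To give a feel for the kind of statistic exploratory spatial data analysis relies on, here is a plain-Python sketch of global Moran's I, a standard measure of spatial autocorrelation. It deliberately does not use or imitate the PySAL API. The function name morans_i, the four-cell "chain" of neighbors and the sample values are assumptions made for illustration.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` under spatial `weights`.

    `weights` maps (i, j) index pairs to a weight; pairs that are absent
    are treated as non-neighbors (weight 0).
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [value - mean for value in values]
    total_weight = sum(weights.values())
    cross = sum(w * deviations[i] * deviations[j] for (i, j), w in weights.items())
    variance_sum = sum(d * d for d in deviations)
    return (n / total_weight) * cross / variance_sum


# Four areas in a row; each area neighbors the one next to it.
chain = {}
for i in range(3):
    chain[(i, i + 1)] = 1.0
    chain[(i + 1, i)] = 1.0

smooth = morans_i([1, 2, 3, 4], chain)       # similar values sit together
alternating = morans_i([1, 4, 1, 4], chain)  # neighbors are always different
print(round(smooth, 3), round(alternating, 3))
```

Positive values indicate that similar values cluster in space, which is the pattern Tobler's First Law leads one to expect. Negative values indicate a checkerboard-like arrangement.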


Page 53: PyCon 2012: Python for data lovers: explore it, analyze it, map it

PySAL

• developers looking for spatial analytical methods to incorporate in application development

• analysts working on projects that require custom scripting

• looking for a user-friendly GUI? Try STARS, GeoDa, GeoDaSpace.

• want to integrate into a powerful GIS? Look for plug-ins for ArcGIS & QGIS.

Page 54: PyCon 2012: Python for data lovers: explore it, analyze it, map it


Page 55: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Next steps

• quantify clusters in city, region, nation

• examine clusters along networks, business corridors

• create beautiful, interactive maps and charts to allow users to explore spending patterns on their own

Page 56: PyCon 2012: Python for data lovers: explore it, analyze it, map it

From data analysis to stories

Page 57: PyCon 2012: Python for data lovers: explore it, analyze it, map it

Which stories would we go after?

• construction contracts

• funding to charter schools

• health care costs in prisons

• local vs. regional vs. national purchases

• technology services -- look for overlap


Page 58: PyCon 2012: Python for data lovers: explore it, analyze it, map it

The SAGE Handbook of Spatial Analysis, eds. A. Stewart Fotheringham and Peter A. Rogerson

Interactive Spatial Data Analysis, Trevor Bailey and Tony Gatrell

Geographic Information Analysis, David O’Sullivan and David Unwin

PySAL, Luc Anselin, GeoDa Center, Arizona State University

Want to learn more?

Mia, age 3, geographer in training
