PyCon 2012: Python for Data Lovers: Explore It, Analyze It, Map It
DESCRIPTION
Slides from PyCon 2012. Speakers: Jacqueline Kazil, Dana Bauer. More info: https://us.pycon.org/2012/schedule/presentation/426/ | Video: http://pyvideo.org/video/676/python-for-data-lovers-explore-it-analyze-it-m
TRANSCRIPT
Python for Open Data Lovers: Explore It, Analyze It, Map It
Jackie Kazil (@jackiekazil)
Dana Bauer (@geography76)
Saturday, March 10, 2012
Where are we going?
• open data everywhere
• a data swiss army knife
• finding network patterns
• finding spatial patterns
• which stories to pursue? moving beyond data analysis
• Data.gov
• OpenDataPhilly
• DC Data Catalog
• DataSF
• Chicago Data Portal
• NYC Open Data
• London Datastore
• assembly member expenses
• bicycle lanes
• city purchase orders
• dialysis centers
• elevation data
• filming locations
• Google Transit Feed Specification (GTFS)
• historical photos
• influenza rates
• judicial districts
• Key Stage 2 test results by free school meal eligibility
• land cover
• monthly calls to Human Services Agency switchboard operators
• neighborhood health clinics
• Oyster ticket stop locations
• political districts
• quality of life indicators
• restaurant inspections
• sewer lines
• traffic counts
• utility excavation and paving five-year plan
• violent crime incidents
• ward offices
• youth centers
• zoning
• **real-time parking availability and pricing**
• What are DC agencies spending money on?
• How much are they spending?
• What are the relationships between businesses and agencies?
• Where are these businesses located?
swiss army knife
• csvkit: http://csvkit.readthedocs.org/
• a set of Python utilities for working with csv
• meant to replace csv module
• pip install csvkit (no issues!)
$ csvcut -n purchase2011_cleaned.csv
  1: PO_NUMBER
  2: AGENCY_NAME
  3: NIGP_DESCRIPTION
  4: PO_TOTAL_AMOUNT
  5: ORDER_DATE
  6: SUPPLIER
  7: SUPPLIER_FULL_ADDRESS
$ csvcut -c 2,6 purchase2011_cleaned.csv | csvstat
  1. AGENCY_NAME
    <type 'unicode'>
    Nulls: False
    Unique values: 85
    5 most frequent values:
        DISTRICT OF COLUMBIA PUBLIC SCHOOLS:        2410
        STATE SUPERINTENDENT OF EDUCATION (OSSE):   1340
        DEPARTMENT OF HEALTH:                       895
        OFFICE OF CHIEF TECHNOLOGY OFFICER:         786
        OFF PUBLIC ED FACILITIES MODERNIZATION:     722
    Max length: 40
  2. SUPPLIER
    <type 'unicode'>
    Nulls: False
    Unique values: 4357
    5 most frequent values:
        OST, INC.:                       841
        DELL COMPUTER CORP.:             366
        AMERICAN EXPRESS COMPANY:        282
        MVS, INC.:                       176
        CAPITAL SERVICES AND SUPPLIES:   167
    Max length: 52

Row count: 16075
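For readers who want the same first look from inside a script, the two commands above can be approximated with the standard library. This is a rough sketch, not how csvkit is implemented: the function names are made up for illustration, and the sample rows are invented stand-ins that only borrow column names from the slides.

```python
import csv
import io
from collections import Counter

# Invented sample rows (not the real purchase order file); only the
# column names come from the slides.
SAMPLE = (
    "PO_NUMBER,AGENCY_NAME,SUPPLIER\n"
    'PO1,DEPARTMENT OF HEALTH,"OST, INC."\n'
    "PO2,DEPARTMENT OF HEALTH,DELL COMPUTER CORP.\n"
    'PO3,PUBLIC CHARTER SCHOOLS,"OST, INC."\n'
    'PO4,DEPARTMENT OF HEALTH,"OST, INC."\n'
)


def column_names(csv_file):
    """Number the header fields from 1, similar to what `csvcut -n` prints."""
    header = next(csv.reader(csv_file))
    return list(enumerate(header, start=1))


def most_frequent(csv_file, column, n=5):
    """Count the values in one column and return the n most common."""
    counts = Counter(row[column] for row in csv.DictReader(csv_file))
    return counts.most_common(n)


names = column_names(io.StringIO(SAMPLE))
top_suppliers = most_frequent(io.StringIO(SAMPLE), "SUPPLIER")
print(names)
print(top_suppliers)
```

With a real file, the `io.StringIO(SAMPLE)` arguments would be replaced by an open file handle.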
$ csvgrep -c 6 -r ^MAYA purchase2011_cleaned.csv
PO_NUMBER,AGENCY_NAME,NIGP_DESCRIPTION,PO_TOTAL_AMOUNT,ORDER_DATE,SUPPLIER,SUPPLIER_FULL_ADDRESS
PO352244,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,408644.73,01/04/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO352652,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,111679.16,01/07/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO352920,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2205630.13,01/11/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO355150,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,391092.49,02/07/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO356426,STATE SUPERINTENDENT OF EDUCATION (OSSE),FINANCIAL SERVICES (NOT OTHERWISE CLASSIFIED) 49,999891,02/23/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO356632,STATE SUPERINTENDENT OF EDUCATION (OSSE),PROFESSIONAL SERVICES (NOT OTHERWISE CLASSIFIED) 58,187200,02/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO359961,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,1753238,04/12/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO360284,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,110729.88,04/14/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO361203,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,92617.32,04/28/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO351462-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATIONAL RESEARCH SERVICES 19,152229.95,05/05/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO364208,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,118825.51,06/09/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO366839,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2767027,07/12/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO365094-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,98092.35,08/15/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO370948,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,45736.58,08/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO361027-V5,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,29424.86,09/06/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO374132,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,9000,09/28/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO377919,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,491663.6,10/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO381219,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,120188.81,11/29/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO383965,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,294690.57,12/22/2011,MAYA ANGELOU PCS,"1436 U STREET, NW SUITE 203, WASHINGTON, DC, 20009"
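A filter like the csvgrep call above is often followed by a total. The sketch below does both in plain Python, under a few assumptions: `grep_rows` is a hypothetical helper name, the sample keeps only three of the seven columns, and its rows are copied from amounts shown on the slides rather than read from the real file.

```python
import csv
import io
import re

# Three rows taken from figures on the slides, reduced to three columns.
SAMPLE = (
    "AGENCY_NAME,PO_TOTAL_AMOUNT,SUPPLIER\n"
    "STATE SUPERINTENDENT OF EDUCATION (OSSE),408644.73,MAYA ANGELOU PCS\n"
    "STATE SUPERINTENDENT OF EDUCATION (OSSE),111679.16,MAYA ANGELOU PCS\n"
    "PUBLIC CHARTER SCHOOLS,14206937.0,FRIENDSHIP PCS\n"
)


def grep_rows(csv_file, column, pattern):
    """Keep the rows whose value in `column` matches the regular expression."""
    regex = re.compile(pattern)
    return [row for row in csv.DictReader(csv_file) if regex.search(row[column])]


matches = grep_rows(io.StringIO(SAMPLE), "SUPPLIER", r"^MAYA")
total = sum(float(row["PO_TOTAL_AMOUNT"]) for row in matches)
print(len(matches), round(total, 2))
```

Summing in Python keeps the filter and the arithmetic in one place, which is handy once a supplier looks worth a closer look.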
$ csvcut -c 4,2,6,5 purchase2011_cleaned.csv | csvsort -r | head -n 20 | csvlook
----------------------------------------------------------------------------------------------------------
| PO_TOTAL_AMOUNT | AGENCY_NAME                            | SUPPLIER                       | ORDER_DATE |
----------------------------------------------------------------------------------------------------------
| 154133337.02    | DEPARTMENT OF TRANSPORTATION           | SKANSKA-FACCHINA JV            | 2011-11-10 |
| 62677473.88     | DEPARTMENT OF REAL ESTATE SERVICES     | EEC OF DC INC-FORRESTER CONSTR | 2011-09-22 |
| 31809425.48     | DEPARTMENT OF HEALTH                   | DEFENSE LOGISTIC AGENCY        | 2011-09-08 |
| 23600580.0      | DEPARTMENT OF CORRECTIONS              | UNITY HEALTH CARE, INC.        | 2011-10-24 |
| 23538552.0      | DEPARTMENT OF REAL ESTATE SERVICES     | EEC-FORRESTER ANACOSTIA        | 2011-11-08 |
| 22375314.45     | DEPARTMENT OF CORRECTIONS              | CORRECTIONS CORPORATION OF     | 2011-05-25 |
| 21450000.04     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-08-18 |
| 20813348.99     | DEPARTMENT OF REAL ESTATE SERVICES     | THE JOHN AKRIDGE CO            | 2011-06-28 |
| 20622000.0      | DEPARTMENT OF TRANSPORTATION           | W M SCHLOSSER CO INC           | 2011-08-29 |
| 19824914.0      | DEPARTMENT OF CORRECTIONS              | CORRECTIONS CORPORATION OF     | 2011-10-24 |
| 18300956.56     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-11-29 |
| 18104339.98     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-05-17 |
| 18000000.0      | DEPARTMENT OF HEALTH                   | DC PRIMARY CARE ASSOCIATION    | 2011-03-10 |
| 17000000.0      | DEPARTMENT OF HEALTH                   | CHILDRENS NATIONAL MEDICAL CTR | 2011-11-25 |
| 16850000.0      | DEPUTY MAYOR FOR ECONOMIC DEVELOPMENT  | 2 M STREET REDEVELOPMENT LLC   | 2011-09-29 |
| 16333257.33     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-06-02 |
| 14206937.0      | PUBLIC CHARTER SCHOOLS                 | FRIENDSHIP PCS                 | 2011-07-12 |
| 13862557.44     | MUNICIPAL FACILITIES: NON-CAPITAL      | US SECURITY ASSOCIATES, INC.   | 2011-10-07 |
| 13800000.0      | DISTRICT DEPARTMENT OF THE ENVIRONMENT | VERMONT ENERGY INVESTMENT CORP | 2011-10-04 |
----------------------------------------------------------------------------------------------------------
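The "largest purchase orders" view above can also be produced in a few lines of Python. This is a minimal sketch that sorts on the amount column only; `top_purchases` is a hypothetical name, and the sample holds three rows from the table above in shuffled order.

```python
import csv
import io

# Three rows from the table above, deliberately out of order.
SAMPLE = (
    "PO_TOTAL_AMOUNT,AGENCY_NAME,SUPPLIER,ORDER_DATE\n"
    "14206937.0,PUBLIC CHARTER SCHOOLS,FRIENDSHIP PCS,2011-07-12\n"
    "154133337.02,DEPARTMENT OF TRANSPORTATION,SKANSKA-FACCHINA JV,2011-11-10\n"
    "31809425.48,DEPARTMENT OF HEALTH,DEFENSE LOGISTIC AGENCY,2011-09-08\n"
)


def top_purchases(csv_file, n=19):
    """Return the n rows with the largest PO_TOTAL_AMOUNT, largest first."""
    rows = list(csv.DictReader(csv_file))
    # Compare amounts as numbers; as strings, "9" would sort above "154".
    rows.sort(key=lambda row: float(row["PO_TOTAL_AMOUNT"]), reverse=True)
    return rows[:n]


top = top_purchases(io.StringIO(SAMPLE), n=2)
for row in top:
    print(row["PO_TOTAL_AMOUNT"], row["SUPPLIER"])
```

The default of 19 mirrors the pipeline above, where `head -n 20` keeps the header line plus 19 data rows.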
Social Network Analysis
“Social network analysis is focused on uncovering the patterning of people's
interaction.” - http://www.insna.org/sna/what.html
99th House
President: Reagan
House majority: Democrats
Years: 1985, 1986

107th House
President: Bush
House majority: Republicans
Years: 2001, 2002

108th House
President: Bush
House majority: Republicans
Years: 2003, 2004

109th House
President: Bush
House majority: Republicans
Years: 2005, 2006

110th House
President: Bush
House majority: Democrats
Years: 2007, 2008

111th House
President: Obama
House majority: Democrats
Years: 2009, 2010
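CSV to network
    import networkx as nx

    G = nx.Graph()
    node_edgelist = []

    # grab edges: n (node) and e (edge) are read from each row;
    # the slide does not show which columns they come from
    for row in csv_file:
        node_edgelist.append((n, e))

    # create edges between the nodes of every pair of distinct entries;
    # add_edge_or_weight is a helper that is not shown on the slide
    for f in node_edgelist:
        for t in node_edgelist:
            if t != f:
                add_edge_or_weight(G, f[0], t[0])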
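The slide leaves out the row parsing and the helper, so here is one way the pieces might fit together. Everything beyond the networkx calls is an assumption: suppliers are treated as nodes, two suppliers are linked when the same agency bought from both, `add_edge_or_weight` is a guess at what the helper does, and `build_supplier_graph` and the sample rows are invented for illustration.

```python
import csv
import io
from itertools import combinations

import networkx as nx

# Invented sample rows; only the column names come from the slides.
SAMPLE = (
    "AGENCY_NAME,SUPPLIER\n"
    "DEPARTMENT OF HEALTH,DELL\n"
    "DEPARTMENT OF HEALTH,MVS\n"
    "PUBLIC SCHOOLS,DELL\n"
    "PUBLIC SCHOOLS,MVS\n"
    "PUBLIC SCHOOLS,OST\n"
)


def add_edge_or_weight(G, u, v):
    """Add edge u-v with weight 1, or add 1 to the weight if it exists."""
    if G.has_edge(u, v):
        G[u][v]["weight"] += 1
    else:
        G.add_edge(u, v, weight=1)


def build_supplier_graph(csv_file):
    """Link suppliers that share an agency (assumed linking rule)."""
    suppliers_by_agency = {}
    for row in csv.DictReader(csv_file):
        suppliers_by_agency.setdefault(row["AGENCY_NAME"], set()).add(row["SUPPLIER"])

    G = nx.Graph()
    for suppliers in suppliers_by_agency.values():
        G.add_nodes_from(suppliers)
        # each unordered pair once per agency, so weight = shared agencies
        for u, v in combinations(sorted(suppliers), 2):
            add_edge_or_weight(G, u, v)
    return G


G = build_supplier_graph(io.StringIO(SAMPLE))
print(sorted(G.edges(data="weight")))
```

Grouping by agency first avoids comparing every row with every other row, which matters with 16,075 purchase orders.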
Centrality Analysis (networkx)
• Degree: nx.degree(G)
  Number of connections; more connections = more important
• Closeness centrality: nx.closeness_centrality(G)
  Distance to all other nodes; closer = more important
• Betweenness centrality: nx.betweenness_centrality(G)
  How often a node lies on the shortest paths between other nodes; a measure of control over the flow of information
• PageRank: nx.pagerank(G)
  A node gains importance from the importance of the nodes around it
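To make the first two measures concrete, the sketch below computes them by hand on a toy star-shaped network. It is meant to show what the numbers mean, not how networkx computes them; the function names and the four-node graph are made up, and the closeness formula assumes a connected graph.

```python
from collections import deque

# Toy network: one hub connected to three leaves (hypothetical data).
ADJACENCY = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}


def degree(adj):
    """Number of connections per node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def closeness(adj):
    """(n - 1) / sum of shortest-path distances, for a connected graph."""
    scores = {}
    for source in adj:
        # breadth-first search gives shortest hop counts from source
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


degrees = degree(ADJACENCY)
close = closeness(ADJACENCY)
print(degrees["hub"], close["hub"], close["a"])
```

The hub scores highest on both measures: it has the most connections and is one step from everyone, while each leaf is two steps from the other leaves.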
Centrality Analysis (networkx)

Digi Docs Inc Document Mangers (Dallas)
“Offers software that generates loan documents for electronic delivery.”

Iron Mountain (Mountain View)
“Iron Mountain provides information management services that help organizations lower the costs, risks and inefficiencies of managing their physical and digital data.”

MVS, Inc. (Washington, DC)
“MVS Consulting is an 8(a) STARS II, HUBZone, LSDBE, CBE, and MBE IT Solutions company that provides IT solutions to Federal, State and Local Government Agencies.”

MDM OFFICE SYSTEMS INC (Washington, DC)
"Standard Office Supply - Office Supplies, Furniture Dealer, Educational Products, Breakroom Supplies, Imaging Supplies, and Coffee Services"

Capital Services and Supplies (Washington, DC)
“CSSI is an office solutions firm located in Washington, DC since 1980. CSSI’s goods and services are available to commercial, government, and educational institutions throughout the continental United States.”
Centrality Analysis (networkx)
Not included in previous slide...
• United States Postal Service
• Dell Computer Corp
Visualize the network
    # assumes: import matplotlib.pyplot as plt
    pos = nx.spring_layout(G, iterations=100)
    plt.figure(1, figsize=(15, 15))
    plt.axis('off')
    nx.draw_networkx_nodes(G, pos, node_size=100, alpha=1, node_color='g')
    nx.draw_networkx_edges(G, pos, alpha=0.2)
    plt.savefig('graph.png')
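Trimming nodes
    # the slide shows only the function body; the name and signature are assumed
    def trim_nodes(G, degree=1):
        g2 = G.copy()
        # snapshot the degrees so they do not change as nodes are removed
        d = dict(nx.degree(g2))
        # loop over a copy of the node list, since the loop removes nodes
        for n in list(g2.nodes()):
            if d[n] <= degree:
                g2.remove_node(n)
        return g2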
Degree Distribution
    # assumes: import matplotlib.pyplot as plt
    d = dict(nx.degree(G))
    plt.figure(1, figsize=(15, 10))
    h = plt.hist(list(d.values()), 100)
Trimmed nodes
Adding labels
    nx.draw_networkx_labels(g3, pos, alpha=1)
    nx.draw_networkx_edges(g3, pos, alpha=0.05)
Maps to maps
Spatial is special
• spatial data = attributes, location, time
• mappable!
• spatial data must be referenced in space
• Tobler’s First Law of Geography
Spatial analysis
• reduce large data sets to a smaller amount of meaningful information
• exploratory spatial data analysis (ESDA)
• spatial statistics
• mathematical modeling and prediction of spatial processes
Techniques
• point pattern analysis -- hot spots, k density, nearest neighbor
• spatial interpolation -- kriging
• spatial regression -- ordinary least squares, geographically weighted regression
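As a small taste of point pattern analysis, here is a nearest-neighbor sketch in plain Python. It computes the mean distance from each point to its closest neighbor and compares it with the value expected for a random pattern of the same density (the Clark-Evans ratio). The function names, the four points, and the study area are invented, and the code assumes planar (projected) coordinates rather than latitude and longitude.

```python
import math

# Hypothetical supplier locations on a 2 x 2 study area (projected units).
POINTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
STUDY_AREA = 4.0


def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point."""
    nearest = []
    for i, (x1, y1) in enumerate(points):
        distances = [
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if i != j
        ]
        nearest.append(min(distances))
    return sum(nearest) / len(nearest)


def nearest_neighbor_ratio(points, area):
    """Observed mean distance divided by the mean expected under randomness."""
    density = len(points) / area
    expected = 0.5 / math.sqrt(density)
    return mean_nearest_neighbor_distance(points) / expected


observed = mean_nearest_neighbor_distance(POINTS)
ratio = nearest_neighbor_ratio(POINTS, STUDY_AREA)
print(observed, ratio)
```

A ratio near 1 suggests a random pattern, below 1 suggests clustering, and above 1 suggests dispersion; the evenly spaced toy points land well above 1.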
PySAL
• GeoDa Center at ASU
• Python library for spatial analysis, with modules for exploratory spatial data analysis, spatial econometrics, and location modeling
• http://code.google.com/p/pysal/
• requires NumPy, SciPy
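One statistic commonly used in exploratory spatial data analysis is Moran's I, which puts a number on whether neighboring places have similar values. The sketch below writes the formula out in plain Python so the arithmetic is visible. It does not use or imitate PySAL's API; `morans_i`, the four-region layout, and the values are all made up for illustration.

```python
# Four hypothetical regions in a row; 1 marks regions that share a border.
WEIGHTS = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]


def morans_i(values, weights):
    """Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i ** 2)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]            # deviations from the mean
    s0 = sum(sum(row) for row in weights)     # total of all weights
    cross = sum(
        weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n)
    )
    return (n / s0) * (cross / sum(d * d for d in z))


smooth = morans_i([1, 2, 3, 4], WEIGHTS)       # neighbors are similar
checkerboard = morans_i([1, 4, 1, 4], WEIGHTS)  # neighbors are opposite
print(smooth, checkerboard)
```

Positive values mean similar values sit next to each other, and negative values mean neighbors tend to differ.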
PySAL
• developers looking for spatial analytical methods to incorporate in application development
• analysts working on projects that require custom scripting
• looking for a user-friendly GUI? Try STARS, GeoDa, GeoDaSpace.
• want to integrate into a powerful GIS? Look for plug-ins for ArcGIS & QGIS.
Next steps
• quantify clusters in city, region, nation
• examine clusters along networks, business corridors
• create beautiful, interactive maps and charts to allow users to explore spending patterns on their own
From data analysis to stories
Which stories would we go after?
• construction contracts
• funding to charter schools
• health care costs in prisons
• local vs. regional vs. national purchases
• technology services -- look for overlap
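The local vs. regional vs. national question could start with the state in SUPPLIER_FULL_ADDRESS. The sketch below assumes addresses end in "..., CITY, STATE, ZIP" as in the csvgrep output earlier, and it treats Maryland and Virginia as "regional", which is a judgment call rather than anything stated in the talk. `purchase_scope` is a hypothetical name, and the last two sample addresses are invented.

```python
from collections import Counter

REGIONAL_STATES = {"MD", "VA"}  # assumed definition of "regional"

ADDRESSES = [
    "1851 9TH STREET NW, WASHINGTON, DC, 20001",           # from the slides
    "1436 U STREET, NW SUITE 203, WASHINGTON, DC, 20009",  # from the slides
    "100 EXAMPLE ROAD, BETHESDA, MD, 20814",                # invented
    "1 SAMPLE WAY, AUSTIN, TX, 78701",                      # invented
]


def purchase_scope(address):
    """Classify an address by the state in its second-to-last field."""
    parts = [part.strip() for part in address.split(",")]
    if len(parts) < 2:
        return "unknown"
    state = parts[-2]
    if state == "DC":
        return "local"
    if state in REGIONAL_STATES:
        return "regional"
    return "national"


scope_counts = Counter(purchase_scope(address) for address in ADDRESSES)
print(scope_counts)
```

Real addresses are messier than this, so the "unknown" bucket is worth inspecting before drawing conclusions.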
Want to learn more?
• The SAGE Handbook of Spatial Analysis, eds. A. Stewart Fotheringham and Peter A. Rogerson
• Interactive Spatial Data Analysis, Trevor Bailey and Tony Gatrell
• Geographic Information Analysis, David O’Sullivan and David Unwin
• PySAL, Luc Anselin, GeoDa Center, Arizona State University
Mia, age 3, geographer in training
And even more?
• NetworkX tutorial: http://networkx.lanl.gov/networkx_tutorial.pdf
• UCD Dublin summer course: http://mlg.ucd.ie/summer
• Social Network Analysis for Startups (O'Reilly Media): http://shop.oreilly.com/product/0636920020424.do