2016 FOSS4G track: Developing and Implementing Spatial ETL Processes with Open Source Tools by...
Developing and Implementing Spatial ETL Processes with Open Source Tools
Matthew Baker
Denver Public Schools
Dep't of Planning and Analysis
Quick FOSS4G Update
● Replaced ArcSDE with PostGIS
– Dev, QA (Prod TBA...)
● 'Non-GIS' users using QGIS
● Editable PostGIS Views!
● Data-driven cartography
– OSM data with styles saved in PostGIS
● Manager now using QGIS
ETL Process Needs
● PostGIS read/write
● SQL Server Spatial read/write
● Daily Updating of Tables
● Daily Building of Datasets
● Daily Delivery to DPS Enterprise
Goals of ETL Development
● Break dependency on GUI-based tools
● Overcome 'other' FOSS ETL Tools
– Geokettle
– GDAL
● Avoid commercial ETL tools
– SSIS
– FME
Creating New Tables
● Import Shapefiles to Dev
– QGIS DB Manager
● Import Non-Spatial Tables
– CSVKit - Python via command line
– Read CSV Schema
  ● Generate SQL ‘Create Table’
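The "read the CSV schema, generate a Create Table statement" step can be sketched in plain Python. This is not csvkit itself (csvkit's csvsql command does this job from the command line); it is a standard-library stand-in, and the sample rows, table name and three-type inference rule are assumptions made for illustration.

```python
import csv
import io


def infer_type(values):
    """Pick a PostgreSQL column type that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "text"
    for caster, sql_type in ((int, "integer"), (float, "double precision")):
        try:
            for v in non_empty:
                caster(v)
            return sql_type
        except ValueError:
            continue
    return "text"


def create_table_sql(csv_text, table_name):
    """Build a CREATE TABLE statement from a CSV header and its rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(header):
        col_type = infer_type([r[i] for r in body])
        columns.append('    "{}" {}'.format(name.strip(), col_type))
    return 'CREATE TABLE "{}" (\n{}\n);'.format(table_name, ",\n".join(columns))


# Made-up sample data, for illustration only.
sample = "schnum,school_name,lat\n101,East High,39.74\n102,West Middle,39.73\n"
print(create_table_sql(sample, "schools_sample"))
```

Identifier-like columns with leading zeros would be inferred as integers here, so a real import would likely want a way to force such columns to text.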
PostGIS Spatial Transformation
select
ST_Transform(geom, 2877)
from dpsdata."Schools_Current"
select
ST_AsText(ST_Transform(geom, 2877))
from dpsdata."Schools_Current"
Python for Databases
● Pypyodbc
– Pure Python implementation of pyodbc
– Connect to databases using ODBC
  ● MS SQL Server
● Psycopg2
– PostgreSQL adapter (libraries) for Python
SQL Inserts with Parameters
INSERT INTO "Schools_Current"
(school_name, abbreviation, elem, mid, high, schnum, geom, classification)
VALUES (?, ?, ?, ?, ?, ?, ?, ?);*
* PostgreSQL syntax (psycopg2 expects %s placeholders; ? is the ODBC style used by pypyodbc)
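A minimal sketch of a parameterized insert, runnable without a database server. Python's built-in sqlite3 module stands in for the real destination (an assumption); like ODBC drivers it uses ? placeholders, whereas psycopg2 uses %s. The table layout and the row are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the real destination database (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE schools (school_name TEXT, abbreviation TEXT, schnum INTEGER)")

# One "?" per column; the values travel separately from the SQL text.
sql = "INSERT INTO schools (school_name, abbreviation, schnum) VALUES (?, ?, ?)"

# Hypothetical row; the apostrophe needs no manual escaping.
row = ("O'Example Elementary", "OEE", 999)
cur.execute(sql, row)
conn.commit()

cur.execute("SELECT school_name, schnum FROM schools")
result = cur.fetchall()
print(result)
conn.close()
```

Passing values as parameters rather than formatting them into the SQL string lets the driver handle quoting, which also guards against SQL injection.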
Python ETL Pattern
Connect to databases
Truncate Destination
Insert into Destination
Select from Source
Python ETL Pattern
● Connect to databases
  ● Source
  ● Destination
● Set Up Cursors
● Select from Source
  ● Use SQL Expression (with spatial function)
  ● Assign data to rows (in memory)
● Insert into Destination
  ● Create insert statement with parameters
  ● Iterate through rows (data)
  ● Assign row values to variables
  ● Commit data with Insert
● Truncate Destination
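The pattern above can be sketched as one function over two DB-API connections. Assumptions: the destination is cleared before the insert (the likely intended order), in-memory SQLite databases stand in for the real source and destination, and the function name and sample rows are hypothetical.

```python
import sqlite3


def run_etl(conn_source, conn_dest, select_sql, clear_sql, insert_sql):
    """Copy the result of select_sql into the destination table."""
    cur_source = conn_source.cursor()
    cur_dest = conn_dest.cursor()

    # Select from source and hold the rows in memory.
    cur_source.execute(select_sql)
    rows = cur_source.fetchall()

    # Empty the destination, then insert row by row with parameters.
    cur_dest.execute(clear_sql)
    for row in rows:
        cur_dest.execute(insert_sql, tuple(row))

    # One commit so the clear and the inserts land together.
    conn_dest.commit()
    return len(rows)


# Demo: two in-memory SQLite databases stand in for source and destination.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE address_master (addressid INTEGER, geom TEXT)")
source.executemany("INSERT INTO address_master VALUES (?, ?)",
                   [(1, "POINT(0 0)"), (2, "POINT(1 1)")])
source.commit()

dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE address_master (addressid INTEGER, geom TEXT)")
dest.execute("INSERT INTO address_master VALUES (99, 'stale row')")
dest.commit()

copied = run_etl(
    source, dest,
    "SELECT addressid, geom FROM address_master",
    "DELETE FROM address_master",  # SQLite has no TRUNCATE; PostgreSQL would use TRUNCATE TABLE
    "INSERT INTO address_master (addressid, geom) VALUES (?, ?)",
)
print(copied, dest.execute("SELECT * FROM address_master ORDER BY addressid").fetchall())
```

Committing once at the end means a failure partway through leaves the previous day's data in place rather than an empty or half-filled table.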
Example: PostGIS to PostGIS
import psycopg2

# Connect to the source and destination databases
connSource = psycopg2.connect("host=arcgisdev01 dbname=dpspgisdev user=dpsdata password=*** ")
curSource = connSource.cursor()
connDest = psycopg2.connect("host=FOSS4GLin01 dbname=dpspgisqa user=dpsdata password=*** ")
curDest = connDest.cursor()

# Select from the source; the geometry is cast to text for transfer
curSource.execute('''select addressid, cast(geom as varchar) from public."Address_Master"''')

# Parameterized insert for the destination
sql = '''insert into dqmt.Address_Master (addressid, geom) values (%s, %s)'''

rows = curSource.fetchall()
for row in rows:
    data = [row[0], row[1]]
    curDest.execute(sql, data)

# Commit once after all rows are inserted
connDest.commit()

connSource.close()
connDest.close()
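For large tables, one possible variation is to fetch and insert in batches instead of calling fetchall() and inserting row by row. The sketch below uses only the standard DB-API calls fetchmany() and executemany(); SQLite stands in for both PostGIS databases, and the function name and batch size are assumptions. With psycopg2 the placeholders would be %s, and psycopg2.extras.execute_values is another option for faster multi-row inserts.

```python
import sqlite3


def copy_in_batches(cur_source, cur_dest, insert_sql, batch_size=1000):
    """Move rows from an already-executed source cursor to the destination in batches."""
    total = 0
    while True:
        batch = cur_source.fetchmany(batch_size)
        if not batch:
            break
        cur_dest.executemany(insert_sql, batch)
        total += len(batch)
    return total


# Demo: in-memory SQLite stands in for both PostGIS databases (assumption).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE address_master (addressid INTEGER, geom TEXT)")
src.executemany("INSERT INTO address_master VALUES (?, ?)",
                [(i, "POINT({} {})".format(i, i)) for i in range(2500)])

dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE address_master (addressid INTEGER, geom TEXT)")

cur_src = src.cursor()
cur_src.execute("SELECT addressid, geom FROM address_master")
moved = copy_in_batches(cur_src, dst.cursor(),
                        "INSERT INTO address_master VALUES (?, ?)", batch_size=1000)
dst.commit()
print(moved)
```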
Deployed Processes
● Daily Active Students
– Extract from MSSQL View joining geometry to students
– Deliver to PostGIS and MSSQL
● Refresh Boundaries
– PostGIS Materialized Views
● Geocoding
● Enterprise Delivery
– Schools and Boundaries
– Shared Enrollment Zone Info
– Current Addresses and Boundary Information (spatial join)
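The "Refresh Boundaries" process could look roughly like the sketch below, which builds one REFRESH MATERIALIZED VIEW statement per view and runs them on a DB-API connection such as one from psycopg2. The view names are hypothetical, and the helper functions are illustrative rather than the scripts used in the talk.

```python
def quote_ident(name):
    """Double-quote a PostgreSQL identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def refresh_statements(views):
    """Build one REFRESH MATERIALIZED VIEW statement per (schema, view) pair."""
    return [
        "REFRESH MATERIALIZED VIEW {}.{};".format(quote_ident(s), quote_ident(v))
        for s, v in views
    ]


def refresh_all(conn, views):
    """Run the refresh statements on a DB-API connection and commit once."""
    cur = conn.cursor()
    for statement in refresh_statements(views):
        cur.execute(statement)
    conn.commit()


# Hypothetical view names, for illustration only.
views = [("dpsdata", "Boundaries_Elementary"), ("dpsdata", "Boundaries_Middle")]
for statement in refresh_statements(views):
    print(statement)
```

Quoting the identifiers matters for mixed-case names like "Schools_Current", which PostgreSQL would otherwise fold to lowercase.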
Ubuntu Server Deployment
● Cron Task Scheduler
0 3 * * * python /home/dpspgisqa/scripts/SchoolBoundaries_All.py
0 3 * * * python /home/dpspgisqa/scripts/SchoolBoundaries...
0 3 * * * python /home/dpspgisqa/scripts/Schools_Current.py
0 3 * * * python /home/dpspgisqa/scripts/Schools_Projected.py
* * * * * /folder/runThisFile.py
| | | | |
| | | | ----- Day of week (0 - 7) (Sunday=0 or 7)
| | | ------- Month (1 - 12)
| | --------- Day of month (1 - 31)
| ----------- Hour (0 - 23)
------------- Minute (0 - 59)
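To make the field order concrete, here is a small sketch that splits a crontab entry into the five schedule fields shown in the diagram plus the command. The function and field names are illustrative assumptions; cron itself does the real parsing.

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron_line(line):
    """Split a crontab entry into its five schedule fields and the command."""
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    entry = dict(zip(FIELDS, parts[:5]))
    entry["command"] = parts[5]
    return entry


entry = parse_cron_line("0 3 * * * python /home/dpspgisqa/scripts/Schools_Current.py")
print(entry["minute"], entry["hour"], entry["command"])
```

Read this way, `0 3 * * *` means minute 0 of hour 3 on every day, so each script above runs daily at 03:00.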
Other Python Tricks
● Error Handling
– On script fail
  ● Send Email
  ● Insert message to database
● Run single SQL Script
– Within 1 database
● Bulk Inserts
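A minimal sketch of the "on script fail, insert message to database" idea. An in-memory SQLite database stands in for the logging database, and the etl_log table, the wrapper function and the failing job are all hypothetical. An email could be sent from the same except branch (for example with the standard smtplib module); it is left out here because the mail server details are not known.

```python
import sqlite3
import traceback
from datetime import datetime


def run_with_logging(job_name, job, log_conn):
    """Run job(); on failure, record the error in a log table and return False."""
    try:
        job()
        return True
    except Exception:
        log_conn.execute(
            "INSERT INTO etl_log (job_name, logged_at, message) VALUES (?, ?, ?)",
            (job_name, datetime.now().isoformat(), traceback.format_exc()),
        )
        log_conn.commit()
        return False


# In-memory SQLite stands in for the logging database; etl_log is a hypothetical table.
log_db = sqlite3.connect(":memory:")
log_db.execute("CREATE TABLE etl_log (job_name TEXT, logged_at TEXT, message TEXT)")


def failing_job():
    # Simulated failure, for illustration only.
    raise RuntimeError("source database unreachable")


ok = run_with_logging("Schools_Current", failing_job, log_db)
logged = log_db.execute("SELECT job_name, message FROM etl_log").fetchall()
print(ok, logged[0][0])
```

Because cron runs the scripts unattended, recording the full traceback somewhere queryable makes a failed nightly run visible the next morning.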
Next Steps
● Implement PostGIS Prod server
– CentOS (new IT staff!!!)
● Document Internally
● Share Externally
– github.com/DPSSpatial
● Web maps
– Internal
– External