2016 FOSS4G Track: Developing and Implementing Spatial ETL Processes with Open Source Tools by...
Post on 07-Jan-2017
Developing and Implementing Spatial ETL Processes with Open Source Tools
Matthew Baker
Denver Public Schools
Dep't of Planning and Analysis
Quick FOSS4G Update
● Replaced ArcSDE with PostGIS
– Dev, QA (Prod TBA...)
● 'Non-GIS' users using QGIS
● Editable PostGIS Views!
● Data-driven cartography
– OSM data with styles saved in PostGIS
● Manager now using QGIS
ETL Process Needs
● PostGIS read/write
● SQL Server Spatial read/write
● Daily Updating of Tables
● Daily Building of Datasets
● Daily Delivery to DPS Enterprise
Goals of ETL Development
● Break dependency on GUI-based tools
● Overcome 'other' FOSS ETL Tools
– Geokettle
– GDAL
● Avoid commercial ETL tools
– SSIS
– FME
Creating New Tables
● Import Shapefiles to Dev
– QGIS DB Manager
● Import Non-Spatial Tables
– CSVKit - Python via command line
– Read CSV Schema
  ● Generate SQL 'Create Table'
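The "read CSV schema, generate Create Table" step can be illustrated with a small standard-library sketch. This is not how CSVKit works internally; it is only an approximation of the idea, and the function names, the sample data, and the `schools_import` table name are all hypothetical.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest SQL type that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "text"
    for sql_type, cast in (("integer", int), ("double precision", float)):
        try:
            for v in non_empty:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "text"


def create_table_sql(csv_text, table_name):
    """Read a CSV header and rows, then emit a CREATE TABLE statement."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(header):
        col_type = infer_type([r[i] for r in body])
        columns.append('    "{}" {}'.format(name, col_type))
    return 'CREATE TABLE "{}" (\n{}\n);'.format(table_name, ",\n".join(columns))


# Hypothetical sample data, for illustration only
sample = "schnum,school_name,lat\n101,East High,39.74\n102,North High,39.78\n"
print(create_table_sql(sample, "schools_import"))
```

In practice the generated statement would still need review, since type inference from a sample of rows can guess wrong (for example, school numbers with leading zeros).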
PostGIS Spatial Transformation
select
ST_Transform(geom, 2877)
from dpsdata."Schools_Current"
select
ST_AsText(ST_Transform(geom, 2877))
from dpsdata."Schools_Current"
Python for Databases
● Pypyodbc
– Pure Python implementation of pyodbc
– Connect to databases using ODBC
  ● MS SQL Server
● Psycopg2
– PostgreSQL adapter (libraries) for Python
SQL Inserts with Parameters
INSERT INTO "Schools_Current"
(school_name, abbreviation, elem, mid, high, schnum, geom classification)
VALUES (?, ?, ?, ?, ?, ?, ?);*
* postgresql syntax
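The placeholder marker in a parameterized insert depends on the Python driver rather than on the SQL itself: ODBC drivers such as pypyodbc take `?`, while psycopg2 takes `%s`. A minimal sketch of building the statement for either style follows; the `build_insert` helper is hypothetical and the column list is shortened for illustration.

```python
def build_insert(table, columns, paramstyle="qmark"):
    """Build a parameterized INSERT for the given DB-API parameter style.

    "qmark" yields ? markers (ODBC drivers such as pypyodbc);
    "pyformat" yields %s markers (psycopg2).
    """
    marker = {"qmark": "?", "pyformat": "%s"}[paramstyle]
    col_list = ", ".join(columns)
    markers = ", ".join([marker] * len(columns))
    return 'INSERT INTO "{}" ({}) VALUES ({});'.format(table, col_list, markers)


columns = ["school_name", "abbreviation", "schnum"]
print(build_insert("Schools_Current", columns, "qmark"))
print(build_insert("Schools_Current", columns, "pyformat"))
```

Generating the markers from the column list keeps the number of placeholders in step with the number of columns. The values themselves are still passed separately to `cursor.execute`, never formatted into the string.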
Python ETL Pattern
Connect to databases
Truncate Destination
Insert into Destination
Select from Source
Python ETL Pattern
● Connect to databases
– Source
– Destination
● Set Up Cursors
● Select from Source
– Use SQL Expression (with spatial function)
– Assign data to rows (in memory)
● Insert into Destination
– Create insert statement with parameters
– Iterate through rows (data)
  ● Assign row values to variables
  ● Commit data with Insert
● Truncate Destination
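The pattern above can be sketched end to end with the standard-library `sqlite3` module standing in for PostGIS or SQL Server, so the example runs without a database server. The `address_master` table, the `run_etl` function, and the sample rows are assumptions made for illustration. SQLite has no `TRUNCATE`, so `DELETE FROM` plays that role here.

```python
import sqlite3


def run_etl(conn_source, conn_dest):
    """Copy every row from the source table into an emptied destination table."""
    # Set up cursors
    cur_source = conn_source.cursor()
    cur_dest = conn_dest.cursor()

    # Select from source; a PostGIS query would apply its spatial function here
    cur_source.execute("SELECT addressid, geom FROM address_master")
    rows = cur_source.fetchall()  # rows are held in memory

    # Truncate destination (DELETE stands in for TRUNCATE in SQLite)
    cur_dest.execute("DELETE FROM address_master")

    # Insert into destination with a parameterized statement
    insert_sql = "INSERT INTO address_master (addressid, geom) VALUES (?, ?)"
    for row in rows:
        addressid, geom = row  # assign row values to variables
        cur_dest.execute(insert_sql, (addressid, geom))
    conn_dest.commit()
    return len(rows)


# Stand-in source and destination databases with hypothetical data
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
for conn in (source, dest):
    conn.execute("CREATE TABLE address_master (addressid INTEGER, geom TEXT)")
source.executemany(
    "INSERT INTO address_master VALUES (?, ?)",
    [(1, "POINT(0 0)"), (2, "POINT(1 1)")],
)
source.commit()
dest.execute("INSERT INTO address_master VALUES (99, 'stale row')")
dest.commit()

print(run_etl(source, dest))
```

Truncating and inserting inside one transaction means a failed load rolls back to the previous day's data instead of leaving the destination empty.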
Example: PostGIS to PostGIS
import psycopg2

# Connect to the source and destination databases and set up cursors
connSource = psycopg2.connect("host=arcgisdev01 dbname=dpspgisdev user=dpsdata password=*** ")
curSource = connSource.cursor()
connDest = psycopg2.connect("host=FOSS4GLin01 dbname=dpspgisqa user=dpsdata password=*** ")
curDest = connDest.cursor()

# Select from the source, casting the geometry to text
curSource.execute('''select addressid, cast(geom as varchar) from public."Address_Master"''')

# Parameterized insert for the destination
sql = '''insert into dqmt.Address_Master (addressid, geom) values (%s, %s)'''

# Insert each source row into the destination, then commit
rows = curSource.fetchall()
for row in rows:
    data = [row[0], row[1]]
    curDest.execute(sql, data)
connDest.commit()

connSource.close()
connDest.close()
Deployed Processes
● Daily Active Students
– Extract from MSSQL View joining geometry to students
– Deliver to PostGIS and MSSQL
● Refresh Boundaries
– PostGIS Materialized Views
● Geocoding
● Enterprise Delivery
– Schools and Boundaries
– Shared Enrollment Zone Info
– Current Addresses and Boundary Information (spatial join)
Ubuntu Server Deployment
● Cron Task Scheduler
0 3 * * * python /home/dpspgisqa/scripts/SchoolBoundaries_All.py
0 3 * * * python /home/dpspgisqa/scripts/SchoolBoundaries...
0 3 * * * python /home/dpspgisqa/scripts/Schools_Current.py
0 3 * * * python /home/dpspgisqa/scripts/Schools_Projected.py
* * * * * /folder/runThisFile.py
| | | | |
| | | | ----- Day of week (0 - 7) (Sunday=0 or 7)
| | | ------- Month (1 - 12)
| | --------- Day of month (1 - 31)
| ----------- Hour (0 - 23)
------------- Minute (0 - 59)
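To make the field order concrete, here is a small sketch that splits a crontab line into the five named time fields plus the command. The `parse_cron_line` helper is hypothetical and does no validation of ranges or special syntax. An entry such as `0 3 * * *` runs once a day at 03:00.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron_line(line):
    """Split a crontab line into its five time fields and the command."""
    parts = line.split(None, 5)  # at most six pieces; the command keeps its spaces
    if len(parts) < 6:
        raise ValueError("expected five time fields and a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    schedule["command"] = parts[5]
    return schedule


entry = parse_cron_line("0 3 * * * python /home/dpspgisqa/scripts/Schools_Current.py")
for field in CRON_FIELDS + ("command",):
    print(field, "=", entry[field])
```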
Other Python Tricks
● Error Handling
– On script fail
  ● Send Email
  ● Insert message to database
● Run single SQL Script
– Within 1 database
● Bulk Inserts
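A rough sketch of two of these tricks, error handling and bulk inserts, again using `sqlite3` as a stand-in database. The `etl_log` and `schools` tables, the function names, and the email addresses are all hypothetical. The email is only built, not sent; sending it would go through `smtplib` against whatever mail server is available.

```python
import sqlite3
import traceback
from email.message import EmailMessage


def build_failure_email(job_name, error_text):
    """Build (but do not send) a notification about a failed job."""
    msg = EmailMessage()
    msg["Subject"] = "ETL job failed: {}".format(job_name)
    msg["From"] = "etl@example.org"        # placeholder address
    msg["To"] = "gis-team@example.org"     # placeholder address
    msg.set_content(error_text)
    return msg


def run_job(job_name, job, log_conn):
    """Run a job; on failure, log the traceback to the database and return an email."""
    try:
        job()
        return None
    except Exception:
        error_text = traceback.format_exc()
        log_conn.execute(
            "INSERT INTO etl_log (job_name, message) VALUES (?, ?)",
            (job_name, error_text),
        )
        log_conn.commit()
        return build_failure_email(job_name, error_text)


def bulk_load(conn, rows):
    """Insert all rows with one executemany call instead of a Python loop."""
    conn.executemany("INSERT INTO schools (schnum, school_name) VALUES (?, ?)", rows)
    conn.commit()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE etl_log (job_name TEXT, message TEXT)")
db.execute("CREATE TABLE schools (schnum INTEGER, school_name TEXT)")

ok = run_job("load_schools", lambda: bulk_load(db, [(101, "East"), (102, "North")]), db)
# Wrong number of values per row, so this job fails and is logged
failed = run_job("bad_job", lambda: bulk_load(db, [(1, 2, 3)]), db)
print(ok, failed["Subject"])
```

With psycopg2 the same shape applies, with `%s` markers in place of `?`.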
Next Steps
● Implement PostGIS Prod server
– CentOS (new IT staff!!!)
● Document Internally
● Share Externally
– github.com/DPSSpatial
● Web maps
– Internal
– External