2016 FOSS4G Track: Developing and Implementing Spatial ETL Processes with Open Source Tools by...

Developing and Implementing Spatial ETL Processes with Open Source Tools Matthew Baker Denver Public Schools Dep't of Planning and Analysis

Upload: gis-in-the-rockies

Post on 07-Jan-2017


TRANSCRIPT

Developing and Implementing Spatial ETL Processes with Open Source Tools

Matthew Baker

Denver Public Schools, Dep't of Planning and Analysis

Quick FOSS4G Update

● Replaced ArcSDE with PostGIS
– Dev, QA (Prod TBA...)
● 'Non-GIS' users using QGIS
● Editable PostGIS Views!
● Data-driven cartography

– OSM data with styles saved in PostGIS

● Manager now using QGIS

Database Structure

[Diagram labels: SQL Server; Dev, QA, Prod; PostGIS; ArcGIS; Enterprise; maps.dpsk12.org]

ETL: Extract, Transform, Load

ETL Process Needs

● PostGIS read/write
● SQL Server Spatial read/write
● Daily Updating of Tables
● Daily Building of Datasets
● Daily Delivery to DPS Enterprise

Goals of ETL Development

● Break dependency on GUI-based tools
● Overcome 'other' FOSS ETL Tools
– Geokettle
– GDAL
● Avoid commercial ETL tools
– SSIS
– FME

Creating New Tables

● Import Shapefiles to Dev
– QGIS DB Manager
● Import Non-Spatial Tables
– CSVKit - Python via command line
– Read CSV Schema
    ● Generate SQL 'Create Table'
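The "read CSV schema, generate SQL" step can be sketched in plain Python. This is not csvkit's implementation (csvkit ships a `csvsql` command for this job). The function names, the three-type inference rule, and the sample rows are assumptions made for illustration only.

```python
import csv
import io


def infer_type(values):
    """Pick a simple SQL type that fits every non-empty value in a column."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "text"
    # Try the narrowest type first, then fall back to wider ones.
    for sql_type, caster in (("integer", int), ("double precision", float)):
        try:
            for v in non_empty:
                caster(v)
            return sql_type
        except ValueError:
            continue
    return "text"


def create_table_sql(csv_file, table_name):
    """Read a CSV header and its rows, then build a CREATE TABLE statement."""
    reader = csv.reader(csv_file)
    header = next(reader)
    rows = list(reader)
    columns = []
    for i, name in enumerate(header):
        col_type = infer_type([row[i] for row in rows])
        columns.append('    "{}" {}'.format(name, col_type))
    return 'create table "{}" (\n{}\n);'.format(table_name, ",\n".join(columns))


# Invented sample data standing in for a real non-spatial CSV file.
sample = io.StringIO(
    "schnum,school_name,enrollment\n"
    "101,East High,2500\n"
    "102,North High,1800.5\n"
)
ddl = create_table_sql(sample, "schools_sample")
print(ddl)
```

A real import would likely need more care than this, for example keeping codes with leading zeros as text rather than integers.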

PostGIS and Spatial “Text”

select geom
from dpsdata."Schools_Current"

PostGIS and Spatial “Text”

select cast(geom as varchar)
from dpsdata."Schools_Current"

PostGIS and Spatial “Text”

select ST_AsText(geom)
from dpsdata."Schools_Current"
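The queries above return the same geometry in different forms: the geometry value itself, its hex-encoded EWKB text (the result of the varchar cast), and well-known text from ST_AsText. As a rough sketch of what that hex text carries, the Python below encodes and decodes a single 2D point. The helper names, the coordinates, and SRID 4326 are assumptions for illustration; only points are handled.

```python
import struct

SRID_FLAG = 0x20000000  # EWKB flag bit: an SRID follows the type code
WKB_POINT = 1


def point_to_hexewkb(x, y, srid):
    """Encode a 2D point as little-endian, hex-encoded EWKB."""
    raw = struct.pack("<BIIdd", 1, WKB_POINT | SRID_FLAG, srid, x, y)
    return raw.hex().upper()


def hexewkb_to_wkt(hex_text):
    """Decode hex EWKB for a 2D point into (srid, WKT string)."""
    raw = bytes.fromhex(hex_text)
    byte_order = "<" if raw[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(byte_order + "I", raw, 1)
    offset = 5
    srid = None
    if geom_type & SRID_FLAG:
        (srid,) = struct.unpack_from(byte_order + "I", raw, offset)
        offset += 4
    if geom_type & ~SRID_FLAG != WKB_POINT:
        raise ValueError("this sketch only decodes 2D points")
    x, y = struct.unpack_from(byte_order + "dd", raw, offset)
    return srid, "POINT({:g} {:g})".format(x, y)


hex_text = point_to_hexewkb(-104.99, 39.74, 4326)
print(hex_text)                  # text form, as produced by a varchar cast
print(hexewkb_to_wkt(hex_text))  # readable form, comparable to ST_AsText
```

Because the hex form is plain text, it can be passed between databases as an ordinary string parameter.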

PostGIS Spatial Transformation

select ST_Transform(geom, 2877)
from dpsdata."Schools_Current"

select ST_AsText(ST_Transform(geom, 2877))
from dpsdata."Schools_Current"

Python for Databases

● Pypyodbc
– Pure python implementation of pyodbc
– Connect to databases using ODBC
    ● MS SQL Server
● Psycopg2
– PostgreSQL adapter (libraries) for Python

SQL Inserts

insert into tablename (column1, column2)
values ('value1', 'value2')

SQL Inserts with Parameters

INSERT INTO "Schools_Current"
(school_name, abbreviation, elem, mid, high, schnum, geom, classification)
VALUES (?, ?, ?, ?, ?, ?, ?, ?);*

* postgresql syntax (psycopg2 expects %s placeholders; ODBC drivers such as pypyodbc use ?)
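A parameterized insert keeps the values out of the SQL string and lets the driver handle quoting. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, because it accepts the same `?` placeholders as ODBC drivers; with psycopg2 the placeholders would be `%s`. The table name, columns, and values are invented for the example.

```python
import sqlite3

# In-memory database standing in for the real destination (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "create table schools_sample "
    "(school_name text, abbreviation text, schnum integer, geom text)"
)

# One placeholder per column; values are passed separately from the SQL.
sql = (
    "insert into schools_sample (school_name, abbreviation, schnum, geom) "
    "values (?, ?, ?, ?)"
)
params = ("O'Brien Example School", "OES", 999, "POINT(0 0)")
cur.execute(sql, params)
conn.commit()

cur.execute("select school_name, schnum from schools_sample")
result = cur.fetchall()
print(result)
conn.close()
```

Note that the apostrophe in the school name needs no manual escaping, which is one practical reason to prefer parameters over string formatting.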

Python ETL Pattern

[Flow diagram: Connect to databases, Select from Source, Truncate Destination, Insert into Destination]

Python ETL Pattern

● Connect to databases
– Source
– Destination
● Set Up Cursors
● Select from Source
– Use SQL Expression (with spatial function)
– Assign data to rows (in memory)
● Truncate Destination
● Insert into Destination
– Create insert statement with parameters
– Iterate through rows (data)
    ● Assign row values to variables
    ● Commit data with Insert
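The pattern can be sketched end to end with two in-memory SQLite databases standing in for the real source and destination. Everything named here (`run_etl`, `address_source`, `address_dest`, `geom_text`) is hypothetical, and SQLite has no TRUNCATE, so a DELETE stands in for it. The destination is emptied before the inserts so that stale rows do not survive the refresh.

```python
import sqlite3


def run_etl(conn_source, conn_dest):
    """Copy all rows from the source table to the destination table."""
    cur_source = conn_source.cursor()
    cur_dest = conn_dest.cursor()

    # Select from source. Against PostGIS, a spatial function such as
    # ST_AsText(geom) would appear in this expression.
    cur_source.execute("select addressid, geom_text from address_source")
    rows = cur_source.fetchall()  # rows are now held in memory

    # Truncate destination (DELETE used because SQLite lacks TRUNCATE).
    cur_dest.execute("delete from address_dest")

    # Insert into destination with a parameterized statement.
    insert_sql = "insert into address_dest (addressid, geom_text) values (?, ?)"
    for row in rows:
        addressid, geom_text = row
        cur_dest.execute(insert_sql, (addressid, geom_text))

    conn_dest.commit()
    return len(rows)


# Stand-in databases with invented data.
source = sqlite3.connect(":memory:")
source.execute("create table address_source (addressid integer, geom_text text)")
source.executemany(
    "insert into address_source values (?, ?)",
    [(1, "POINT(1 1)"), (2, "POINT(2 2)")],
)
source.commit()

dest = sqlite3.connect(":memory:")
dest.execute("create table address_dest (addressid integer, geom_text text)")
dest.execute("insert into address_dest values (99, 'stale row')")
dest.commit()

loaded = run_etl(source, dest)
dest_rows = dest.execute(
    "select addressid, geom_text from address_dest order by addressid"
).fetchall()
print(loaded, dest_rows)
```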

Example: PostGIS to PostGIS

import psycopg2

# Connect to the source and destination databases and set up cursors
connSource = psycopg2.connect("host=arcgisdev01 dbname=dpspgisdev user=dpsdata password=***")
curSource = connSource.cursor()
connDest = psycopg2.connect("host=FOSS4GLin01 dbname=dpspgisqa user=dpsdata password=***")
curDest = connDest.cursor()

# Select from the source, casting the geometry to text
curSource.execute('''select addressid, cast(geom as varchar) from public."Address_Master"''')

# Insert statement with parameters
sql = '''insert into dqmt.Address_Master (addressid, geom) values (%s, %s)'''

rows = curSource.fetchall()

# Insert each row into the destination, then commit once
for row in rows:
    data = [row[0], row[1]]
    curDest.execute(sql, data)
connDest.commit()

connSource.close()
connDest.close()

Deployed Processes

● Daily Active Students
– Extract from MSSQL View joining geometry to students
– Deliver to PostGIS and MSSQL
● Refresh Boundaries
– PostGIS Materialized Views
● Geocoding
● Enterprise Delivery
– Schools and Boundaries
– Shared Enrollment Zone Info
– Current Addresses and Boundary Information (spatial join)

Deployment

● Microsoft Windows Server
– Task Scheduler
– (still doesn't run FME / ArcPy scripts)

Ubuntu Server Deployment

● Cron Task Scheduler

0 3 * * * python /home/dpspgisqa/scripts/SchoolBoundaries_All.py

0 3 * * * python /home/dpspgisqa/scripts/SchoolBoundaries...

0 3 * * * python /home/dpspgisqa/scripts/Schools_Current.py

0 3 * * * python /home/dpspgisqa/scripts/Schools_Projected.py

* * * * * /folder/runThisFile.py
| | | | |
| | | | ----- Day of week (0 - 7) (Sunday=0 or 7)
| | | ------- Month (1 - 12)
| | --------- Day of month (1 - 31)
| ----------- Hour (0 - 23)
------------- Minute (0 - 59)
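Reading the entries against this layout, `0 3 * * *` means minute 0 of hour 3 on every day, so each script runs daily at 03:00. The short sketch below splits a crontab line into its five schedule fields and the command; `parse_cron_line` is a hypothetical helper written for this illustration, not part of cron or of any library.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron_line(line):
    """Split a crontab entry into its five schedule fields and the command."""
    parts = line.split(None, 5)  # at most 5 splits: the command keeps its spaces
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    schedule["command"] = parts[5]
    return schedule


entry = parse_cron_line("0 3 * * * python /home/dpspgisqa/scripts/Schools_Current.py")
for name in CRON_FIELDS + ("command",):
    print(name, "=", entry[name])
```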

Other Python Tricks

● Error Handling
– On script fail
    ● Send Email
    ● Insert message to database
● Run single SQL Script
– Within 1 database
● Bulk Inserts
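Two of these tricks can be sketched together: catching a failed job so that a message is written to a log table and an email is prepared, and loading many rows with one `executemany` call. SQLite again stands in for the real databases. The table names, email addresses, and function names are all assumptions, and the email is only built here, not sent.

```python
import sqlite3
import traceback
from email.message import EmailMessage


def build_failure_email(job_name, error_text):
    """Prepare (but do not send) a notification email for a failed job."""
    msg = EmailMessage()
    msg["Subject"] = "ETL job failed: {}".format(job_name)
    msg["From"] = "etl@example.org"        # placeholder address
    msg["To"] = "gis-team@example.org"     # placeholder address
    msg.set_content(error_text)
    return msg


def run_job(job_name, job, log_conn):
    """Run a job; on failure, log the traceback and return an email message."""
    try:
        job()
        return None
    except Exception:
        error_text = traceback.format_exc()
        log_conn.execute(
            "insert into etl_log (job_name, message) values (?, ?)",
            (job_name, error_text),
        )
        log_conn.commit()
        # Sending is omitted; smtplib.SMTP(host).send_message(msg) could go here.
        return build_failure_email(job_name, error_text)


def bulk_insert(conn, rows):
    """Insert many rows with a single executemany call."""
    conn.executemany(
        "insert into address_dest (addressid, geom_text) values (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("create table etl_log (job_name text, message text)")
conn.execute("create table address_dest (addressid integer, geom_text text)")

bulk_insert(conn, [(1, "POINT(1 1)"), (2, "POINT(2 2)")])
row_count = conn.execute("select count(*) from address_dest").fetchone()[0]


def failing_job():
    raise RuntimeError("source database unreachable")


email = run_job("Schools_Current", failing_job, conn)
logged = conn.execute("select job_name from etl_log").fetchall()
print(row_count, logged, email["Subject"])
```

psycopg2 cursors also provide an `executemany` method, so the same shape should carry over to PostGIS, though other loading approaches may be faster for large tables.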

Next Steps

● Implement PostGIS Prod server
– CentOS (new IT staff!!!)
● Document Internally
● Share Externally
– github.com/DPSSpatial
● Web maps
– Internal
– External

THANK YOU!


github.com/dpsspatial