Large Scale Geo Processing on Hadoop

Christoph Koerner

Post on 15-Apr-2017

About me

● Data Scientist at T-Mobile Austria

● Visual Computing at Vienna University of Technology

● Author of Data Visualizations with D3 and AngularJS

● Author of Learning Responsive Data Visualization

● Organizer of Vienna Kaggle Meetup

● LinkedIn: linkedin.com/in/christophkoerner

● Twitter: @ChrisiKrnr

● Google+: +ChrisiHififm

● Github: github.com/chaosmail

Overview

1. Introduction

2. Pre-processing

3. Large scale geo processing on Hadoop

4. Some use-cases

Introduction

What is Geo Processing?

● Operations to manipulate spatial data

● Operations include geographic feature overlay, feature selection and analysis, topology processing, raster processing, and data conversion

Source: Wikipedia

Spatial Data

● Contains data for a spatial reference

● Mostly 0D, 1D, or 2D geometries such as points, lines, and polygons

● Usually in latitude and longitude (or x and y) coordinates

3D or 2D

What are Latitude and Longitude?

Source: imgur.com

3D Coordinates

● Earth is not a perfect sphere!

● Can be approximated by a biaxial ellipsoid

● 3D coordinates need a reference ellipsoid

● Most widely used is the World Geodetic System (WGS84) used by GPS

● Minimal positioning error on the surface
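To make the role of the reference ellipsoid concrete, here is a minimal Python sketch that converts latitude, longitude and height into 3D Cartesian (earth-centered, earth-fixed) coordinates using the WGS84 constants. The function name `geodetic_to_ecef` is an assumption made for this illustration and is not from the slides.

```python
import math

# WGS84 defining constants
WGS84_A = 6378137.0                     # semi-major axis in meters
WGS84_F = 1.0 / 298.257223563           # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)    # first eccentricity squared


def geodetic_to_ecef(lat_deg, lon_deg, height_m=0.0):
    """Convert WGS84 latitude/longitude/height to ECEF x, y, z in meters.

    Hypothetical helper for illustration only.
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Prime vertical radius of curvature at this latitude
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + height_m) * math.cos(lat) * math.cos(lon)
    y = (n + height_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - WGS84_E2) + height_m) * math.sin(lat)
    return x, y, z
```

On the equator at the prime meridian the result lies one semi-major axis from the earth's center, while at the pole the distance shrinks to the semi-minor axis, which is the flattening the slide refers to.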

3D or 2D

Going to 2 Dimensions

2D Projections

● The earth cannot be displayed on a 2D map without distortion

● Projection onto an auxiliary surface that can be flattened into a plane

○ Cylindrical

○ Conical

○ Azimuthal

2D Projections

● Every mapping has its tradeoff

○ Length Preserving (Equidistant)

○ Area Preserving (Equal Area)

○ Angle Preserving (Conformal)
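The tradeoff can be illustrated with the spherical Mercator projection, which is cylindrical and conformal: it preserves angles but stretches lengths by 1/cos(latitude). The sketch below is a simplified illustration; the function names and the choice of sphere radius are assumptions for this example.

```python
import math

# Sphere radius in meters (assumption for this sketch: the WGS84 semi-major axis)
SPHERE_RADIUS_M = 6378137.0


def mercator_forward(lat_deg, lon_deg, radius=SPHERE_RADIUS_M):
    """Project latitude/longitude onto the plane with spherical Mercator."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * lon
    y = radius * math.log(math.tan(math.pi / 4.0 + lat / 2.0))
    return x, y


def mercator_scale_factor(lat_deg):
    """Local length distortion of Mercator at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))
```

At 60° latitude the scale factor is about 2, so lengths appear doubled and areas roughly quadrupled, which is why a conformal map is a poor choice for area computations.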

2D Projections

● Commonly used in Austria: MGI Austria Lambert (conformal conic)

● Commonly used in the US: Albers USA projection (equal area)

3D or 2D

Pre-processing

Geo Processing on Hadoop

● Acquire spatial dataset

● Pre-process the dataset

● Load the dataset into HDFS

● Perform topological processing and analysis using Hive

● Visualize the results

Data Sources

● https://www.data.gv.at: Free Austrian data, demographics, health, tourism, public transport, etc.

● GIP (Graphenintegrations-Plattform): Austrian traffic graph, public transport, streets, etc.

● GADM (Database of Global Administrative Areas): Country shapes

● Many more ...

GDAL - Geospatial Data Abstraction Library

● Translator library for raster and vector geospatial data formats

● Converts spatial data between file formats, reference systems and projections

● SQL query syntax

● Command-line tool (MIT license)

Source: gdal.org

GDAL - Transform Shapefiles to CSV

ogr2ogr -f CSV output.csv input.shp \
  -lco GEOMETRY=AS_WKT \
  -lco SEPARATOR=SEMICOLON \
  -oo ENCODING=UTF-8
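A CSV produced this way can be inspected with plain Python before loading it into HDFS. The sketch below assumes the geometry column is named `WKT` and that the file contains point geometries; the sample rows and the `name` attribute column are invented for illustration.

```python
import csv
import io

# Hypothetical sample of a semicolon-separated export; the "name" column
# and both rows are made up for this illustration.
SAMPLE_CSV = (
    'WKT;name\n'
    '"POINT (16.37 48.21)";Vienna\n'
    '"POINT (13.04 47.80)";Salzburg\n'
)


def parse_wkt_point(wkt):
    """Parse a 2D 'POINT (x y)' string into an (x, y) tuple of floats."""
    text = wkt.strip()
    if not text.upper().startswith("POINT"):
        raise ValueError("not a WKT point: %r" % wkt)
    inner = text[text.index("(") + 1:text.rindex(")")]
    x, y = inner.split()
    return float(x), float(y)


def read_points(handle):
    """Read (name, (x, y)) pairs from a semicolon-separated CSV handle."""
    reader = csv.DictReader(handle, delimiter=";")
    return [(row["name"], parse_wkt_point(row["WKT"])) for row in reader]


points = read_points(io.StringIO(SAMPLE_CSV))
```

For a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle; other geometry types would need a fuller WKT parser.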

GDAL - Use spatial queries

ogr2ogr -sql "SELECT A.* FROM shape1 A, shape2 B WHERE ST_Intersects(A.geo, B.geo)" \
  -dialect SQLITE \
  data input_dir \
  -nln output.shp

More complex pre-processing

● Fiona for loading Shapefiles

● Shapely for geo processing

● Complex pre-processing, extraction, transformations, area and length computations, etc.
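As a dependency-free sketch of the kind of area and length computation mentioned above, the following Python snippet computes the planar area (shoelace formula) and perimeter of a polygon ring. In practice a library such as Shapely would do this; the helper names here are assumptions for illustration, and the results are in the units of the input coordinates.

```python
import math


def ring_area(coords):
    """Unsigned planar area of a simple polygon ring (shoelace formula)."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def ring_length(coords):
    """Planar perimeter of a polygon ring, closing the ring implicitly."""
    n = len(coords)
    length = 0.0
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        length += math.hypot(x2 - x1, y2 - y1)
    return length


# Unit square used as a simple check
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

Both helpers also accept rings whose last point repeats the first, because the extra closing edge has zero length and contributes nothing to the area sum.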

ESRI Tools on Hadoop

Source: ESRI Github

ESRI Tools on Hadoop

● Open Source Tools from ESRI

● Provided under Apache License 2.0

● Geo processing tools for Hadoop + ArcGIS

● Active development and feedback (on GitHub)

ESRI Tools on Hadoop

● Esri Geometry API for Java: Java library for geo processing

● Spatial Framework for Hadoop: Hive SerDe and UDFs based on the geometry API

● Geoprocessing Tools for Hadoop: Tools for data exchange between ArcGIS and HDFS

● GIS Tools for Hadoop: Sample application and demos

Spatial Framework for Hadoop

ADD JAR libs/esri-geometry-api.jar;
ADD JAR libs/spatial-sdk-hadoop.jar;

CREATE TEMPORARY FUNCTION ST_Point AS 'com.esri.hadoop.hive.ST_Point';
CREATE TEMPORARY FUNCTION ST_LineString AS 'com.esri.hadoop.hive.ST_LineString';
CREATE TEMPORARY FUNCTION ST_Polygon AS 'com.esri.hadoop.hive.ST_Polygon';

Spatial Framework for Hadoop

SELECT ST_Area(
  ST_Polygon(0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0)
) FROM src LIMIT 1;

SELECT ST_AsText(
  ST_Centroid(ST_GeomFromText(geo_as_wkt))
) FROM src LIMIT 1;

Problems

● No persistent spatial indices

● No projections: length and area results are in the units of the input coordinates!

● Binary output by default

● Doesn’t work with Hive’s vectorized query execution

● No visualization

● Not feature complete (but most things work)
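The missing-projection problem matters because a length computed directly on latitude/longitude values comes out in degrees, not meters. One possible workaround, sketched here in Python under a spherical-earth assumption, is the haversine great-circle distance; the function name and the mean radius constant are choices made for this illustration, not part of the ESRI tools.

```python
import math

# Mean earth radius in meters (spherical approximation, assumption for this sketch)
MEAN_EARTH_RADIUS_M = 6371008.8


def haversine_m(lat1, lon1, lat2, lon2, radius=MEAN_EARTH_RADIUS_M):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2)
    return 2.0 * radius * math.asin(math.sqrt(a))
```

One degree of longitude spans roughly 111 km at the equator but only about half of that at 60° latitude, so treating degrees as a uniform length unit gives misleading results.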

Use Cases for Geo Processing

Use Case: Geo Processing @ T-Mobile Austria

● Network traffic analysis and optimization

● Signal performance along railway tracks

● Better analysis of network coverage

● Many more..

Use Case: Trips Analysis @ Uber

● What do trips look like?

● How can we reduce wait time and make more trips?

● Are there new products we should introduce?

Source: slideshare.net

Use Case: Traffic Jam Prediction based on GPS/FCD (Floating Car Data)

● Estimate average speed of cars on road

● Compare to the max speed on each street

● Use public traffic jam data as ground truth

● Train a model to predict traffic jams
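The first two steps above can be sketched in a few lines of Python: average the observed speeds per street and relate them to the allowed maximum. All street IDs, speeds and limits below are hypothetical values invented for illustration, and the resulting ratio is only one conceivable input feature for a model, not necessarily the one used in the talk.

```python
from collections import defaultdict

# Hypothetical floating car data samples: (street_id, speed in km/h)
SAMPLES = [("A1", 30.0), ("A1", 50.0), ("B2", 48.0), ("B2", 52.0)]

# Hypothetical maximum speed per street in km/h
MAX_SPEED_KMH = {"A1": 130.0, "B2": 50.0}


def average_speed_by_street(samples):
    """Average the observed speeds for each street."""
    totals = defaultdict(lambda: [0.0, 0])
    for street_id, speed in samples:
        totals[street_id][0] += speed
        totals[street_id][1] += 1
    return {street: total / count for street, (total, count) in totals.items()}


def speed_ratio_by_street(avg_speeds, max_speeds):
    """Ratio of average observed speed to the maximum speed per street."""
    return {
        street: avg / max_speeds[street]
        for street, avg in avg_speeds.items()
        if street in max_speeds
    }


avg_speeds = average_speed_by_street(SAMPLES)
ratios = speed_ratio_by_street(avg_speeds, MAX_SPEED_KMH)
```

A ratio far below 1 suggests slow traffic relative to the limit; public traffic jam reports could then serve as labels for such features when training a model.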

Thank you.

Christoph Koerner