Magellan: Geospatial Analytics on Spark by Ram Sriharsha

Upload: spark-summit

Post on 14-Apr-2017

TRANSCRIPT

Page 1

Magellan: Geospatial Analytics on Spark

Ram Sriharsha

Spark and Data Science Architect, Hortonworks

Page 2

What is geospatial context?

• Given a point = (-122.412651, 37.777748), which city is it in?

• Does shape X intersect shape Y?
  – Compute the intersection

• Given a sequence of points and a system of roads
  – Compute the best path representing the points
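The first question above is a point-in-polygon test. A minimal sketch in plain Python, using the even-odd ray casting rule, is shown here for illustration only. The function name `point_in_polygon` and the rectangle `made_up_city` are hypothetical; the rectangle is not a real city boundary, and this is not a description of how Magellan answers the query.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, ring: List[Point]) -> bool:
    """Even-odd ray casting: cast a ray to the right of the point and
    flip `inside` each time the ray crosses a polygon edge."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangle used only for illustration (NOT a real city boundary).
made_up_city = [(-122.52, 37.70), (-122.35, 37.70),
                (-122.35, 37.83), (-122.52, 37.83)]

print(point_in_polygon((-122.412651, 37.777748), made_up_city))
```

Real boundaries are far more detailed than a rectangle, and there are many of them, which is why the cost of such tests matters at scale.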

Page 3

Geospatial context is useful

• What neighborhoods do people go to on weekends?
• Predict the drop-off neighborhood of a user
• Predict the location where the next pick-up can be expected
• How does the usage pattern change with time?

• Identify crime hotspot neighborhoods
• How do these hotspots evolve with time?
• Predict the likelihood of crime occurring in a given neighborhood

• Predict climate at a fairly granular level
• Climate insurance: do I need to buy insurance for my crops?
• Climate as a factor in crime: join the climate dataset with crimes

Page 4

Geospatial data is pervasive

Page 5

Why geospatial now?

Vast mobile data + geospatial = truly big data problem!

Page 6

Do you think we need one more geospatial library?

Page 7

Ancient data formats

Page 8

Coordinate System Hell!

Mobile data = GPS coordinates
Map coordinate systems optimized for precision
⇒ Transform from one to another

No good transformation libraries
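To make "transform from one to another" concrete, here is a small sketch that converts GPS (WGS84) longitude/latitude to Web Mercator (EPSG:3857) metres and back. Web Mercator is chosen here only because its formulas are short. It is an assumption for illustration, not one of the precision-optimized systems the slide refers to, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, the sphere radius Web Mercator uses


def wgs84_to_web_mercator(lon_deg: float, lat_deg: float):
    """Project longitude/latitude in degrees to Web Mercator x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def web_mercator_to_wgs84(x: float, y: float):
    """Inverse projection: Web Mercator metres back to degrees."""
    lon_deg = math.degrees(x / EARTH_RADIUS_M)
    lat_deg = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon_deg, lat_deg


x, y = wgs84_to_web_mercator(-122.412651, 37.777748)
print(web_mercator_to_wgs84(x, y))
```

Each pair of coordinate systems generally needs its own forward and inverse formulas plus datum handling, which is part of what makes this area painful without a good library.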

Page 9

Venn Diagram of geospatial libraries?

• Simple, intuitive, handles common formats
• Scalable
• Feature rich but still extensible

Page 10

Language integration simplifies exploratory analytics

[Diagram: machine learning pipeline. Labels: HDFS; Data Prep; Parse + Clean Logs; Ad category mapping; Query category mapping; Feature Extractors; Q-Q, Q-A similarity; PolyExp(Q-A) Features; Spatial Context; Train/Test Split; train; Model; Convex Solver; Test/validation; Metrics; Score Model; Ad Server; Feedback. Legend: Data Flow Stage - Real-time; Data Flow Stage - Batch.]

Page 11

Not all is lost!

• Local computations w/ ESRI Java API
• Scale out computation w/ Spark
• Python + R support without compromising performance via PySpark, SparkR
• Catalyst + Data Sources + Data Frames = Flexibility + Simplicity + Performance
• Stitch it all together + Allow extension points => Success!

Page 12

Magellan: a complete story for geospatial?

Create geospatial analytics applications faster:

• Use your favorite language (Python/Scala), even R
• Get best in class algorithms for common spatial analytics
• Write less code
• Read data efficiently
• Let the optimizer do the heavy lifting

Page 13

How does it work?

Custom Data Types for Shapes:

• Point, Line, PolyLine, Polygon extend Shape
• Local Computations using ESRI Java API
• No need for Scala -> SQL serialization

Expressions for Operators:

• Literals, e.g. point(-122.4, 37.6)
• Boolean Expressions, e.g. Intersects, Contains
• Binary Expressions, e.g. Intersection
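The difference between a Boolean expression and a binary expression can be sketched with axis-aligned boxes: `intersects` and `contains` return true/false, while `intersection` returns a new shape. The `Box` type and these functions are hypothetical stand-ins written for this illustration, not Magellan's Shape classes or the ESRI API.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Axis-aligned rectangle, a deliberately simple stand-in for a shape."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Box, b: Box) -> bool:
    """Boolean expression: do the two boxes share at least one point?"""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


def contains(a: Box, b: Box) -> bool:
    """Boolean expression: does box a fully cover box b?"""
    return (a.xmin <= b.xmin and a.ymin <= b.ymin and
            a.xmax >= b.xmax and a.ymax >= b.ymax)


def intersection(a: Box, b: Box) -> Optional[Box]:
    """Binary expression: the overlapping box, or None when there is no overlap."""
    if not intersects(a, b):
        return None
    return Box(max(a.xmin, b.xmin), max(a.ymin, b.ymin),
               min(a.xmax, b.xmax), min(a.ymax, b.ymax))


a = Box(0.0, 0.0, 2.0, 2.0)
b = Box(1.0, 1.0, 3.0, 3.0)
print(intersects(a, b), intersection(a, b))
```

A Boolean expression can be used as a filter or join condition, whereas a binary expression yields a new shape-valued column.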

Custom Data Sources:

• Schema = [point, polyline, polygon, metadata]
• Metadata = Map[String, String]
• GeoJSON and Shapefile implementations
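As a rough illustration of the schema above, the sketch below turns a GeoJSON FeatureCollection into rows with `point`, `polyline`, `polygon` and a string-to-string `metadata` map. It uses only Python's standard `json` module. The function name `geojson_to_rows`, the choice to keep only a polygon's outer ring, and the sample document are assumptions made for this example, not Magellan's data source implementation.

```python
import json


def geojson_to_rows(text: str):
    """Map each GeoJSON feature to one row: [point, polyline, polygon, metadata]."""
    collection = json.loads(text)
    rows = []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        properties = feature.get("properties") or {}
        row = {
            "point": None,
            "polyline": None,
            "polygon": None,
            # Metadata is kept as a string-to-string map, as on the slide.
            "metadata": {str(k): str(v) for k, v in properties.items()},
        }
        kind = geometry["type"]
        if kind == "Point":
            row["point"] = tuple(geometry["coordinates"])
        elif kind == "LineString":
            row["polyline"] = [tuple(c) for c in geometry["coordinates"]]
        elif kind == "Polygon":
            # Simplification for this sketch: keep only the outer ring.
            row["polygon"] = [tuple(c) for c in geometry["coordinates"][0]]
        rows.append(row)
    return rows


# Made-up sample document for illustration.
sample = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"name": "pickup", "id": 7},
   "geometry": {"type": "Point", "coordinates": [-122.4, 37.6]}},
  {"type": "Feature", "properties": {"name": "zone"},
   "geometry": {"type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}}
]}
"""
rows = geojson_to_rows(sample)
print(rows[0])
```

Exactly one of the three shape columns is populated per row, so mixed geometry types can share a single schema.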

Custom Strategies for Spatial Join:

• Broadcast Cartesian Join
• Geohash Join (in progress)
• Plug into Catalyst as experimental strategies
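The idea behind a geohash-based join is to replace an all-pairs comparison with a lookup by grid cell. The sketch below implements the standard geohash encoding and groups points by cell. The function names `geohash_encode` and `bucket_by_cell` are hypothetical, and this is a plain Python illustration of the general technique, not the join strategy Magellan plugs into Catalyst.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits,
    starting with longitude, and emit one base-32 character per 5 bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def bucket_by_cell(points, precision: int = 5):
    """Group (lat, lon) points by geohash cell so that join candidates
    can be found by key lookup instead of comparing every pair."""
    buckets = defaultdict(list)
    for lat, lon in points:
        buckets[geohash_encode(lat, lon, precision)].append((lat, lon))
    return dict(buckets)


cells = bucket_by_cell([(57.64911, 10.40744), (57.64912, 10.40745), (-33.87, 151.21)])
print(sorted(cells))
```

Points in the same cell become join candidates that still need an exact geometric check. Shapes that straddle a cell boundary fall into more than one cell, which a real implementation has to handle.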

Page 14

Magellan in a nutshell

• Read Shapefiles/GeoJSON as DataSources:
  – sqlContext.read.format("magellan").load("$path")
  – sqlContext.read.format("magellan").option("type", "geojson").load("$path")

• Spatial Queries using Expressions
  – point(-122.5, 37.6) = Shape Literal
  – $"point" within $"polygon" = Boolean Expression
  – $"polygon1" intersection $"polygon2" = Binary Expression

• Joins using Catalyst + Spatial Optimizations
  – points.join(polygons).where($"point" within $"polygon")

Page 15

Where are we at?

Magellan 1.0.3 is out on Spark Packages, go give it a try!

• Scala support; Python support will be functional in 1.0.4 (needs Spark 1.5)
• GitHub: https://github.com/harsha2010/magellan
• Spark Packages: http://spark-packages.org/package/harsha2010/magellan
• Data Formats: ESRI Shapefile + metadata, GeoJSON
• Operators: Intersects, Contains, Within, Intersection
• Joins: Broadcast
• Blog: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
• Zeppelin Notebook Example: http://bit.ly/1kvwGjC

Page 16

What is next?

Magellan 1.0.4, expected release in December:

• Python support
• MultiPolygon (Polygon Collection), MultiLineString (PolyLine Collection)
• Spark 1.5, 1.6
• Spatial Join Optimization
• Map Matching Algorithms
• More Operators based on requirements
• Support for other common geospatial data formats (WKT, others?)

Page 17

Demo

• Reading Geospatial Formats
• Uber queries