geomesa locationtech dc

Anthony Fox Director, Data Science and System Architecture Commonwealth Computer Research, Inc [email protected]

Upload: ccrinc

Post on 26-Jan-2015

DESCRIPTION

GeoMesa presentation from the LocationTech Tour, DC, November 14, 2013. Presented by Anthony Fox (@algoriffic) of CCRi. GeoMesa is an open source project that adds spatio-temporal indexing, querying, and visualization capabilities to Accumulo. Learn more at http://geomesa.github.io/

TRANSCRIPT

Page 1: GeoMesa LocationTech DC

Anthony Fox
Director, Data Science and System Architecture
Commonwealth Computer Research, Inc.
[email protected]

Page 2: GeoMesa LocationTech DC

What is this talk about?

Indexing, querying, visualizing, and analyzing spatio-temporal data at scale.

Using open-source.

Page 3: GeoMesa LocationTech DC

Why?

Page 4: GeoMesa LocationTech DC

Why?

●  Volume of spatio-temporal data is increasing exponentially

●  Traditional multi-dimensional indexing techniques are straining to keep up

Page 5: GeoMesa LocationTech DC

How?

•  Storage - leverage distributed databases like Accumulo.

•  Compute - parallelize spatio-temporal queries and analytics using MapReduce.

GeoMesa enables geospatial analytics within the Hadoop ecosystem.

Page 6: GeoMesa LocationTech DC

What is GeoMesa?

•  A flexible spatio-temporal index built on Accumulo.

•  An implementation of GeoTools interfaces to make integration seamless.

•  A set of GeoServer plugins for OGC compliant access to data.

Page 7: GeoMesa LocationTech DC

Integration

Page 8: GeoMesa LocationTech DC

What is Accumulo?

“The Accumulo sorted distributed key/value store is a robust, high performance data storage and retrieval system” http://accumulo.apache.org

Page 9: GeoMesa LocationTech DC

What is Accumulo?

“The Accumulo sorted distributed key/value store is a robust, high performance data storage and retrieval system” http://accumulo.apache.org

•  Based on Google's BigTable design.

•  Adds cell-level security and a server-side programming model in the form of composable iterators.
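The idea of composable iterators over a sorted key/value store can be sketched in a few lines of Python. This is only an analogy, not Accumulo's Java iterator API: the function names (`sorted_scan`, `filtering`, `transforming`) and the toy table are hypothetical, and real Accumulo keys carry more parts (row, column family, qualifier, visibility, timestamp) than the single string used here.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

KeyValue = Tuple[str, str]  # simplified (key, value) pair; an assumption for this sketch


def sorted_scan(table: Dict[str, str], start: str, stop: str) -> Iterator[KeyValue]:
    """Yield entries with start <= key < stop in sorted key order."""
    for key in sorted(table):
        if start <= key < stop:
            yield key, table[key]


def filtering(source: Iterable[KeyValue],
              keep: Callable[[str, str], bool]) -> Iterator[KeyValue]:
    """Pass through only the entries accepted by the predicate."""
    return (kv for kv in source if keep(*kv))


def transforming(source: Iterable[KeyValue],
                 fn: Callable[[str], str]) -> Iterator[KeyValue]:
    """Rewrite each value while leaving keys (and their order) untouched."""
    return ((key, fn(value)) for key, value in source)


def run_scan() -> list:
    # Toy table with made-up contents, used only to show the stacking.
    table = {"a1": "x", "a2": "skip", "b1": "y", "c1": "z"}
    # Each layer wraps the one below it, so the stack is evaluated lazily
    # in a single pass over the scanned range.
    stack = transforming(
        filtering(sorted_scan(table, "a", "c"), lambda k, v: v != "skip"),
        str.upper,
    )
    return list(stack)
```

Because every layer consumes and produces the same sorted stream, layers can be stacked in any number, which is the property the slide refers to as "composable".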

Page 10: GeoMesa LocationTech DC

What is Accumulo?

“The Accumulo sorted distributed key/value store is a robust, high performance data storage and retrieval system” http://accumulo.apache.org

http://accumulo.apache.org/1.4/user_manual/Accumulo_Design.html

Page 12: GeoMesa LocationTech DC

How Do We Store Multi-Dimensional Data in a Dictionary?

•  Space Filling Curves project multiple dimensions into a single dimension

•  Base32 encoding induces an Accumulo friendly lexicographic ordering

•  Recursive nesting facilitates storing different resolutions of data

•  GeoHashes are common in web services

http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves
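The points on this slide can be illustrated with a minimal geohash encoder. It is a sketch of the standard public geohash scheme (alternating longitude and latitude bisection, five bits per Base32 character), not GeoMesa's own implementation; the function name `geohash_encode` is chosen here for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet, in ASCII order


def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Encode a point as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value = 0        # bits accumulated for the current character
    bit_count = 0
    use_lon = True   # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one Base32 character
            chars.append(BASE32[value])
            value = 0
            bit_count = 0
    return "".join(chars)
```

For example, `geohash_encode(57.64911, 10.40744, 3)` yields `"u4p"`. Longer hashes of the same point extend shorter ones, which is the recursive nesting mentioned above, and because the alphabet is in ASCII order, sorting the strings lexicographically follows the order of the underlying curve.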

Page 13: GeoMesa LocationTech DC

How Does GeoMesa’s Index Work?

Constructs a key beginning with a shard id for horizontal scalability.

Page 16: GeoMesa LocationTech DC

How Does GeoMesa’s Index Work?

Constructs a key beginning with a shard id for horizontal scalability.

Uses Space Filling Curves to encode spatio-temporal data in Accumulo keys.
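A rough sketch of such a key, assuming a hypothetical layout of `shard~geohash~hour~feature id`. The separator, the field order, the two-digit shard width, and the CRC-based shard assignment are all assumptions made for illustration; the slides do not spell out GeoMesa's actual key format.

```python
import zlib
from datetime import datetime


def shard_for(feature_id: str, num_shards: int) -> int:
    """Pick a shard deterministically from the feature id (hypothetical scheme)."""
    return zlib.crc32(feature_id.encode("utf-8")) % num_shards


def row_key(feature_id: str, geohash: str, when: datetime,
            num_shards: int = 100) -> str:
    """Build a row key: shard prefix, then the space-filling-curve cell, then time.

    Assumes num_shards <= 100 so the shard fits in two digits, and that
    feature ids do not contain the '~' separator.
    """
    shard = shard_for(feature_id, num_shards)
    return f"{shard:02d}~{geohash}~{when:%Y%m%d%H}~{feature_id}"
```

The shard prefix spreads writes for a single hot cell across many key ranges, and therefore across servers. The trade-off is that a query has to issue one range scan per shard, which is what allows the scan to run in parallel.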

Page 17: GeoMesa LocationTech DC

How Does GeoMesa’s Index Work?

Constructs a key beginning with a shard id for horizontal scalability.

Uses Space Filling Curves to encode spatio-temporal data in Accumulo keys.

Stacks server side iterators to apply (E)CQL standard queries in parallel at scan time.
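The two-phase pattern described here (a coarse range scan driven by the key, followed by an exact predicate evaluated next to the data, once per shard) can be sketched as follows. This is a Python analogy under stated assumptions: there is no (E)CQL parsing, the `Feature` type, the `hour` attribute, and the in-memory "shards" are hypothetical, and a thread pool stands in for the tablet servers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Feature:
    fid: str
    lon: float
    lat: float
    hour: int  # hypothetical time attribute, hours since some epoch


def bbox_during(min_lon: float, min_lat: float, max_lon: float, max_lat: float,
                start: int, end: int) -> Callable[[Feature], bool]:
    """Build an exact spatio-temporal predicate (stand-in for a BBOX + DURING filter)."""
    def predicate(f: Feature) -> bool:
        return (min_lon <= f.lon <= max_lon
                and min_lat <= f.lat <= max_lat
                and start <= f.hour < end)
    return predicate


def scan_shard(rows: List[Tuple[str, Feature]], key_prefix: str,
               predicate: Callable[[Feature], bool]) -> List[Feature]:
    """Coarse filter on the key prefix, then exact filter on the feature."""
    return [f for key, f in rows if key.startswith(key_prefix) and predicate(f)]


def query(shards: List[List[Tuple[str, Feature]]], key_prefix: str,
          predicate: Callable[[Feature], bool]) -> List[Feature]:
    """Run the same scan on every shard concurrently and merge the results."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda rows: scan_shard(rows, key_prefix, predicate),
                              shards))
    return sorted((f for part in parts for f in part), key=lambda f: f.fid)
```

Only features that survive both filters leave a shard, so the client receives exact results rather than every candidate in the scanned key ranges.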

Page 18: GeoMesa LocationTech DC

What is the GeoMesa Model?

Page 19: GeoMesa LocationTech DC

How Does GeoMesa Perform?

GDELT - Global Database of Events, Language, and Tone

Leetaru, Kalev and Schrodt, Philip. (2013). GDELT: Global Data on Events, Language, and Tone, 1979-2012. International Studies Association Annual Conference, April 2013. San Diego, CA. http://gdelt.utdallas.edu/about.html

•  220 million geocoded events from 1979 to the present.

•  Exhibits pathologies common in spatio-temporal data sets:

o  Hot spots

o  Bad geocoding

Page 20: GeoMesa LocationTech DC

GDELT

GDELT assigns an Event Code to each event.

Codes are based on CAMEO - Conflict and Mediation Event Observations.

There are 20 top level CAMEO codes.

John Beieler developed a visualization of every protest (one of the top level categories) on the planet since 1979.

http://www.foreignpolicy.com/articles/2013/08/22/mapped_what_every_protest_in_the_last_34_years_looks_like

Page 21: GeoMesa LocationTech DC

GDELT

http://geomesa.github.io/gdelt.html

Page 22: GeoMesa LocationTech DC

How?

Storage, Querying, Filtering

Aggregation and analysis

Visualization

Using Open Source

Page 23: GeoMesa LocationTech DC

Distributed Spatial Computations

●  Scalding greatly simplifies Map/Reduce

●  AccumuloSource is an implementation of a Scalding source/sink

●  GeoMesa allows developers to work with SimpleFeatures in a Map/Reduce job
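As a rough picture of the kind of job this enables, the sketch below counts point features per grid cell using explicit map, shuffle, and reduce phases. It is plain Python, not Scalding and not GeoMesa's `AccumuloSource`; the function names and the fixed-size grid are assumptions, and points are bare `(lon, lat)` tuples rather than SimpleFeatures.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Cell = Tuple[int, int]


def map_phase(points: Iterable[Tuple[float, float]],
              cell_size: float = 1.0) -> Iterator[Tuple[Cell, int]]:
    """Emit (cell, 1) for every point; the cell is the grid square containing it."""
    for lon, lat in points:
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        yield cell, 1


def shuffle(pairs: Iterable[Tuple[Cell, int]]) -> Dict[Cell, List[int]]:
    """Group emitted values by key, as the framework would between map and reduce."""
    groups: Dict[Cell, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[Cell, List[int]]) -> Dict[Cell, int]:
    """Sum the values for each cell."""
    return {cell: sum(values) for cell, values in groups.items()}


def count_by_cell(points: Iterable[Tuple[float, float]],
                  cell_size: float = 1.0) -> Dict[Cell, int]:
    return reduce_phase(shuffle(map_phase(points, cell_size)))
```

In a real cluster the map and reduce phases run on many machines and the framework performs the shuffle; a library such as Scalding mainly removes the boilerplate of wiring those phases together.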

Page 24: GeoMesa LocationTech DC

Performance

PostGIS 1000 responses in > 30 seconds

GeoMesa 1000 responses in < 1 second

Page 25: GeoMesa LocationTech DC

Roadmap

•  Enhance integration with cell level security

•  Build statistical index and query optimization

o  Bring Your Own Space Filling Curve

o  “VACUUM ANALYZE”

•  Integrate GeoWebCache and Hadoop

•  Ease developer on-ramping

•  Grow community through LocationTech