RTree Spatial Indexing with MongoDB - MongoDC

WHY WE CHOSE MONGODB TO PUT BIG-DATA ‘ON THE MAP’ JUNE 2012 @nknize +Nicholas Knize

Upload: nicholas-knize-phd-gisp

Post on 12-May-2015

DESCRIPTION

Presentation by Nicholas W. Knize for MongoDC describing a scalable R-Tree implementation extension for MongoDB

TRANSCRIPT

Page 1: RTree Spatial Indexing with MongoDB - MongoDC

WHY WE CHOSE MONGODB TO PUT BIG-DATA ‘ON THE MAP’

JUNE 2012

@nknize +Nicholas Knize

Page 2: RTree Spatial Indexing with MongoDB - MongoDC

“The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one location…this capability allows for unprecedented situational awareness and information sharing”

-Gen. Doug Fraser

TST PRODUCTS

ACCOMPLISHING THE IMPOSSIBLE

Page 3: RTree Spatial Indexing with MongoDB - MongoDC

ISPATIAL OVERVIEW

• Expose enterprise data in a geo-temporal user defined environment

• Provide a flexible and scalable spatial indexing framework for heterogeneous data

• Visualize spatially referenced data on 3D globe & 2D maps

• Manage real-time data feeds and mobile messaging

• View data over geo-rectified imagery with 3D terrain

• Support mission planning and simulation

• Provide real-time collaboration and sharing

Page 4: RTree Spatial Indexing with MongoDB - MongoDC

BIG DATA STORAGE CHARACTERISTICS

Desired Data Store Characteristics for ‘Big Data’

• Horizontally scalable – Large volume / elastic

• Vertically scalable – Heterogeneous data types (“Data Stack”)

• Smartly Distributed – Reduce the distance bits must travel

• Fault Tolerant – Replication Strategy and Consistency model

• High Availability – Node recovery

• Fast – Reads or writes (can’t always have both)

Page 5: RTree Spatial Indexing with MongoDB - MongoDC

NOSQL OPTIONS

Subset of Evaluated NoSQL Options

• Cassandra
  – Nice Bring Your Own Index (BYOI) design
  – … but Java, Java, Java… Memory management can be an issue
  – Adding new nodes can be a pain (Token Changes, nodetool)
  – Key-Value store…good for simple data models

• HBase
  – Nice BigTable model
  – Theory grounded heavily in C.A.P, inflexible trade-offs
  – Complicated setup and maintenance

• CouchDB
  – Provides some GeoSpatial functionality (Currently being rewritten)
  – HEAVILY dependent on Map-Reduce model (complicated design)
  – Erlang based – poor multi-threaded heap management

Page 6: RTree Spatial Indexing with MongoDB - MongoDC

WHY TST LIKES MONGODB

Why MongoDB for Thermopylae?

• Documents based on JSON – A GEOJSON match made in heaven!

• C++ - No Garbage Collection Overhead! Efficient memory management design reduces disk swapping and paging

• Disk storage is memory mapped, enabling fast swapping when necessary

• Built in auto-failover with replica sets and fast recovery with journaling

• Tunable Consistency – Consistency defined at application layer

• Schema Flexible – friendly properties of SQL enable easy port

• Provided initial spatial indexing support – Point based limited!

Page 7: RTree Spatial Indexing with MongoDB - MongoDC

MONGODB SPATIAL INDEXER

... The Spatial Indexer wasn’t quite right

• MongoDB (like nearly all relational DBs) uses a b-Tree
  – Data structure that keeps data sorted with logarithmic-time search, insert, and delete
  – Great for indexing numerical and text documents (1D attribute data)
  – Cannot store multi-dimension (≥2D) data – NOT COMPLEX GEOMETRY FRIENDLY

Page 8: RTree Spatial Indexing with MongoDB - MongoDC

DIMENSIONALITY REDUCTION

How does MongoDB solve the dimensionality problem?

• Space Filling (Z) Curve
  – A continuous line that intersects every point in a two-dimensional plane

• Use Geohash to represent lat/lon values
  – Interleave the bits of a lat/long pair
  – Base32 encode the result
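
The two Geohash steps on this slide can be sketched in a few lines. This is a minimal illustration of the public Geohash scheme, not MongoDB's internal indexing code; the function name `geohash_encode` and the `precision` parameter are assumptions made for the example.

```python
# Standard Geohash base32 alphabet (digits plus letters, omitting a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a lat/lon pair by interleaving bisection bits, then base32."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # Geohash interleaving starts with a longitude bit
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon

    # Pack each group of 5 bits into one base32 character.
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, 6))
```

Because the hash is a plain string, it sorts in a b-Tree like any other 1D key, which is what makes this reduction attractive for point data.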

Page 9: RTree Spatial Indexing with MongoDB - MongoDC

GEOHASH BTREE ISSUES

Issues with the Geohash b-Tree approach

• Neighbors aren’t so close!
  – Neighboring points on the Geoid may end up on opposite ends of the plane
  – Impacts search efficiency

• What about Geometry?
  – Doesn’t support > 2D
  – Mongo uses Multi-Location documents, which really just index multiple points that link back to a single document
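
The "neighbors aren't so close" problem shows up even on a tiny grid. The sketch below assumes an 8x8 grid of integer cells and a helper named `z_value` (both invented for the illustration); it interleaves the bits of the cell coordinates the same way a Z curve does.

```python
def z_value(x: int, y: int, bits: int = 3) -> int:
    """Interleave the bits of x and y (x in even positions, y in odd)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


if __name__ == "__main__":
    # Cells (3, 3) and (4, 4) touch at a corner, yet their positions on the
    # curve are far apart because they straddle the grid's midlines.
    print(z_value(3, 3), z_value(4, 4))  # 15 48
    # Side-by-side cells across the vertical midline also jump.
    print(z_value(3, 0), z_value(4, 0))  # 5 16
```

A range scan over the 1D key therefore has to visit several disjoint key ranges to cover one small spatial window, which is the search-efficiency cost the slide refers to.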

Page 10: RTree Spatial Indexing with MongoDB - MongoDC

MULTI-LOCATION CLIPPING

Mongo Multi-location Document Clipping Issues ($within search doesn’t always work w/ multi-location)

[Figure: four cases (Case 1 to Case 4) of a Multi-Location Document (aka. Polygon) against a Search Polygon; two cases are marked "Success!" and two are marked "Fail!"]
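
One way such a failure can arise is sketched below, under a simplified model that is an assumption of this example: the polygon is indexed only as its vertex points, and a search matches when some indexed point lies inside the search box. The helper names are hypothetical and this is not MongoDB code.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def point_in_box(p: Point, box: Box) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def any_vertex_in_box(vertices: List[Point], box: Box) -> bool:
    """Point-only model: match if any stored vertex falls in the box."""
    return any(point_in_box(v, box) for v in vertices)


def mbr(vertices: List[Point]) -> Box:
    """Minimum bounding rectangle of a vertex list."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


if __name__ == "__main__":
    polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]  # large square
    search = (4, 4, 6, 6)                            # small box inside it
    print(any_vertex_in_box(polygon, search))        # False: search is missed
    print(boxes_intersect(mbr(polygon), search))     # True: MBR test finds it
```

The bounding-rectangle test can report candidates whose exact geometry does not intersect, so a real index would still refine the hits against the true shape.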

Page 11: RTree Spatial Indexing with MongoDB - MongoDC

SOLUTIONS TO GEOHASH PROBLEM

Potential Solutions

• Constrain the system to single point searches
  – Multi-dimension support will be exponentially complex (won’t scale)

• Interpolate points along the edge of the shape
  – Multi-dimension support will be exponentially complex (won’t scale)

• Customize the spatial indexer
  – Selected approach

Page 12: RTree Spatial Indexing with MongoDB - MongoDC

CUSTOM TUNED SPATIAL INDEXER

Thermopylae Custom Tuned MongoDB for Geo

TST Leverages Guttman’s 1984 Research in R/R* Trees

• R-Trees organize any-dimensional data by representing the data as a minimum bounding box.

• Each node bounds its children. A node can have many objects in it (max: m, min: ceil(m/2)).

• Splits and merges optimized by minimizing overlaps

• The leaves point to the actual objects (stored on disk probably)

• Height balanced – tree height is O(log n), so a search that meets little MBR overlap visits O(log n) nodes
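
The "each node bounds its children" rule and the fill limits can be sketched as follows. The 2D tuple layout and the names `bound`, `min_fill`, and `is_valid_fill` are assumptions made for illustration, not part of the TST implementation.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bound(children: List[Rect]) -> Rect:
    """Smallest rectangle that encloses every child rectangle."""
    return (
        min(c[0] for c in children),
        min(c[1] for c in children),
        max(c[2] for c in children),
        max(c[3] for c in children),
    )


def min_fill(m: int) -> int:
    """Minimum entries per node for a maximum of m, as stated on the slide."""
    return math.ceil(m / 2)


def is_valid_fill(entry_count: int, m: int) -> bool:
    """A non-root node should hold between ceil(m/2) and m entries."""
    return min_fill(m) <= entry_count <= m


if __name__ == "__main__":
    print(bound([(0, 0, 1, 1), (2, 3, 4, 5)]))  # (0, 0, 4, 5)
    print(min_fill(5))                          # 3
```

A node that would exceed `m` entries is split, and one that falls below the minimum is merged or has its entries reinserted, which is where the overlap-minimizing heuristics come in.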

Page 13: RTree Spatial Indexing with MongoDB - MongoDC

RTREE THEORY

Spatial Indexing at Scale with R-Trees

Spatial data represented as minimum bounding rectangles (2-dimension), cubes (3-dimension), hexadecant (4-dimension)

Index represented as: <I, DiskLoc> where:

I = (I0, I1, … In) : n = number of dimensions

Each I is a set in the form of [min,max] describing MBR range along a dimension
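
A possible in-memory shape for the <I, DiskLoc> entry is sketched below. The class name `IndexEntry`, the use of an integer for `disk_loc`, and the `intersects` method are all assumptions of this example; the slides do not show the actual record layout.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[float, float]  # [min, max] along one dimension


@dataclass(frozen=True)
class IndexEntry:
    intervals: Tuple[Interval, ...]  # one interval per dimension (the "I")
    disk_loc: int                    # hypothetical stand-in for DiskLoc

    def intersects(self, other: "IndexEntry") -> bool:
        """Two MBRs overlap only if their ranges overlap in every dimension."""
        return all(
            a_min <= b_max and b_min <= a_max
            for (a_min, a_max), (b_min, b_max) in zip(self.intervals,
                                                      other.intervals)
        )


if __name__ == "__main__":
    # 3D example: x, y, and altitude ranges.
    a = IndexEntry(((0, 10), (0, 10), (0, 100)), disk_loc=1)
    b = IndexEntry(((5, 15), (5, 15), (50, 60)), disk_loc=2)
    c = IndexEntry(((5, 15), (5, 15), (200, 300)), disk_loc=3)
    print(a.intersects(b))  # True
    print(a.intersects(c))  # False: disjoint along the third dimension
```

The same test works unchanged for a fourth interval such as a time range, which is the appeal of the per-dimension representation.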

Page 14: RTree Spatial Indexing with MongoDB - MongoDC

R*-TREE INDEX OBJECTIVES

R*-Tree Spatial Index Example

• Sample insertion result for 4th order tree

• Objectives:
  1. Minimize area
  2. Minimize overlaps
  3. Minimize margins
  4. Maximize inner node utilization

[Figure: example tree with nodes m, n, o, p above entries a through l]
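
The first three objectives are each a simple geometric quantity. The sketch below computes them for 2D rectangles; the tuple layout and function names are assumptions for the example, and margin is taken to be the rectangle's perimeter.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def margin(r: Rect) -> float:
    """Perimeter of the rectangle; small margins favor square-ish nodes."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def overlap(a: Rect, b: Rect) -> float:
    """Area shared by two rectangles, or 0 if they do not overlap."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


if __name__ == "__main__":
    print(area((0, 0, 2, 3)))                   # 6
    print(margin((0, 0, 2, 3)))                 # 10
    print(overlap((0, 0, 2, 2), (1, 1, 3, 3)))  # 1
```

A split heuristic can score candidate splits with these three numbers and keep the candidate with the lowest score.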

Page 15: RTree Spatial Indexing with MongoDB - MongoDC

R*-TREE INSERT EXAMPLE

Insert

• Similar to insertion into B+-tree but may insert into any leaf; leaf splits in case capacity exceeded.
  – Which leaf to insert into?
  – How to split a node?

Page 16: RTree Spatial Indexing with MongoDB - MongoDC

Insert—Leaf Selection

• Follow a path from root to leaf.

• At each node move into subtree whose MBR area increases least with addition of new rectangle.

[Figure: MBRs m, n, o, p]

Modified from Dr. Sahni (UoF Advanced Data Structures)
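
The leaf-selection rule can be written as a small function. This is a minimal sketch of the area-enlargement rule as stated on the slide, with ties broken by smaller area (an assumption borrowed from Guttman's original algorithm); the names and the 2D tuple layout are illustrative.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def enlargement(mbr: Rect, rect: Rect) -> float:
    """Extra area the MBR would need in order to also cover rect."""
    return area(union(mbr, rect)) - area(mbr)


def choose_subtree(children: List[Rect], rect: Rect) -> int:
    """Index of the child whose MBR grows least; ties go to smaller area."""
    return min(
        range(len(children)),
        key=lambda i: (enlargement(children[i], rect), area(children[i])),
    )


if __name__ == "__main__":
    m, n = (0, 0, 4, 4), (6, 0, 10, 4)
    new_rect = (7, 1, 8, 2)                 # already inside n
    print(choose_subtree([m, n], new_rect))  # 1
```

Applying `choose_subtree` at every level from the root down yields the leaf that receives the new rectangle.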

Page 17: RTree Spatial Indexing with MongoDB - MongoDC

Insert—Leaf Selection

• Insert into m.


Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 18: RTree Spatial Indexing with MongoDB - MongoDC

Insert—Leaf Selection

• Insert into n.


Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 19: RTree Spatial Indexing with MongoDB - MongoDC

Insert—Leaf Selection

• Insert into o.


Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 20: RTree Spatial Indexing with MongoDB - MongoDC

Insert—Leaf Selection

• Insert into p.


Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 21: RTree Spatial Indexing with MongoDB - MongoDC

Query

• Start at root

• Find all overlapping MBRs

• Search subtrees recursively

[Figure: query over MBRs m, n, o, p and the example tree (entries a through l)]

Modified from Dr. Sahni (UoF Advanced Data Structures)
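
The three query steps map directly onto a recursive function. The `Node` class below is a hypothetical in-memory structure used only for this sketch; it is not the on-disk layout of the MongoDB extension.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)
    obj: Optional[str] = None  # set only on leaf entries


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node: Node, query: Rect) -> List[str]:
    """Return every stored object whose MBR overlaps the query."""
    if not intersects(node.mbr, query):
        return []                 # prune this whole subtree
    if node.obj is not None:
        return [node.obj]         # leaf entry
    hits: List[str] = []
    for child in node.children:   # recurse into overlapping subtrees
        hits.extend(search(child, query))
    return hits


if __name__ == "__main__":
    m = Node((0, 0, 4, 4), [Node((0, 0, 1, 1), obj="a"),
                            Node((3, 3, 4, 4), obj="b")])
    n = Node((6, 0, 10, 4), [Node((6, 0, 7, 1), obj="e")])
    root = Node((0, 0, 10, 4), [m, n])
    print(search(root, (2, 2, 5, 5)))  # ['b']
```

Unlike a b-Tree lookup, more than one subtree may overlap the query, so the cost depends on how well the insert and split heuristics kept MBR overlap down.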

Page 22: RTree Spatial Indexing with MongoDB - MongoDC

Query

• Search m.

[Figure: search of subtree m in the example tree]

Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 23: RTree Spatial Indexing with MongoDB - MongoDC

R*-Tree Leverages B-Tree Base Data Structures (buckets)

R*-TREE MONGODB IMPLEMENTATION


Page 24: RTree Spatial Indexing with MongoDB - MongoDC

GEO-SHARDING

Geo-Sharding – (in work)

Scalable Distributed R* Tree (SD-r*Tree)

“Balanced” binary tree, with nodes distributed on a set of servers:

• Each internal node has exactly two children

• Each leaf node stores a subset of the indexed dataset

• At each node, the heights of the subtrees differ by at most one

• mongos “routing” node maintains binary tree
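
The structural rules above can be sketched as a small binary tree with a balance check and a routing function. Everything here is illustrative: the class names, the `route` helper, and the example coverage rectangles are assumptions, and the sketch runs in one process rather than across servers.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class DataNode:
    """Leaf: holds a subset of the indexed data (a chunk)."""
    name: str
    mbr: Rect


@dataclass
class CoverageNode:
    """Internal node: exactly two children, covers both of them."""
    name: str
    left: "Tree"
    right: "Tree"

    @property
    def mbr(self) -> Rect:
        a, b = self.left.mbr, self.right.mbr
        return (min(a[0], b[0]), min(a[1], b[1]),
                max(a[2], b[2]), max(a[3], b[3]))


Tree = Union[DataNode, CoverageNode]


def height(node: Tree) -> int:
    if isinstance(node, DataNode):
        return 0
    return 1 + max(height(node.left), height(node.right))


def is_balanced(node: Tree) -> bool:
    """Subtree heights differ by at most one at every internal node."""
    if isinstance(node, DataNode):
        return True
    return (abs(height(node.left) - height(node.right)) <= 1
            and is_balanced(node.left) and is_balanced(node.right))


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def route(node: Tree, query: Rect) -> List[str]:
    """Names of the data nodes a query must be sent to."""
    if not intersects(node.mbr, query):
        return []
    if isinstance(node, DataNode):
        return [node.name]
    return route(node.left, query) + route(node.right, query)


if __name__ == "__main__":
    d0 = DataNode("d0", (0, 0, 5, 10))
    d1 = DataNode("d1", (5, 0, 10, 5))
    d2 = DataNode("d2", (5, 5, 10, 10))
    r1 = CoverageNode("r1", d0, CoverageNode("r2", d1, d2))
    print(is_balanced(r1))             # True
    print(route(r1, (6, 6, 7, 7)))     # ['d2']
```

In a sharded deployment the routing step would run on the node that holds the tree, so only the shards whose coverage overlaps the query receive it.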

Page 25: RTree Spatial Indexing with MongoDB - MongoDC

SD-r*Tree DATA STRUCTURE

SD-r*Tree Data Structure Illustration

[Figure: SD-r*Tree growth illustration showing data nodes d0, d1, d2, coverage nodes r1, r2, and the spatial coverage of each data node (objects a through e)]

• di = Data Node (Chunk)

• ri = Coverage Node

Leveraged work from Litwin, Mouza, Rigaux 2007

Page 26: RTree Spatial Indexing with MongoDB - MongoDC

SD-r*TREE STRUCTURE DISTRIBUTION

SD-r*Tree Structure Distribution

[Figure: the tree's data nodes (d0, d1, d2) and coverage nodes (r1, r2) distributed across GeoShard 1, GeoShard 2, and GeoShard 3, with mongos as the routing node]

Page 27: RTree Spatial Indexing with MongoDB - MongoDC

GeoSharding Alternative – 3D / 4D Hilbert Scanning Order

GEO-SHARDING ALTERNATIVE


Page 28: RTree Spatial Indexing with MongoDB - MongoDC

BEYOND 4-DIMENSIONS

Next Steps: Beyond 4-Dimensions - X-Tree (Berchtold, Keim, Kriegel – 1996)

[Figure: X-Tree structure showing Normal Internal Nodes, Supernodes, and Data Nodes]

• Avoid MBR overlaps

• Avoid node splits (main cause for high overlap)

• Introduce new node structure: Supernodes – Large Directory nodes of variable size

Page 29: RTree Spatial Indexing with MongoDB - MongoDC

X-TREE PERFORMANCE

X-Tree Performance Results (Berchtold, Keim, Kriegel – 1996)

Page 30: RTree Spatial Indexing with MongoDB - MongoDC

CONCLUSION

T-Sciences Custom Tuned Spatial Indexer

• Optimized Spatial Search
  – Finds intersecting MBR and recurses into those nodes

• Optimized Spatial Inserts
  – Uses the Hilbert Value of MBR centroid to guide search
  – 28% reduction in number of nodes touched

• Optimized Deletes
  – Leverages R* split/merge approach for rebalancing tree when nodes become over/under-full

• Low maintenance
  – Leverages MongoDB’s automatic data compaction and partitioning
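
The "Hilbert Value of MBR centroid" can be computed with the standard 2D Hilbert-curve index conversion. The sketch below is an assumption-laden illustration: the grid resolution `n`, the `world` extent, and both function names are invented for the example, and the slides do not say how the value is then used to guide the insert.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two; x and y must be in range(n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def mbr_hilbert_value(mbr: Rect, world: Rect, n: int = 256) -> int:
    """Hilbert index of the grid cell containing the MBR's centroid."""
    cx = (mbr[0] + mbr[2]) / 2
    cy = (mbr[1] + mbr[3]) / 2
    gx = min(n - 1, int((cx - world[0]) / (world[2] - world[0]) * n))
    gy = min(n - 1, int((cy - world[1]) / (world[3] - world[1]) * n))
    return hilbert_index(n, gx, gy)


if __name__ == "__main__":
    world = (-180.0, -90.0, 180.0, 90.0)
    print(mbr_hilbert_value((-77.1, 38.8, -76.9, 39.0), world))
```

Unlike the Z curve, consecutive Hilbert positions are always adjacent grid cells, so entries with nearby Hilbert values tend to sit close together in space.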

Page 31: RTree Spatial Indexing with MongoDB - MongoDC

Example Use Case – OSINT (Foursquare Data)

• Sample Foursquare data set mashed with Government Intel Data (poly reports)

• 100 million Geo Document test (3D points and polys)

• 4 server replica set

• ~350ms query response

• ~300% improvement over PostGIS

EXAMPLE


Page 32: RTree Spatial Indexing with MongoDB - MongoDC

FIND US

Community Support

• Thermopylae contributes fixes to the codebase
  – http://github.com/mongodb

• TST will work with 10gen to fold into the baseline

• Active developer collaboration
  – IRC: #mongodb freenode.net

Page 33: RTree Spatial Indexing with MongoDB - MongoDC

THANK YOU

Questions?

Nicholas [email protected]

Page 34: RTree Spatial Indexing with MongoDB - MongoDC

Backup

Page 35: RTree Spatial Indexing with MongoDB - MongoDC

WHO ARE THESE GUYS?

Thermopylae Sciences & Technology – Who are we?

• Advanced technology w/ 160+ employees

• Core customers in national security, venues and events, military and police, and city planning

• Partnered with Google and imagery providers

• Long term relationship focused – TS/SCI Staff

TST + 10gen + Google = Game-changing approach

Page 36: RTree Spatial Indexing with MongoDB - MongoDC

GOVERNMENT CUSTOMERS

Key Customers - Government

• US Dept of State Bureau of Diplomatic Security
  – Build and support 30 TB Google Earth Globe with multi-terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework.

• US Army Intelligence Security Command
  – Provide expertise in managing technology integration – prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework.

• US Southern Command
  – Coordinate Intelligence management systems spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.
  – Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)

Page 37: RTree Spatial Indexing with MongoDB - MongoDC

COMMERCIAL CUSTOMERS

Key Customers - Commercial

• Cleveland Cavaliers

• USGIF

• Las Vegas Motor Speedway

• Baltimore Grand Prix

iSpatial framework serves thousands of mobile devices