2014 11 lucene spatial temporal update
DESCRIPTION
About recent spatial/geospatial support work within Lucene/Solr; it affects ElasticSearch too.
TRANSCRIPT
The Latest in Spatial & Temporal Search
David Smiley
Agenda
Spatial
• Polygons and Accuracy: SerializedDVStrategy
• FlexPrefixTree
• BBoxSpatialStrategy
• Student/Intern contributions, Geodesics
Temporal
• Dates, and Date Ranges
• Search
• Faceting
About David Smiley
• Freelance search consultant / developer
• Expert Lucene/Solr development skills, advice (consulting), training
• Java (full-stack), Web, Spatial
• Apache Lucene/Solr committer & PMC member, Eclipse LocationTech PMC member
• Authored 1st book on Solr, plus two editions
• Presented at several conferences & meetups
• Taught several Solr classes, self-developed & LucidWorks
Lucene Spatial Overview
• Multiple approaches to index spatial data
abstract class SpatialStrategy
(5+ concrete implementations)
• RecursivePrefixTreeStrategy (RPT) is most prominent, versatile
• Grid based
• Uses Spatial4j lib for shapes, distance calculations, and WKT
• Uses JTS Topology Suite lib for polygons
[Diagram: Shape; SpatialPrefixTree / Cell (Geohash | Quad); PrefixTreeStrategy; IntersectsPrefixTreeFilter, Contains…, Within…]
SpatialPrefixTrees and Accuracy
RecursivePrefixTree (RPT) uses Lucene’s index as a PrefixTree
• Thus represents shapes as grid cells of varying precision by prefix
Example, a point shape:
• D, DR, DRT, DRT2, DRT2Y
Example, a polygon shape:
• Too many to list… 508 cells
More details here: http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/
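The point example above can be sketched in a few lines: the terms indexed for a point are simply the prefixes of its finest cell token, one per level. This is an illustration only, not Lucene's code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the grid-cell terms for a point, coarsest first.
class PrefixCells {
    /** Returns every prefix of a full-precision cell token, shortest first. */
    static List<String> prefixes(String token) {
        List<String> cells = new ArrayList<>();
        for (int len = 1; len <= token.length(); len++) {
            cells.add(token.substring(0, len));
        }
        return cells;
    }

    public static void main(String[] args) {
        List<String> cells = prefixes("DRT2Y");
        System.out.println(cells); // [D, DR, DRT, DRT2, DRT2Y]
        // A coarser query cell matches the point because it is one of the indexed terms.
        System.out.println(cells.contains("DRT")); // true
    }
}
```

One term per level is why points scale linearly with the number of levels.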
…continued
• For more accuracy, index more levels (longer prefixes)
• Points: linear relationship of levels to number of cells
• Non-points: exponential relationship…
RPT applies a distErrPct shape size ratio to non-point shapes to trade accuracy for scalability
• distErrPct=0.025 (2.5% of the radius, the default):
• Massachusetts: level 6
• USA: level 4 (not as precise)
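A rough sketch of how a size ratio like distErrPct can turn into a grid level: the allowed error is the ratio times the shape's radius, and the shape is indexed at the shallowest level whose cells are no larger than that. The grid here is a hypothetical quad-style grid 360 degrees wide that halves the cell width at each level, so its level numbers will not match the geohash levels quoted above. The names are made up for illustration, and this is not Lucene's implementation.

```java
// Hypothetical sketch: larger shapes tolerate larger error, so they get shallower levels.
class DistErrLevel {
    /** Cell width in degrees at a level of the assumed grid (width halves per level). */
    static double cellWidth(int level) {
        return 360.0 / Math.pow(2, level);
    }

    /** Shallowest level whose cells are no wider than radius * distErrPct. */
    static int levelForError(double shapeRadiusDeg, double distErrPct, int maxLevels) {
        double allowedError = shapeRadiusDeg * distErrPct;
        for (int level = 1; level < maxLevels; level++) {
            if (cellWidth(level) <= allowedError) {
                return level;
            }
        }
        return maxLevels;
    }

    public static void main(String[] args) {
        System.out.println(levelForError(1.0, 0.025, 30));  // small shape: 14
        System.out.println(levelForError(20.0, 0.025, 30)); // large shape: 10
    }
}
```

The larger shape stops at a coarser level, which mirrors the Massachusetts vs. USA comparison.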
SerializedDVStrategy (Lucene 4.7)
• Stores serialized geometry into Lucene BinaryDocValues
• It’s as accurate as the underlying geometry coordinates/shape
• But it’s not a spatial index – it’s retrievable on a per-document basis
• Use RPT + SerializedDV for speed and accuracy!
• More to come eventually:
• Solr adapter – SOLR-5728, ElasticSearch adapter #2361
• Speed: Skip the serialized geometry check for non-edge cells – LUCENE-5579
Sample Code
// Coarse, fast match on the grid index; exact check against the stored geometry.
SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects, point);
SpatialStrategy treeStrategy = new RecursivePrefixTreeStrategy(grid, "geometry");
SpatialStrategy verifyStrategy = new SerializedDVStrategy(ctx, "serialized_geometry");
Query treeQuery = new ConstantScoreQuery(treeStrategy.makeFilter(args));
// QUERY_FIRST: run the cheap tree query first, then verify only its matches.
Query combinedQuery = new FilteredQuery(treeQuery,
    verifyStrategy.makeFilter(args),
    FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);
Code is from a related presentation by the Climate Corporation, given at FOSS4G 2014.
FlexPrefixTree (Coming to Lucene 5)
• A new SpatialPrefixTree by Varun Shenoy (GSoC 2014)!
• LUCENE-4922; still needs to be committed. Goal is for 5.0.
• More optimized and more flexible than Geohash & Quad
• Configurable sub-cells at each level: 4, 16, 64, 256
• You choose trade-off between index speed/disk size & search speed
• Internally uses an integer coordinate system
• Rectangle searches are particularly fast; minimal floating-point conversion
• Cells are always squares (equal sides) – better for heatmaps
• YMMV: 10% - 100% faster than GeohashPrefixTree
BBoxSpatialStrategy (Lucene 4.10)
• Rectangles (BBox’s) only, one value per field
• Wide predicate support
• Equals, Intersects, Within, Contains, Disjoint
• Accurate (8-byte double floating point)
• Area overlap relevancy
• Weight search results by a combination of query shape overlap & index shape overlap ratios
• Solr BBoxField…
Solr BBoxField
• Schema configuration
<field name="bbox" type="bbox" />
<fieldType name="bbox" class="solr.BBoxField"
    geo="true" units="degrees" numberType="_bbox_coord" />
<fieldType name="_bbox_coord" class="solr.TrieDoubleField"
    precisionStep="8" docValues="true" stored="false"/>
• Search with overlap ratio ordering
&q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))
• score can be: overlapRatio, area, area2D
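To make "area overlap relevancy" concrete, here is a minimal planar sketch of an overlap-ratio score: the intersection area as a fraction of the query box and as a fraction of the indexed box, blended by a weight. It ignores dateline wrapping and geodetic area. The class, the `queryWeight` parameter, and the array layout are assumptions for illustration, not the BBoxField implementation.

```java
// Hypothetical sketch of an overlap-ratio score between two rectangles.
// Rectangles are {minX, maxX, minY, maxY} in planar coordinates.
class OverlapRatio {
    /** Area of a rectangle; 0 if it is empty (min beyond max). */
    static double area(double minX, double maxX, double minY, double maxY) {
        return Math.max(0, maxX - minX) * Math.max(0, maxY - minY);
    }

    /** Blend of (intersection / query area) and (intersection / indexed area). */
    static double score(double[] query, double[] indexed, double queryWeight) {
        double inter = area(Math.max(query[0], indexed[0]), Math.min(query[1], indexed[1]),
                            Math.max(query[2], indexed[2]), Math.min(query[3], indexed[3]));
        if (inter == 0) {
            return 0; // disjoint or degenerate: no overlap to score
        }
        double queryRatio = inter / area(query[0], query[1], query[2], query[3]);
        double indexedRatio = inter / area(indexed[0], indexed[1], indexed[2], indexed[3]);
        return queryWeight * queryRatio + (1 - queryWeight) * indexedRatio;
    }

    public static void main(String[] args) {
        double[] query = {0, 10, 0, 10};
        System.out.println(score(query, new double[] {0, 10, 0, 10}, 0.5)); // identical boxes
        System.out.println(score(query, new double[] {0, 5, 0, 5}, 0.5));   // small box inside query
    }
}
```

A box identical to the query scores highest; a tiny box inside the query covers all of itself but little of the query, so it lands in between.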
Recent Student/Intern Contributions
• Varun Shenoy via GSOC: summer 2014
• Lucene spatial: new “FlexPrefixTree” – an optimized grid
• Rebecca Alford via F.B. Open-Academy: winter 2014
• Spatial4j: geodesic polygons
• Chris Pavlicek via F.B. Open-Academy: winter 2014
• Spatial4j: geodesic buffered lines
• Evana Gizzi, MITRE intern: winter 2014
• Spatial4j: geodesic circle polygonizer
• Liviy Ambrose, MITRE intern: fall 2013
• Lucene spatial: integrated with Lucene’s benchmark module
Temporal/Date Durations
or basically any numeric ranges
Approach: Simple Two-field
(as you might do in SQL or any system without native range types)
• A start-time & end-time field pair
• A search window (time span) becomes two range queries
• details vary by predicate (Intersects, Contains, vs. Within)
• Single-valued only
• …even though Lucene supports multi-valued fields
• Theoretically possible but would be a lot of work
• because Lucene doesn’t store “position” info for numeric fields
• because numeric range/prefix queries are position-less
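The "two range queries" for each predicate can be written out as plain comparisons: one condition on the start-time field and one on the end-time field. This is a sketch with hypothetical names, using inclusive bounds and an indexed range of [start, end]. In a real index each comparison would be a numeric range query on the corresponding field.

```java
// Hypothetical sketch: each predicate is one condition on start and one on end.
class TwoFieldRange {
    /** Indexed range shares at least one instant with the query window. */
    static boolean intersects(long start, long end, long qStart, long qEnd) {
        return start <= qEnd && end >= qStart;
    }

    /** Indexed range fully covers the query window. */
    static boolean contains(long start, long end, long qStart, long qEnd) {
        return start <= qStart && end >= qEnd;
    }

    /** Indexed range lies entirely inside the query window. */
    static boolean within(long start, long end, long qStart, long qEnd) {
        return start >= qStart && end <= qEnd;
    }

    public static void main(String[] args) {
        // Indexed range 1990..1995 against three different query windows.
        System.out.println(intersects(1990, 1995, 1993, 2000)); // true
        System.out.println(contains(1990, 1995, 1991, 1994));   // true
        System.out.println(within(1990, 1995, 1980, 2000));     // true
    }
}
```

With multi-valued fields, the start condition could match one range and the end condition another, which is why this approach is single-valued only.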
Approach: 2D Spatial PrefixTree
• Lucene Spatial QuadPrefixTree (2D) with RPT Strategy
• Use ‘x’ for start-time, ‘y’ for end-time
• A search window (time span) becomes a rectangle query
• details vary by predicate (Intersects, Contains, vs. Within)
• Cool…
• But floating-point edge issues
• Only ~50 levels supported; not 64
Details: http://wiki.apache.org/solr/SpatialForTimeDurations
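A sketch of how a time window could become a rectangle under this mapping: each indexed duration is the point (start, end), and each predicate picks a different rectangle in that plane. The `min` and `max` world bounds, the class, and the method names are assumptions for illustration; this is not code from Lucene.

```java
// Hypothetical sketch: durations are points (x = start, y = end);
// query rectangles are {minX, maxX, minY, maxY}.
class DurationAsPoint {
    /** Intersects: start <= qEnd and end >= qStart. */
    static double[] intersectsRect(double qStart, double qEnd, double min, double max) {
        return new double[] {min, qEnd, qStart, max};
    }

    /** Contains: start <= qStart and end >= qEnd. */
    static double[] containsRect(double qStart, double qEnd, double min, double max) {
        return new double[] {min, qStart, qEnd, max};
    }

    /** Within: both start and end fall inside the query window. */
    static double[] withinRect(double qStart, double qEnd) {
        return new double[] {qStart, qEnd, qStart, qEnd};
    }

    /** True if the point (start, end) lies in the rectangle, bounds inclusive. */
    static boolean matches(double[] rect, double start, double end) {
        return start >= rect[0] && start <= rect[1] && end >= rect[2] && end <= rect[3];
    }

    public static void main(String[] args) {
        double[] rect = intersectsRect(1993, 2000, 0, 3000);
        System.out.println(matches(rect, 1990, 1995)); // true: 1990..1995 overlaps 1993..2000
    }
}
```

Because each duration is a single point, multiple durations per document stay distinct, unlike the two-field approach.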
Approach: DateRangePrefixTree (Lucene 5)
• A new 1D SpatialPrefixTree: NumberRangePrefixTree
• NumberRangePrefixTree w/ DateRangePrefixTree subclass
• NR-SPT: Configurable sub-cells per level; no level limit
• Not just for ranges; instances too
• Index/Search with NumberRangePrefixTreeStrategy
• Indexing, and search predicate code (e.g. Intersects…) completely re-used
• DateRangePrefixTree
• 9 Levels: 1M years, 1K years, years, months, days, hours, minutes, seconds, millis
…continued…
Trade-offs of N/D-SPT
• Indexing:
• “Common” date ranges use fewer than ~50 terms, but random millisecond ranges use up to ~14K terms
• A date instance (not a range) uses at most 9 terms
• Comparison to 2D SPT: always 50 terms, whether instance or range
• Search:
• Query for “common” query ranges faster than uncommon
• Comparison to 2D SPT:
• Contains & Within predicates: overlapping values per document get coalesced, can’t be differentiated
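The "at most 9 terms" figure for a date instance follows from the level list: an instance needs one cell per level, down to its own precision. The sketch below illustrates that counting only. The token format and the way a year is split across the "1M years", "1K years" and "years" levels are assumptions, not the actual DateRangePrefixTree encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one cell per level, each extending its parent's token.
class DateInstanceCells {
    // Level names as listed on the slide, coarsest first.
    static final String[] LEVELS = {
        "1M years", "1K years", "years", "months", "days",
        "hours", "minutes", "seconds", "millis"
    };

    /** fields: the value at each level, coarsest first; its length is the precision. */
    static List<String> cellsFor(int... fields) {
        if (fields.length < 1 || fields.length > LEVELS.length) {
            throw new IllegalArgumentException("need 1 to " + LEVELS.length + " levels");
        }
        List<String> cells = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                path.append('/');
            }
            path.append(fields[i]);
            cells.add(path.toString());
        }
        return cells;
    }

    public static void main(String[] args) {
        // 2014-05-21T12, with the year assumed to split as 0 / 2 / 14.
        System.out.println(cellsFor(0, 2, 14, 5, 21, 12).size()); // 6
    }
}
```

A value truncated to the hour stops at the sixth level, so it needs fewer terms than a full millisecond timestamp.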
Solr DateRangeField
• Configuration in schema.xml:
<field name="dateRange" type="dateRange" />
<fieldType name="dateRange" class="solr.DateRangeField" />
• Index field data, examples:
• 2014-05-21T12:00:00.000Z (same as TrieDate)
• 2014-05-21T12 (truncated to desired precision)
• [1990 TO 1995]
• Query, examples:
• fq=dateRange:[* TO 2014-05-21]
• fq={!field f=dateRange op=Contains}[2000 TO 2014-05-21]
Date Faceting
• Option A: facet.range
• Not for indexed date-ranges
• Internally executes one query for each value & caches large bitset
• Option B: facet.interval (Solr 4.10)
• Not for indexed date-ranges
• Requires DocValues (more index data)
• Supports variable/custom intervals
• New work-in-progress option: Facet on DateRangeField
• Ranges are fixed/pre-determined (months, days, etc.)
• Optimized for thousands of ranges to count
• Each value-range is only 1 term!
Future stuff I’m excited about
• Continuing works in progress
• Spatial heatmaps! Coming in January 2015!
• Lucene layer & Solr adapter
• Lucene term auto-prefixing LUCENE-5879
• Brings spatial, date, numeric, indexing/search to the next level!
• More prefix-tree optimizations
• Inner vs edge leaf cell differentiation for non-point shapes
• RPT + SerializedDVStrategy; skip accuracy checks for inner cells
• Don’t index leaf cells twice
That’s all for now; thanks for coming!
Need Lucene/Solr guidance or custom development?
Contact me!
Email: [email protected]
LinkedIn: http://www.linkedin.com/in/davidwsmiley
G+: +DavidSmiley
Twitter: @DavidWSmiley
ETA: December 2014