2014 11 lucene spatial temporal update


Upload: david-smiley

Post on 02-Jul-2015


DESCRIPTION

An update on recent spatial/geospatial support within Lucene/Solr; it affects ElasticSearch too.

TRANSCRIPT

Page 1: 2014 11 lucene spatial temporal update
Page 2: 2014 11 lucene spatial temporal update

The Latest in Spatial & Temporal Search

David Smiley

Page 3: 2014 11 lucene spatial temporal update

Agenda

Spatial

• Polygons and Accuracy: SerializedDVStrategy

• FlexPrefixTree

• BBoxSpatialStrategy

• Student/Intern contributions, Geodesics

Temporal

• Dates, and Date Ranges

• Search

• Faceting

Page 4: 2014 11 lucene spatial temporal update

About David Smiley

• Freelance search consultant / developer

• Expert Lucene/Solr development skills, advice (consulting), training

• Java (full-stack), Web, Spatial

• Apache Lucene / Solr committer & PMC, Eclipse LocationTech PMC

• Authored 1st book on Solr, plus two editions

• Presented at several conferences & meetups

• Taught several Solr classes, self-developed & LucidWorks

Page 5: 2014 11 lucene spatial temporal update

Lucene Spatial Overview

• Multiple approaches to index spatial data

abstract class SpatialStrategy

(5+ concrete implementations)

• RecursivePrefixTreeStrategy (RPT) is most prominent, versatile

• Grid based

• Uses Spatial4j lib for shapes, distance calculations, and WKT

• Uses JTS Topology Suite lib for polygons

[Diagram: Shape; SpatialPrefixTree / Cell (Geohash | Quad); PrefixTreeStrategy; IntersectsPrefixTreeFilter, Contains…, Within…]

Page 6: 2014 11 lucene spatial temporal update

SpatialPrefixTrees and Accuracy

RecursivePrefixTree (RPT) uses Lucene’s index as a PrefixTree

• Thus represents shapes as grid cells of varying precision by prefix

Example, a point shape:

• D, DR, DRT, DRT2, DRT2Y

Example, a polygon shape:

• Too many to list… 508 cells

More details here: http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/
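To make the "grid cells of varying precision by prefix" idea concrete, here is a minimal, self-contained sketch of geohash encoding that lists one cell per level for a point, in the spirit of the D, DR, DRT… example. The class and method names and the sample coordinates are illustrative assumptions; this is not Lucene's GeohashPrefixTree code.

```java
import java.util.ArrayList;
import java.util.List;

public class GeohashPrefixes {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Standard geohash encoding: alternately bisect longitude and latitude, 5 bits per character. */
    static String encode(double lat, double lon, int length) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder sb = new StringBuilder();
        boolean lonBit = true; // geohash starts with a longitude bit
        int bits = 0, value = 0;
        while (sb.length() < length) {
            if (lonBit) {
                double mid = (minLon + maxLon) / 2;
                if (lon >= mid) { value = (value << 1) | 1; minLon = mid; }
                else { value <<= 1; maxLon = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { value = (value << 1) | 1; minLat = mid; }
                else { value <<= 1; maxLat = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) {
                sb.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return sb.toString();
    }

    /** One cell per level: each longer prefix is a smaller cell nested inside the previous one. */
    static List<String> prefixCells(double lat, double lon, int levels) {
        String hash = encode(lat, lon, levels);
        List<String> cells = new ArrayList<>();
        for (int i = 1; i <= levels; i++) {
            cells.add(hash.substring(0, i));
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(prefixCells(42.6, -5.6, 5)); // [e, ez, ezs, ezs4, ezs42]
    }
}
```

A point yields exactly one cell per level, which is why its term count grows linearly with the number of levels.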

Page 7: 2014 11 lucene spatial temporal update

…continued

• For more accuracy, index more levels (longer prefixes)

• Points: linear relationship of levels to number of cells

• Non-points: exponential relationship…

RPT applies a distErrPct shape size ratio to non-point shapes to trade accuracy for scalability

• distErrPct=0.025 (2.5% of the radius, the default):

• Massachusetts: level 6

• USA: level 4 (not as precise)
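The effect of distErrPct can be sketched as follows: the tolerated error is a fraction of the shape's radius, and the grid level is the shallowest one whose cells are no larger than that error. This is a simplified illustration on a hypothetical quad grid spanning 360 degrees; the names and the 30-level cap are assumptions, and it will not reproduce the geohash levels quoted above.

```java
public class DistErrLevel {
    static final int MAX_LEVELS = 30; // assumed cap for this sketch

    /** Shallowest level whose cell width (360 / 2^level degrees) is <= the tolerated error. */
    static int levelForDistance(double distErrDegrees) {
        double cellWidth = 360.0;
        for (int level = 1; level <= MAX_LEVELS; level++) {
            cellWidth /= 2;
            if (cellWidth <= distErrDegrees) {
                return level;
            }
        }
        return MAX_LEVELS;
    }

    /** The tolerated error scales with the shape: distErr = radius * distErrPct. */
    static int levelForShape(double radiusDegrees, double distErrPct) {
        return levelForDistance(radiusDegrees * distErrPct);
    }

    public static void main(String[] args) {
        System.out.println(levelForShape(1.0, 0.025));  // small shape: deeper level
        System.out.println(levelForShape(25.0, 0.025)); // large shape: shallower level
    }
}
```

Larger shapes tolerate a larger absolute error, so they are indexed to fewer levels, which keeps the cell count bounded.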

Page 8: 2014 11 lucene spatial temporal update

SerializedDVStrategy (Lucene 4.7)

• Stores serialized geometry into Lucene BinaryDocValues

• It’s as accurate as the underlying geometry coordinates/shape

• But it’s not a spatial index – it’s retrievable on a per-document basis

• Use RPT + SerializedDV for speed and accuracy!

• More to come eventually:

• Solr adapter – SOLR-5728, ElasticSearch adapter #2361

• Speed: Skip the serialized geometry check for non-edge cells (LUCENE-5579)

Page 9: 2014 11 lucene spatial temporal update

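Sample Code

SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects, point);

// Grid-based index: fast, but approximate at shape edges
SpatialStrategy treeStrategy = new RecursivePrefixTreeStrategy(grid, "geometry");

// Serialized geometry in doc values: exact, checked per document
SpatialStrategy verifyStrategy = new SerializedDVStrategy(ctx, "serialized_geometry");

Query treeQuery = new ConstantScoreQuery(treeStrategy.makeFilter(args));

// Run the prefix-tree query first, then verify each match against the real geometry
Query combinedQuery = new FilteredQuery(treeQuery, verifyStrategy.makeFilter(args), FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);

Code is from a related presentation by the Climate Corporation, presented at FOSS4G 2014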

Page 10: 2014 11 lucene spatial temporal update

FlexPrefixTree (Coming to Lucene 5)

• A new SpatialPrefixTree by Varun Shenoy (GSoC 2014)!

• LUCENE-4922; Still needs to be committed. Goal is for 5.0.

• More optimized, more flexible, than Geohash & Quad

• Configurable sub-cells at each level: 4, 16, 64, 256

• You choose trade-off between index speed/disk size & search speed

• Internally uses an integer coordinate system

• Rectangle searches are particularly fast; minimal floating-point conversion

• Cells are always squares (equal sides) – better for heatmaps

• YMMV: 10% - 100% faster than GeohashPrefixTree
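The sub-cell trade-off can be illustrated with a little arithmetic: with 4, 16, 64 or 256 sub-cells, each level resolves 1, 2, 3 or 4 bits per axis, so more sub-cells means fewer levels to reach a given resolution. The sketch below is an assumption-laden illustration (the names and the 32-bit resolution are made up), not FlexPrefixTree code.

```java
public class SubCellLevels {
    /**
     * Levels needed to resolve bitsPerAxis bits on each axis when every cell
     * splits into subCells children (subCells must be 4, 16, 64 or 256).
     */
    static int levelsNeeded(int bitsPerAxis, int subCells) {
        if (subCells != 4 && subCells != 16 && subCells != 64 && subCells != 256) {
            throw new IllegalArgumentException("subCells must be 4, 16, 64 or 256");
        }
        // subCells = 4^k splits each axis into 2^k parts, i.e. k bits per axis per level
        int bitsPerLevel = Integer.numberOfTrailingZeros(subCells) / 2;
        return (bitsPerAxis + bitsPerLevel - 1) / bitsPerLevel; // ceiling division
    }

    public static void main(String[] args) {
        for (int subCells : new int[] {4, 16, 64, 256}) {
            System.out.println(subCells + " sub-cells -> " + levelsNeeded(32, subCells) + " levels");
        }
    }
}
```

Fewer levels means fewer terms per indexed point, at the cost of more sub-cells to consider at each level during search.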

Page 11: 2014 11 lucene spatial temporal update

BBoxSpatialStrategy (Lucene 4.10)

• Rectangles (BBox’s) only, one value per field

• Wide predicate support

• Equals, Intersects, Within, Contains, Disjoint

• Accurate (8-byte double floating point)

• Area overlap relevancy

• Weight search results by a combination of query shape overlap & index shape overlap ratios

• Solr BBoxField…

Page 12: 2014 11 lucene spatial temporal update

Solr BBoxField

• Schema configuration

<field name="bbox" type="bbox" />
<fieldType name="bbox" class="solr.BBoxField"
    geo="true" units="degrees" numberType="_bbox_coord" />
<fieldType name="_bbox_coord" class="solr.TrieDoubleField"
    precisionStep="8" docValues="true" stored="false"/>

• Search with overlap ratio ordering

&q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))

• score can be: overlapRatio, area, area2D
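As a rough illustration of area overlap relevancy, the sketch below scores an indexed rectangle by blending two ratios: how much of the query the intersection covers, and how much of the indexed box it covers. The blending weight, the names, and the planar (no dateline) geometry are assumptions for illustration; the exact formula used by BBoxField is not given in these slides.

```java
public class OverlapRatio {
    /** Area of a rectangle given as {minX, minY, maxX, maxY}; zero if it is empty. */
    static double area(double minX, double minY, double maxX, double maxY) {
        return Math.max(0, maxX - minX) * Math.max(0, maxY - minY);
    }

    /**
     * Blend of (intersection / query area) and (intersection / document area).
     * queryWeight is a hypothetical tuning knob in [0, 1].
     */
    static double score(double[] q, double[] d, double queryWeight) {
        double intersection = area(
                Math.max(q[0], d[0]), Math.max(q[1], d[1]),
                Math.min(q[2], d[2]), Math.min(q[3], d[3]));
        if (intersection == 0) {
            return 0;
        }
        double queryRatio = intersection / area(q[0], q[1], q[2], q[3]);
        double docRatio = intersection / area(d[0], d[1], d[2], d[3]);
        return queryWeight * queryRatio + (1 - queryWeight) * docRatio;
    }

    public static void main(String[] args) {
        double[] query = {0, 0, 10, 10};
        System.out.println(score(query, new double[] {0, 0, 5, 5}, 0.25));   // small box fully inside the query
        System.out.println(score(query, new double[] {5, 5, 15, 15}, 0.25)); // partial overlap
    }
}
```

With this weighting, a small box that lies entirely inside the query outranks a same-sized box that only partially overlaps it.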

Page 13: 2014 11 lucene spatial temporal update

Recent Student/Intern Contributions

• Varun Shenoy via GSOC: summer 2014

• Lucene spatial: new “FlexPrefixTree” – an optimized grid

• Rebecca Alford via F.B. Open-Academy: winter 2014

• Spatial4j: geodesic polygons

• Chris Pavlicek via F.B. Open-Academy: winter 2014

• Spatial4j: geodesic buffered lines

• Evana Gizzi, MITRE intern: winter 2014

• Spatial4j: geodesic circle polygonizer

• Liviy Ambrose, MITRE intern: fall 2013

• Lucene spatial: integrated with Lucene’s benchmark module

Page 14: 2014 11 lucene spatial temporal update

Temporal/Date Durations

or basically any numeric ranges

Page 15: 2014 11 lucene spatial temporal update

Approach: Simple Two-field

(as you might do in SQL or any system without native range types)

• A start-time & end-time field pair

• A search window (time span) becomes two range queries

• details vary by predicate (Intersects, Contains, vs. Within)

• Single-valued only

• …even though Lucene supports multi-valued fields

• Theoretically possible but would be a lot of work

• because Lucene doesn’t store “position” info for numeric fields

• because numeric range/prefix queries are position-less
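The way a search window turns into two range conditions, one on each field, can be sketched per predicate. This is plain Java over a single start/end pair (names are illustrative); in Lucene or Solr each comparison would be a range query on the corresponding field.

```java
public class TwoFieldRange {
    /** Document range overlaps the query window: start <= qEnd AND end >= qStart. */
    static boolean intersects(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qEnd && docEnd >= qStart;
    }

    /** Document range contains the query window: start <= qStart AND end >= qEnd. */
    static boolean contains(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart <= qStart && docEnd >= qEnd;
    }

    /** Document range lies within the query window: start >= qStart AND end <= qEnd. */
    static boolean within(long docStart, long docEnd, long qStart, long qEnd) {
        return docStart >= qStart && docEnd <= qEnd;
    }

    public static void main(String[] args) {
        System.out.println(intersects(10, 20, 15, 25)); // true
        System.out.println(contains(10, 30, 15, 25));   // true
        System.out.println(within(16, 20, 15, 25));     // true
    }
}
```

Each condition tests the start and end fields independently, which is why this only works when a document has a single range: with several values per field, a start from one range could pair with an end from another.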

Page 16: 2014 11 lucene spatial temporal update

Approach: 2D Spatial PrefixTree

• Lucene Spatial QuadPrefixTree (2D) with RPT Strategy

• Use ‘x’ for start-time, ‘y’ for end-time

• A search window (time span) becomes a rectangle query

• details vary by predicate (Intersects, Contains, vs. Within)

• Cool…

• But floating-point edge issues

• Only ~50 levels supported; not 64

Details: http://wiki.apache.org/solr/SpatialForTimeDurations
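A minimal sketch of the point-in-rectangle trick: each duration is the point (start, end), and each predicate picks a different query rectangle in that plane. The Rect class, the method names, and the min/max bounds of the time axis are illustrative assumptions; with Lucene the rectangle would be handed to RPT as an Intersects query.

```java
public class DurationAsPoint {
    /** Axis-aligned rectangle in (start, end) space; bounds are inclusive. */
    static final class Rect {
        final long minX, maxX, minY, maxY;

        Rect(long minX, long maxX, long minY, long maxY) {
            this.minX = minX; this.maxX = maxX; this.minY = minY; this.maxY = maxY;
        }

        boolean containsPoint(long x, long y) {
            return x >= minX && x <= maxX && y >= minY && y <= maxY;
        }
    }

    /** Durations overlapping [qStart, qEnd]: start <= qEnd and end >= qStart. */
    static Rect intersectsRect(long qStart, long qEnd, long min, long max) {
        return new Rect(min, qEnd, qStart, max);
    }

    /** Durations containing [qStart, qEnd]: start <= qStart and end >= qEnd. */
    static Rect containsRect(long qStart, long qEnd, long min, long max) {
        return new Rect(min, qStart, qEnd, max);
    }

    /** Durations within [qStart, qEnd]: both start and end fall inside the window. */
    static Rect withinRect(long qStart, long qEnd) {
        return new Rect(qStart, qEnd, qStart, qEnd);
    }

    public static void main(String[] args) {
        // Document duration [10, 20] is the point (10, 20); query window is [15, 25].
        System.out.println(intersectsRect(15, 25, 0, 100).containsPoint(10, 20)); // true
        System.out.println(containsRect(15, 25, 0, 100).containsPoint(10, 20));   // false
    }
}
```

Because every duration is a single 2D point, multi-valued fields work naturally: each value is tested as a whole rather than as independent start and end terms.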

Page 17: 2014 11 lucene spatial temporal update

Approach: DateRangePrefixTree (Lucene 5)

• A new 1D SpatialPrefixTree: NumberRangePrefixTree

• NumberRangePrefixTree w/ DateRangePrefixTree subclass

• NR-SPT: Configurable sub-cells per level; no level limit

• Not just for ranges; instances too

• Index/Search with NumberRangePrefixTreeStrategy

• Indexing, and search predicate code (e.g. Intersects…) completely re-used

• DateRangePrefixTree

• 9 Levels: 1M years, 1K years, years, months, days, hours, minutes, seconds, millis

…continued…
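To picture the nine levels, the sketch below splits a timestamp into one component per level, from millions of years down to milliseconds, and truncates the path at a chosen precision. The decomposition and names are illustrative assumptions (non-negative years only); the actual term encoding used by DateRangePrefixTree is not shown in these slides.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

public class DateCellPath {
    static final String[] LEVELS = {
        "millionYears", "thousandYears", "years", "months", "days",
        "hours", "minutes", "seconds", "millis"
    };

    /** One component per level, coarsest first, truncated to the requested number of levels. */
    static List<Integer> path(LocalDateTime t, int levels) {
        int year = t.getYear(); // assumed non-negative in this sketch
        int[] all = {
            year / 1_000_000, (year / 1000) % 1000, year % 1000,
            t.getMonthValue(), t.getDayOfMonth(),
            t.getHour(), t.getMinute(), t.getSecond(), t.getNano() / 1_000_000
        };
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < levels; i++) {
            out.add(all[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2014, 5, 21, 12, 0, 0);
        System.out.println(path(t, LEVELS.length)); // full millisecond precision: 9 components
        System.out.println(path(t, 5));             // truncated to the day: 5 components
    }
}
```

A date truncated to a coarser precision simply stops at a shallower level, so an instance never needs more cells than there are levels.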

Page 18: 2014 11 lucene spatial temporal update

Trade-offs of N/D-SPT

• Indexing:

• “Common” date-ranges use ~ <50 terms, but random millisecond ranges use up to ~14K terms

• All date instances (not a range) <= 9 terms

• Comparison to 2D SPT: instance or range, always 50

• Search:

• Query for “common” query ranges faster than uncommon

• Comparison to 2D SPT:

• Contains & Within predicates: overlapping values per document get coalesced, can’t be differentiated

Page 19: 2014 11 lucene spatial temporal update

Solr DateRangeField

• Configuration in schema.xml:

<field name="dateRange" type="dateRange" />

<fieldType name="dateRange" class="solr.DateRangeField" />

• Index field data, examples:

• 2014-05-21T12:00:00.000Z (same as TrieDate)

• 2014-05-21T12 (truncated to desired precision)

• [1990 TO 1995]

• Query, examples:

• fq=dateRange:[* TO 2014-05-21]

• fq={!field f=dateRange op=Contains} [2000 TO 2014-05-21]
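A truncated date such as 2014-05 stands for the whole span it names. The sketch below expands a truncated string into an explicit start (inclusive) and end (exclusive) using java.time; it only illustrates the interpretation and handles just four precisions. The class and method names are assumptions, and this is not Solr's parser.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.Year;
import java.time.YearMonth;

public class TruncatedDate {
    /** Returns {inclusiveStart, exclusiveEnd} for "yyyy", "yyyy-MM", "yyyy-MM-dd" or "yyyy-MM-ddTHH". */
    static LocalDateTime[] expand(String s) {
        switch (s.length()) {
            case 4: {
                Year y = Year.parse(s);
                return new LocalDateTime[] {
                    y.atDay(1).atStartOfDay(), y.plusYears(1).atDay(1).atStartOfDay()};
            }
            case 7: {
                YearMonth ym = YearMonth.parse(s);
                return new LocalDateTime[] {
                    ym.atDay(1).atStartOfDay(), ym.plusMonths(1).atDay(1).atStartOfDay()};
            }
            case 10: {
                LocalDate d = LocalDate.parse(s);
                return new LocalDateTime[] {d.atStartOfDay(), d.plusDays(1).atStartOfDay()};
            }
            case 13: {
                LocalDateTime h = LocalDate.parse(s.substring(0, 10))
                        .atTime(Integer.parseInt(s.substring(11)), 0);
                return new LocalDateTime[] {h, h.plusHours(1)};
            }
            default:
                throw new IllegalArgumentException("Unsupported precision: " + s);
        }
    }

    public static void main(String[] args) {
        LocalDateTime[] span = expand("2014-05-21T12");
        System.out.println(span[0] + " to " + span[1]);
    }
}
```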

Page 20: 2014 11 lucene spatial temporal update

Visualizing Date Facets

• http://bl.ocks.org/mbostock/4063318

Page 21: 2014 11 lucene spatial temporal update

Date Faceting

• Option A: facet.range

• Not for indexed date-ranges

• Internally executes one query for each value & caches large bitset

• Option B: facet.interval (Solr 4.10)

• Not for indexed date-ranges

• Requires DocValues (more index data)

• Supports variable/custom intervals

• New work-in-progress option: Facet on DateRangeField

• Ranges are fixed/pre-determined (months, days, etc.)

• Optimized for thousands of ranges to count

• Each value-range is only 1 term!
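The "each value-range is only 1 term" idea can be pictured with a string-prefix analogue: if every date carries its month as a prefix, counting per month is a single grouping pass rather than one range query per bucket. This is an in-memory illustration with assumed names, not the work-in-progress Solr faceting code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixFacets {
    /** Counts ISO dates per fixed-length prefix, e.g. 4 = year, 7 = month, 10 = day. */
    static Map<String, Integer> countByPrefix(List<String> isoDates, int prefixLength) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String date : isoDates) {
            counts.merge(date.substring(0, prefixLength), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> dates = Arrays.asList("2014-05-21", "2014-05-02", "2014-06-01");
        System.out.println(countByPrefix(dates, 7)); // {2014-05=2, 2014-06=1}
    }
}
```

The buckets are fixed by the prefix levels (years, months, days), which matches the "fixed/pre-determined" limitation noted above.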

Page 22: 2014 11 lucene spatial temporal update

Future stuff I’m excited about

• Continuing works-in-progress

• Spatial heatmaps! Coming in January 2015!

• Lucene layer & Solr adapter

• Lucene term auto-prefixing LUCENE-5879

• Brings spatial, date, numeric, indexing/search to the next level!

• More prefix-tree optimizations

• Inner vs edge leaf cell differentiation for non-point shapes

• RPT + SerializedDVStrategy; skip accuracy checks for inner cells

• Don’t index leaf cells twice

Page 23: 2014 11 lucene spatial temporal update

That’s all for now; thanks for coming!

Need Lucene/Solr guidance or custom development?

Contact me!

Email: [email protected]

LinkedIn: http://www.linkedin.com/in/davidwsmiley

G+: +DavidSmiley

Twitter: @DavidWSmiley

ETA: December 2014