graphs, graphs everywhere - lucene powered relation exploration


Upload: zbyszko-papierski

Post on 14-Feb-2017


TRANSCRIPT

Graphs, graphs everywhere

Zbyszko Papierski, Senior Dev@JIRA Cloud, T:@ZPapierski

E: [email protected]

Lucene powered relation exploration

Agenda

1. Introduction to Lucene and friends
2. Evolution of data analysis by Solr and Elasticsearch
3. Graph capabilities of Elasticsearch (briefly)
4. Solr - QueryParserPlugin
5. Solr - Streaming Expressions
6. Examples

http://bit.do/graphs-src

1. Create a collection
2. Put schema
3. Run feeder
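
The three setup steps can be sketched as plain HTTP requests against Solr's Collections API, Schema API and update handler. This is only an illustration, not the scripts from the linked repository: the host, the collection name `emails` and the sample document are assumptions, and the `email`/`friends` fields are borrowed from the graph query shown later in the deck. The functions only build the requests; nothing is sent.

```python
import json
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed local Solr instance
COLLECTION = "emails"                # hypothetical collection name


def create_collection_url(name, shards=1, replicas=1):
    """Step 1: URL for the Collections API CREATE action."""
    params = {"action": "CREATE", "name": name,
              "numShards": shards, "replicationFactor": replicas}
    return f"{SOLR}/admin/collections?{urlencode(params)}"


def schema_request(name):
    """Step 2: (url, JSON body) for a Schema API add-field command."""
    body = {"add-field": [
        {"name": "email", "type": "string", "stored": True},
        {"name": "friends", "type": "string", "stored": True, "multiValued": True},
    ]}
    return f"{SOLR}/{name}/schema", json.dumps(body)


def feed_request(name, docs):
    """Step 3: (url, JSON body) for posting documents to the update handler."""
    return f"{SOLR}/{name}/update?commit=true", json.dumps(docs)


create_url = create_collection_url(COLLECTION)
schema_url, schema_body = schema_request(COLLECTION)
feed_url, feed_body = feed_request(
    COLLECTION,
    [{"id": "1", "email": "a@example.com", "friends": ["b@example.com"]}],
)
print(create_url)
print(schema_url)
print(feed_url)
```

Each pair can be sent with any HTTP client, using `Content-Type: application/json` for the two POST bodies.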

Lucene and friends

Lucene

Provides a mechanism for fast searching of text data: both full-text search (analyzed data) and exact match (non-analyzed fields, or docValues)

Step one - indexing

{kitty|kitten|cat|cats|kittens|pussy} -> cat
{is} ->
{GORGEOUS!!!} -> gorgeous, pretty, nice, etc.

Step two - searching

{very} -> very
{nice} -> nice
{kitty} -> cat

{nice, cat, …}
{very, ugly, cat, …}
{very, nice, dog, …}
{very, nice, bear, …}

Step three - scoring

{very} -> very
{nice} -> nice
{kitty} -> cat

{nice, cat, …}
{very, ugly, cat, …}
{very, nice, dog, …}
{very, nice, bear, …}

Winner!

nice and cat score higher than very and nice, or very and cat, because cat is rarer than very

(this is only an example, all cats are nice…)
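
The ranking can be reproduced with a toy TF-IDF scorer. This is a rough sketch, not Lucene's actual similarity: it treats the four documents as exactly the terms shown (ignoring the elided ones), and the idf smoothing and length normalization are assumptions loosely modelled on classic TF-IDF.

```python
import math

# The four toy documents from the slides, reduced to the terms shown.
docs = {
    "d1": ["nice", "cat"],
    "d2": ["very", "ugly", "cat"],
    "d3": ["very", "nice", "dog"],
    "d4": ["very", "nice", "bear"],
}
query = ["very", "nice", "cat"]  # "very nice kitty" after analysis


def idf(term):
    """Rarer terms get a higher weight (assumed smoothing: 1 + ln(N / (df + 1)))."""
    df = sum(1 for terms in docs.values() if term in terms)
    return 1.0 + math.log(len(docs) / (df + 1))


def score(terms):
    """Sum the idf of matching query terms, normalized by document length."""
    matched = sum(idf(t) for t in query if t in terms)
    return matched / math.sqrt(len(terms))


ranking = sorted(docs, key=lambda d: score(docs[d]), reverse=True)
for d in ranking:
    print(d, round(score(docs[d]), 3))
```

In this sketch the documents containing cat outrank the two that match only very and nice, because cat occurs in fewer documents. The {nice, cat} document beats {very, ugly, cat} only through the assumed length normalization, since very and nice are equally common in this tiny corpus.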

Solr

Older, works more closely with Lucene

Elasticsearch

Newer, but with more toys

Waiter, there is a graph in my full-text search engine!

are relations

How did we get here?

• full text searching
• faceting/aggregation
• statistical
• relationship exploration

1. Your standard, full-text search
2. TF-IDF-ish relationship sorting
3. It’s already there

It’s still your standard Lucene index

Elasticsearch+Kibana

Plugin for Elasticsearch and Kibana - Graph

• From Elasticsearch 2.3
• REST API - /_graph/explore
• visualization for Kibana
• Part of Elastic's commercial offering (named X-Pack from 5.0)

picture from: https://www.elastic.co/guide/en/graph/current/graph-introduction.html

Solr

built-in GraphQueryParser

• Available from Solr 6.0
• experimental feature
• currently works only for single-node, single-core applications (due to change)
• no 1st party visualization
• does not track edges of the traversal

picture from: http://solr.pl/2016/04/25/wizualizacja-grafow-przy-pomocy-solr-6/

Solr

built-in Streaming Expressions

• Available from Solr 5.5
• experimental feature
• no 1st party visualization
• does track edges of the traversal and level

picture from: http://solr.pl/2016/04/25/wizualizacja-grafow-przy-pomocy-solr-6/

fq={!graph from=email to=friends maxDepth=2}email:"[email protected]"
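
A filter like the one above is passed to Solr as an ordinary `fq` parameter. Here is a minimal sketch of building such a request URL, assuming a local Solr instance and a hypothetical `people` collection; the root address `someone@example.com` is a placeholder, since the address on the slide is obscured.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_SELECT = "http://localhost:8983/solr/people/select"  # assumed host and collection
ROOT_EMAIL = "someone@example.com"                        # placeholder root node


def graph_filter_url(root_email, max_depth=2):
    """Build a /select URL whose fq walks email -> friends from the root document."""
    fq = '{!graph from=email to=friends maxDepth=%d}email:"%s"' % (max_depth, root_email)
    # urlencode takes care of escaping braces, quotes and spaces.
    return SOLR_SELECT + "?" + urlencode({"q": "*:*", "fq": fq, "wt": "json"})


url = graph_filter_url(ROOT_EMAIL)
decoded_fq = parse_qs(urlsplit(url).query)["fq"][0]
print(url)
print(decoded_fq)
```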

Params: traversalFilter

Filter query used to filter out incoming nodes on each iteration

Params: returnRoot

Whether the root set of documents (found by the initial query) should be returned. Default: true

Params: returnOnlyLeaf

Whether only leaf documents should be returned. Default: false
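
The three parameters go into the same local-params block as from, to and maxDepth. Here is a small sketch of a helper that assembles that block; the helper itself, the `active:true` filter and the placeholder address are illustrative assumptions, while the parameter names are the ones listed above.

```python
def graph_local_params(from_field, to_field, max_depth=None,
                       traversal_filter=None, return_root=True,
                       return_only_leaf=False):
    """Assemble a {!graph ...} local-params prefix for the graph query parser."""
    parts = ["!graph", f"from={from_field}", f"to={to_field}"]
    if max_depth is not None:
        parts.append(f"maxDepth={max_depth}")
    if traversal_filter is not None:
        # Quote the filter so it may contain spaces; escape embedded quotes.
        escaped = traversal_filter.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'traversalFilter="{escaped}"')
    parts.append(f"returnRoot={str(return_root).lower()}")
    parts.append(f"returnOnlyLeaf={str(return_only_leaf).lower()}")
    return "{" + " ".join(parts) + "}"


# Hypothetical usage: skip inactive people and leave the root out of the results.
fq = graph_local_params("email", "friends", max_depth=2,
                        traversal_filter="active:true",
                        return_root=False) + 'email:"someone@example.com"'
print(fq)
```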

Streaming Expressions

• a new, alternative way of creating and processing queries
• allow chaining functions
• also experimental
• graph functions - shortestPath, gatherNodes, scoreNodes

Streaming Expressions example
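
The example from this slide did not survive in the transcript. As a stand-in, here is a small sketch of how chained functions compose into one expression string, which is then sent to the collection's `/stream` handler as the `expr` parameter. The `expr` helper, the `emails` collection and its `from`/`to` fields are assumptions made for illustration.

```python
from urllib.parse import urlencode


def expr(name, *args, **params):
    """Render a streaming-expression call: name(arg, ..., key="value", ...)."""
    rendered = list(args) + [f'{key}="{value}"' for key, value in params.items()]
    return f"{name}({', '.join(rendered)})"


# Inner source function: export sender/recipient pairs sorted by sender.
inner = expr("search", "emails", q="*:*", fl="from,to", sort="from asc", qt="/export")
# Outer decorator function: keep one tuple per distinct sender.
chained = expr("unique", inner, over="from")

stream_url = "http://localhost:8983/solr/emails/stream"  # assumed host and collection
body = urlencode({"expr": chained})  # form-encoded POST body
print(chained)
```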

shortestPath

• one of the source functions - a function that produces a tuple stream
• returns the shortest path between two given nodes, using an iterative breadth-first search of the graph

shortestPath - params

• collection - collection on which the search is performed
• from - starting node
• to - ending node
• edge - definition of the edge, in the format <from_field>=<to_field>
• fq - filter query that restricts which nodes are taken into account
• maxDepth - maximum depth of the traversal
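
Put together, these parameters form a single expression. Here is a minimal sketch that assembles one, assuming a hypothetical `emails` collection whose documents carry `from` and `to` fields; both addresses are placeholders.

```python
def shortest_path_expr(collection, start, end, from_field, to_field,
                       fq=None, max_depth=None):
    """Build a shortestPath(...) streaming expression from the listed parameters."""
    params = [
        f'from="{start}"',
        f'to="{end}"',
        f'edge="{from_field}={to_field}"',
    ]
    if fq is not None:
        params.append(f'fq="{fq}"')
    if max_depth is not None:
        params.append(f'maxDepth="{max_depth}"')
    return f"shortestPath({collection}, " + ", ".join(params) + ")"


expression = shortest_path_expr("emails", "a@example.com", "b@example.com",
                                "from", "to", max_depth=4)
print(expression)
```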

gatherNodes

• transforms an input document stream into a stream of documents reachable through graph traversal
• can return edges
• allows nesting functions
• works for multi-collection streams, regardless of the number of cluster nodes
• is also a source function
• currently does not support multivalued fields

gatherNodes - params

• collection - collection on which the function will be performed
• walk - defines the starting nodes and the field, e.g. "[email protected]>from"
• gather - defines which fields are gathered
• scatter - parameter that can take one or both of the values:
  • leaves - emits only leaf nodes (outermost ones)
  • branches - emits nodes leading up to leaves (the root node is a branch)
• fq - filter query that filters out nodes
• maxDocFreq - nodes whose document frequency exceeds this number are filtered out

Aggregations, cross-collection gathering, and combining with other streaming expressions are possible
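
Nesting is easiest to see in a concrete expression. The sketch below builds a two-level traversal, assuming a hypothetical `emails` collection with `from` and `to` fields and a placeholder starting address. The `->` separator in `walk` and the `node` field emitted by the inner call follow the Solr reference guide.

```python
def gather_nodes_expr(collection, walk, gather, inner=None,
                      scatter=None, fq=None, max_doc_freq=None):
    """Build a gatherNodes(...) expression; `inner` is an optional nested stream."""
    args = [collection]
    if inner is not None:
        args.append(inner)
    args.append(f'walk="{walk}"')
    args.append(f'gather="{gather}"')
    if scatter:
        joined = ", ".join(scatter)
        args.append(f'scatter="{joined}"')
    if fq is not None:
        args.append(f'fq="{fq}"')
    if max_doc_freq is not None:
        args.append(f'maxDocFreq="{max_doc_freq}"')
    return "gatherNodes(" + ", ".join(args) + ")"


# Level 1: everyone the starting address has written to.
level1 = gather_nodes_expr("emails", "a@example.com->from", "to")
# Level 2: everyone those people have written to; emit branches and leaves.
level2 = gather_nodes_expr("emails", "node->from", "to", inner=level1,
                           scatter=["branches", "leaves"])
print(level2)
```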

scoreNodes

• Function used only with the output of gatherNodes
• Scores document relevancy using a TF-IDF formula
  • TF - how often the document appeared during graph traversal
  • IDF - fetched from the document's original collection
• Adds an additional field, nodeScore, to the output stream
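
The scoring idea can be illustrated outside Solr. The sketch below is not Solr's implementation: the tuple keys `count` and `docFreq`, the collection size and the idf smoothing are all assumptions, chosen only to show how traversal frequency and collection rarity combine into nodeScore.

```python
import math


def score_nodes(tuples, num_docs):
    """Attach a nodeScore = TF * IDF to each gathered node (assumed formula)."""
    scored = []
    for t in tuples:
        # TF: how often the node appeared during the traversal.
        tf = t["count"]
        # IDF: rarity of the node in its original collection (assumed smoothing).
        idf = math.log(num_docs / (t["docFreq"] + 1)) + 1.0
        scored.append({**t, "nodeScore": tf * idf})
    return scored


# Hypothetical traversal output: one very common node and one rare node.
nodes = [
    {"node": "common", "count": 5, "docFreq": 900},
    {"node": "rare", "count": 4, "docFreq": 9},
]
scored = score_nodes(nodes, num_docs=1000)
best = max(scored, key=lambda t: t["nodeScore"])
print(best["node"])
```

The rare node comes out on top despite appearing slightly less often in the traversal, which is the point of weighting by IDF.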

Thank you!