Elasticsearch: Distributed search & analytics on BigData made easy


Post on 15-Jul-2015


Itamar Syn-Hershko

http://code972.com

@synhershko

Elasticsearch: Distributed search & analytics on BigData made easy

Me?

• Itamar Syn-Hershko / @synhershko

• Lucene.NET PMC and lead committer

• Freelance consultant and developer

• Elasticsearch consulting partner

• Microsoft MVP

• RavenDB

– Ex-core developer

– “RavenDB in Action” author


An index

Elasticsearch

• Powered by Apache Lucene

• Open-source

• Rapid growth

• High profile users world-wide

REST API

• Indexes

• Types

• IDs

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
  "user": "synhershko",
  "post_date": "2013-05-30T14:12:12",
  "message": "trying out Elastic Search",
  "followers": 3,
  "registered": true
}'
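The index / type / ID addressing in the request above can also be sketched in Python. This is a minimal sketch using only the standard library; it builds the same PUT request but does not send it. The helper name `build_index_request` is hypothetical, and the `/index/type/id` URL layout simply mirrors the curl call on the slide (later Elasticsearch releases changed how types work, so treat the path as illustrative).

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Newer Elasticsearch versions expect an explicit JSON content type.
        headers={"Content-Type": "application/json"},
    )


tweet = {
    "user": "synhershko",
    "post_date": "2013-05-30T14:12:12",
    "message": "trying out Elastic Search",
    "followers": 3,
    "registered": True,
}

request = build_index_request("http://localhost:9200", "twitter", "tweet", 1, tweet)
print(request.get_method(), request.full_url)

# To actually send it (assumes a node is listening on localhost:9200):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```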

Full-Text Search

Full-text Search 101: The inverted index

6 documents to index

1  The old night keeper keeps the keep in the town
2  In the big old house in the big old gown.
3  The house in the town had the big old keep
4  Where the old night keeper never did sleep.
5  The night keeper keeps the keep in the night
6  And keeps in the dark and sleeps in the light.

The index: Dictionary and posting lists

Term     Documents
and      <6>
big      <2> <3>
dark     <6>
did      <4>
gown     <2>
had      <3>
house    <2> <3>
in       <1> <2> <3> <5> <6>
keep     <1> <3> <5>
keeper   <1> <4> <5>
keeps    <1> <5> <6>
light    <6>
never    <4>
night    <1> <4> <5>
old      <1> <2> <3> <4>
sleep    <4>
sleeps   <6>
the      <1> <2> <3> <4> <5> <6>
town     <1> <3>
where    <4>

Example from: Justin Zobel, Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR), v.38 n.2, p.6-es, 2006


User queries for “keeper”: the dictionary lookup returns the posting list <1> <4> <5>
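The dictionary and posting lists can be reproduced with a few lines of Python. This is a minimal sketch, not how Lucene stores its index: the `tokenize` helper is a hypothetical whitespace splitter that lowercases and strips trailing punctuation, and posting lists are plain sorted lists of document numbers.

```python
from collections import defaultdict

DOCUMENTS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}


def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase (a deliberately naive analyzer)."""
    return [word.strip(".,").lower() for word in text.split()]


def build_inverted_index(documents):
    """Map every term to the sorted list of document numbers that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in sorted(postings.items())}


index = build_inverted_index(DOCUMENTS)
print(index["keeper"])  # → [1, 4, 5]
```

A query for a single term is then just a dictionary lookup; the cost depends on the length of the posting list rather than on the number of documents scanned.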

Term Normalization

• Lowercasing

• Stop words (shown in grey on the slide)

– Not best practice anymore

• Stemming

– Porter stemmer

– s-stemmer

• Relevance++

• SizeOnDisk--
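The normalization steps above can be sketched as a small pipeline. This is an illustrative sketch only: the stop-word set is an assumption (the slide's grey highlighting is not recoverable here), and `s_stem` is a simplified plural stripper in the spirit of the S-stemmer rules, not the Porter stemmer and not Lucene's implementation.

```python
# Assumed stop words, chosen for illustration only.
STOP_WORDS = {"and", "in", "the"}


def s_stem(word):
    """Strip simple plural suffixes (simplified S-stemmer style rules)."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word


def normalize(text):
    """Lowercase, drop stop words, then stem what is left."""
    tokens = [word.strip(".,").lower() for word in text.split()]
    return [s_stem(token) for token in tokens if token and token not in STOP_WORDS]


print(normalize("And keeps in the dark and sleeps in the light."))
# → ['keep', 'dark', 'sleep', 'light']
```

After normalization "keeps" and "keep" share one dictionary entry, so a query for either finds both (better recall), and the dictionary has fewer terms to store.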

Full-Text Search

Your data store

How hard is it to get search right, anyway?

Relevance

• Precision: the fraction of the retrieved documents that are relevant

• Recall: the fraction of the relevant documents that are retrieved

• Order of results
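Precision and recall as defined above reduce to two set ratios. A minimal sketch, with made-up document IDs used purely for illustration (the function name `precision_recall` is hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set against a known relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative IDs: 4 documents came back, 5 were actually relevant, 3 overlap.
precision, recall = precision_recall(retrieved=[1, 4, 5, 6], relevant=[1, 2, 3, 4, 5])
print(precision, recall)  # → 0.75 0.6
```

Neither number says anything about the order of results, which is why ranking is listed as a separate concern.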

Challenges with search

• Relevance

• Getting the tokens right

– Tokenization

– Stemming

• Multi-lingual content

– Or other cross-cutting search concerns

• Tolerance

Real-time Analytics

Pipeline diagram: “Shippers”, Queue (Redis), “Indexer”

Scaling out

Moar use cases!

#1: Real-Time Alerting System

Percolation
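Percolation turns search around: queries are stored up front, and each incoming document is matched against them. The toy class below is only a sketch of that idea and is not the Elasticsearch percolator API; `ToyPercolator`, its methods, and the alert names are all hypothetical, and a "query" here is simply a set of terms that must all appear in the document.

```python
class ToyPercolator:
    """Store queries first, then ask which of them match a new document."""

    def __init__(self):
        self._queries = {}

    def register(self, query_id, required_terms):
        """Register a query that matches when all of its terms are present."""
        self._queries[query_id] = {term.lower() for term in required_terms}

    def percolate(self, text):
        """Return the IDs of every stored query that the document satisfies."""
        tokens = {word.strip(".,!?").lower() for word in text.split()}
        return sorted(
            query_id
            for query_id, terms in self._queries.items()
            if terms <= tokens
        )


percolator = ToyPercolator()
percolator.register("disk-alert", ["disk", "full"])
percolator.register("login-alert", ["failed", "login"])

print(percolator.percolate("Disk /dev/sda1 is full!"))  # → ['disk-alert']
```

For alerting, each matching query ID would trigger a notification as soon as the document arrives.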

#2: Smarter query parsing

Matching inexact queries

• Phrase slop

– “Bridge of London” -> “London Bridge”

• Word-level edit distance with fuzzy queries

– ditsance -> distance

– color -> colour
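Word-level edit distance can be sketched as follows. This is a plain dynamic-programming implementation of the optimal string alignment distance (Levenshtein plus swaps of adjacent characters), written for illustration; it is not Lucene's automaton-based implementation, and the function name `osa_distance` is hypothetical.

```python
def osa_distance(a, b):
    """Edits needed to turn a into b: insert, delete, substitute, or swap neighbours."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ts" -> "st".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(osa_distance("ditsance", "distance"))  # → 1
print(osa_distance("color", "colour"))       # → 1
```

Both slide examples are a single edit away, so a fuzzy query allowing one edit would match them. Without the transposition rule, "ditsance" would count as two edits.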

#3: Offline Classification

Structuring the unstructured

• Record linkage

– Bag of words model

– “More Like This” functionality

• NLP

• Entity extraction
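The bag-of-words model behind record linkage can be sketched with term counts and cosine similarity. This is an illustration of the model only; the "More Like This" feature in Elasticsearch works differently in its details, and the sample records and function names below are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count terms, ignoring order (the bag-of-words representation)."""
    return Counter(word.strip(".,").lower() for word in text.split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical records that may describe the same entity.
record_a = bag_of_words("Acme Corp, 12 Main Street, Springfield")
record_b = bag_of_words("ACME Corp. 12 Main St Springfield")
record_c = bag_of_words("Globex Ltd, 99 Ocean Drive")

print(cosine_similarity(record_a, record_b) > cosine_similarity(record_a, record_c))
```

Pairs scoring above some chosen threshold become candidates for being the same record.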

#4: Everything is searchable

Geo-spatial search

• Distance

• Shape interactions

• Multiple algorithms
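Distance-based geo search boils down to computing the distance between two coordinates and filtering or sorting on it. A minimal sketch using the haversine formula, which is one of several possible algorithms; the mean Earth radius of 6371 km and the function name `haversine_km` are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to guard against tiny floating-point overshoot above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def within(points, center, radius_km):
    """Keep only the points that lie within radius_km of center."""
    return [p for p in points if haversine_km(*center, *p) <= radius_km]


print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
```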


Image search

http://colors.qbox.io/

http://cs.stanford.edu/people/karpathy/deepimagesent

Deep Visual-Semantic Alignments for Generating Image Descriptions

#5: Anomaly detection

The Significant Terms Aggregation

Uncommonly common

Mark Harwood’s talk at

http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common
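"Uncommonly common" means a term is much more frequent in a subset of documents (the foreground) than in the index as a whole (the background). The sketch below scores that gap by multiplying the absolute and the relative change in frequency. It is a simplified illustration rather than the aggregation's actual implementation, and every count in the example is made up.

```python
def significance(fg_count, fg_total, bg_count, bg_total):
    """Score how over-represented a term is in the foreground set."""
    if fg_total == 0 or bg_total == 0 or bg_count == 0:
        return 0.0
    fg_pct = fg_count / fg_total
    bg_pct = bg_count / bg_total
    if fg_pct <= bg_pct:
        return 0.0  # not more common than usual, so not significant
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)


# Made-up counts: 100 foreground documents out of 10,000 in the index.
common_word = significance(fg_count=95, fg_total=100, bg_count=9500, bg_total=10000)
unusual_word = significance(fg_count=20, fg_total=100, bg_count=50, bg_total=10000)

print(common_word, unusual_word)
```

A word that is common everywhere scores zero even though it appears in almost every foreground document, while the rarer word that clusters in the foreground stands out.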

#6: Debugging a distributed system

Queue (Redis)

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at AjaxControlToolkit.ToolkitScriptManager.GetScriptCombineAttributes(Assembly assembly)
   at AjaxControlToolkit.ToolkitScriptManager.IsScriptCombinable(ScriptEntry scriptEntry)
   at AjaxControlToolkit.ToolkitScriptManager.OnResolveScriptReference(ScriptReferenceEventArgs e)
   at System.Web.UI.ScriptManager.RegisterScripts()
   at System.Web.UI.ScriptManager.OnPagePreRenderComplete(Object sender, EventArgs e)
   at System.Web.UI.Page.OnPreRenderComplete(EventArgs e)
   at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
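Log lines like the access-log entry above become searchable once they are split into named fields. A minimal sketch with a regular expression for the combined log format; the field names are chosen for this example, and a real pipeline would more likely use an existing parser than a hand-written pattern.

```python
import re

COMBINED_LOG = re.compile(
    r'(?P<client>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_log_line(line):
    """Turn one combined-format log line into a dict, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event


line = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

event = parse_access_log_line(line)
print(event["status"], event["path"])  # → 200 /apache_pb.gif
```

Each resulting dict can be indexed as one document, so questions such as "all 5xx responses from this client in the last hour" become ordinary queries.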

#7: Distributed git storage

• PoC in C# using libgit2sharp

• https://github.com/synhershko/libgit2sharp.Elasticsearch

• Kudos @nulltoken

Thank you. Questions?

Itamar Syn-Hershko

http://code972.com

@synhershko