Introduction to Apache Solr

Upload: alexandre-rafalovitch

Post on 27-Jan-2015

DESCRIPTION

Introduction to Solr, presented at the Bangkok meetup in April 2014: http://www.meetup.com/bkk-web/events/172090992/ Covers high-level use cases for Solr. Demos include support for the Thai language (with a GitHub link to the source). Includes slides showcasing the Solr ecosystem, as well as a couple of ideas for possible Solr-specific learning projects.

TRANSCRIPT

Page 1: Introduction to Apache Solr

Introduction to Apache Solr

Software is eating the world

The search is eating the software

April 2014

Page 2: Introduction to Apache Solr

Alexandre Rafalovitch

www.outerthoughts.com

Page 3: Introduction to Apache Solr

Web search engines are quite sophisticated

Page 4: Introduction to Apache Solr

Page 5: Introduction to Apache Solr

But the real search needs are much DEEPER and BROADER

Page 6: Introduction to Apache Solr

Searching code

Page 7: Introduction to Apache Solr

Searching people and companies

Page 8: Introduction to Apache Solr

Searching products

Page 9: Introduction to Apache Solr

Searching library material

Page 10: Introduction to Apache Solr

Searching languages

Page 11: Introduction to Apache Solr

Understanding full-text search

SELECT * FROM database WHERE field LIKE '%word%'

This DOES NOT scale

Instead:

break text into tokens

domain-specific processing (e.g. lower-casing)

build fast-access structures

algorithms for term, phrases, proximity search
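
The steps above can be sketched with a tiny inverted index. This is a minimal illustration, not how Lucene or Solr implement it; the function names (tokenize, build_index, search) and the sample documents are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into tokens and lower-case them (the processing step)."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Build the fast-access structure: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents
docs = {
    1: "Solr is a search server",
    2: "Lucene is a search library",
    3: "Search is eating the software",
}
index = build_index(docs)
print(sorted(search(index, "search")))          # [1, 2, 3]
print(sorted(search(index, "SEARCH library")))  # [2]
```

A lookup touches only the entries for the query tokens, instead of scanning every row the way LIKE '%word%' does.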

Page 12: Introduction to Apache Solr

Basic search engine features

Search (Duh!): keyword, phrase, field-specific

Positive and negative terms

Sort: relevancy, recency

Pagination

Compact summary in results

SPEED
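
A rough sketch of two of these features: positive and negative terms (+term / -term) and pagination with start and rows. The helper names and sample data are hypothetical, and sorting by document id stands in for real relevancy ranking.

```python
def parse_query(query):
    """Split a query into required and excluded lower-cased terms."""
    required, excluded = [], []
    for raw in query.lower().split():
        if raw.startswith("-"):
            term, bucket = raw[1:], excluded
        else:
            term, bucket = raw.lstrip("+"), required
        if term:
            bucket.append(term)
    return required, excluded


def search(docs, query, start=0, rows=10):
    """Return one page of ids matching all required and no excluded terms."""
    required, excluded = parse_query(query)
    hits = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if all(t in words for t in required) and not any(t in words for t in excluded):
            hits.append(doc_id)
    hits.sort()  # placeholder ordering; a real engine sorts by score
    return hits[start:start + rows]


# Hypothetical sample documents
docs = {
    1: "apache solr search",
    2: "elastic search",
    3: "solr cloud search",
    4: "sphinx search",
}
print(search(docs, "+search -elastic"))       # [1, 3, 4]
print(search(docs, "search", start=2, rows=2))  # [3, 4]
```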

Page 13: Introduction to Apache Solr

Advanced search engine features

Facet/taxonomy-based navigation with live counts

Language-specific processing

Domain-specific text processing (WiFi = Wi-Fi = WIFI)

Geographic search

More-like-this, did-you-mean, autocomplete

Scaling/Clustering

NOT web crawling - different, but related
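
Two of these features in miniature: domain-specific normalization (WiFi = Wi-Fi = WIFI) and facet counts computed over the current hits. A simplified sketch only; the normalize rule, the field names and the product data are assumptions for illustration.

```python
from collections import Counter


def normalize(token):
    """Collapse hyphen and case variants: WiFi, Wi-Fi, WIFI -> wifi."""
    return token.replace("-", "").lower()


def search(products, term):
    """Return products whose name contains the term after normalization."""
    wanted = normalize(term)
    return [
        p for p in products
        if wanted in {normalize(t) for t in p["name"].split()}
    ]


def facet_counts(hits, field):
    """Count how many of the current hits fall under each value of a field."""
    return Counter(p[field] for p in hits)


# Hypothetical product catalogue
products = [
    {"name": "WiFi router", "category": "network"},
    {"name": "Wi-Fi extender", "category": "network"},
    {"name": "WIFI camera", "category": "video"},
    {"name": "USB cable", "category": "accessories"},
]
hits = search(products, "wi-fi")
print(len(hits))                                     # 3
print(facet_counts(hits, "category").most_common())  # [('network', 2), ('video', 1)]
```

The counts are "live" in the sense that they are recomputed for each result set, so the navigation only offers categories that actually contain hits.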

Page 14: Introduction to Apache Solr

Search engine solutions?

Solr

Elasticsearch

Xapian

Sphinx

Zoie

Groonga

Searchdaimon

{F}lexSearch

Algolia (SaaS)

Searchify (SaaS)

ForageJS

Lunr.js

FACT-Finder

dtSearch

MarkLogic

Verity

Fast

Most databases

…AND MORE

Page 15: Introduction to Apache Solr

Used with permission from SemaText

Open Source Search Evolution

Page 16: Introduction to Apache Solr

Secret Ingredient - Lucene

Solr

Elasticsearch

Zoie

SwiftType

PyLucene (Python wrapper)

Lucene.net (C# port)

Scalable, high-performance indexing

Incremental indexing

Full-text search

Information-Retrieval algorithms

Implemented in Java

Written in 1999, still going strong

Page 17: Introduction to Apache Solr

Secret Ingredient - Solr

Certified distributions

LucidWorks

HelioSearch

Big Data platforms

Cloudera

Hortonworks HDP

Hosted and SaaS

Amazon CloudSearch

WebSolr, SolrHQ, SearchBox

Lucene full-text-search

XML and REST config

Schema/Schemaless

SolrCloud (clustering)

Caching

Near real-time

Rich-document indexing (Tika inside)

Plugins, components, processors

Page 18: Introduction to Apache Solr

Solr Ecosystem sample

Drupal

Project Blacklight

LuxDB

SolrMeter

CrafterCMS

Typo3

Magento

HippoCMS

ColdFusion

SolrNet

DataStax

Dovecot

NGData Lily

Basho Riak

YaCy

Apache ManifoldCF

Apache Camel

Franz AllegroGraph

BitNami Solr Stack

Carrot2

Broadleaf Commerce

Cloudera CDK

CodeLibs Fess (フェス)

Splunk

Alfresco

Rosette by BasisTech

Luwak by Flax

Quepid by OSC

TwigKit

SPM by SemaText

SILK by LucidWorks

Banana (O/S Solr Kibana)

Page 19: Introduction to Apache Solr

DEMO Time

Page 20: Introduction to Apache Solr

DEMO - Basic

Unzip

Go to example directory

Run Solr

Import some documents from example docs

grep -l store *.xml | xargs ./post.sh

Show off Solr 4 admin panel
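
Once documents are posted, they can be queried over HTTP. The sketch below only builds the request URL so it runs without a server. The host and port match the demo's localhost:8983; the core name collection1 is an assumption about the default Solr 4 example setup, and select_url is a hypothetical helper, so adjust both to your install.

```python
from urllib.parse import urlencode

# Assumption: default core name in the Solr 4 example directory
SOLR_BASE = "http://localhost:8983/solr/collection1"


def select_url(query, start=0, rows=10, base=SOLR_BASE):
    """Build a /select URL using Solr's q, wt, start and rows parameters."""
    params = {"q": query, "wt": "json", "start": start, "rows": rows}
    return base + "/select?" + urlencode(params)


print(select_url("store"))
# http://localhost:8983/solr/collection1/select?q=store&wt=json&start=0&rows=10
```

With Solr running, opening the printed URL in a browser (or fetching it with urllib.request.urlopen) should return the matching documents as JSON.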

Page 21: Introduction to Apache Solr

DEMO - Browse handler

Restart Solr with -Dsolr.clustering.enabled=true

Visit http://localhost:8983/solr/browse/

Show off

Search

Facets - Categories and Ranges

Spatial/Geo-distance

Clusters

Page 22: Introduction to Apache Solr

DEMO - Thai specific

Index Thai and English text

Search in English, Thai, Auto-transliterated Thai

Show Analysis screen

Code at: https://github.com/arafalov/solr-thai-test

Page 23: Introduction to Apache Solr

Getting into Solr

Page 24: Introduction to Apache Solr

Start for free

Download, unzip, cd example; java -jar start.jar

Go through basic tutorial in docs/tutorial.html

Copy example directory, modify schema.xml until happy

If coming from Elasticsearch, look at example-schemaless

Do NOT follow this path to production

example schema is a kitchen sink!!!

Page 25: Introduction to Apache Solr

Accelerate your learning

Buy my book - seriously. That’s what it’s for

All code/data is at: https://github.com/arafalov/solr-indexing-book

Buy Solr In Action - just published and is a great reference

Use my www.solr-start.com resource and join the mailing list

Join solr-user mailing list - full of advanced hackers

Watch Lucid Revolution videos for background

Start helping out on Stack Overflow #solr

Blog what you learned, tweet with #Solr

Page 26: Introduction to Apache Solr

Pick a project - make it happen

Solr + Dart => Better search experience for Dart packages

Solr consultants discovery website

Visualise Solr search request - step by step

Solr + your language => is client library up to date?

ToDoMVC for Solr clients

Package LARGE dataset for others (e.g. Project Gutenberg)

Rebuild lernu.net Esperanto dictionary with Solr backend

Page 27: Introduction to Apache Solr

With Solr, how far can I go?

Cloudera (Big Data) has over US$1,000,000,000 in investments - opportunities?

8M+ searches/day, 40 languages, 100ms NRT, 1024 cores, 256 shards, 32 servers on #solr at Bloomberg http://bit.ly/1jmG72G (via @FlaxSearch)

Page 28: Introduction to Apache Solr

Other Search-related books

Designing the Search Experience: The Information Architecture of Discovery - by a TwigKit creator +1

Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld - see also Quepid

Enterprise Search by Martin White

Page 29: Introduction to Apache Solr

Alexandre Rafalovitch

www.outerthoughts.com