Getting started faster with LucidWorks for Solr

From Search to Found Grant Ingersoll Eran Yaniv Thursday, August 6, 2009

Upload: ethan-ray

Post on 18-Feb-2016

DESCRIPTION

* Open source search with Solr/Lucene gives you the power to turn a wide range of information into fast, useful, relevant results!
* LucidWorks for Solr gives you a tested, release-stable certified distribution of open source search with enhanced tools and installation for building search apps quickly and reliably.

http://www.lucidimagination.com/How-We-Can-Help/webinar-from-search-to-found

TRANSCRIPT

Page 1: Getting started faster with LucidWorks for Solr

From Search to Found

Grant Ingersoll

Eran Yaniv

Thursday, August 6, 2009

Page 2: Getting started faster with LucidWorks for Solr

Lucid Imagination, Inc.

Agenda

Introductions

Apache Solr background

LucidWorks for Solr

Installing LucidWorks for Solr

Searching your domain with Solr

Putting Solr into production

Questions

Page 3: Getting started faster with LucidWorks for Solr

Introductions

Grant Ingersoll

Lucene/Solr committer

Co‐founder Apache Mahout project

Co‐author of upcoming “Taming Text”

Eran Yaniv

Lucid Solutions Manager

Background

Product management

Enterprise Development/IT

Information Retrieval

Page 4: Getting started faster with LucidWorks for Solr

Apache Solr Background

Lucene‐based Search server plus many enterprise tools

REST‐like API

Faceting

Distributed/Replication

Easy configuration

Many other features: 

http://lucene.apache.org/solr/features.html

Created at CNET by Yonik Seeley (Lucid co‐founder)

Donated to the Apache Software Foundation in 2006

Solr 1.4 release coming soon

Page 5: Getting started faster with LucidWorks for Solr

Solr Basics

Content is modeled via Documents and Fields

Content can be text, integers, floats, dates, custom

Analysis can be employed to alter content before indexing

Controlled via schema.xml

Searches are supported through a wide range of Query options

Keyword

Terms

Phrases

Wildcards, other

Many clients available: HTTP, Java, Ruby, PHP, .NET, etc.

Page 6: Getting started faster with LucidWorks for Solr

Solr Basics

Schema

Define Field Types, Fields, field metadata and Analysis

<field name="name" type="text" indexed="true" stored="true"/>

Copy Fields, Dynamic Fields, Similarity overrides

Solr Config

Define low‐level Lucene controls

Specify how clients interact with Solr via Request Handlers (“mini servlets”)

Configure highlighting, spell checking, admin, etc. 
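To make the schema items above concrete, here is a minimal Python sketch that parses a small schema fragment and shows how a field name resolves to either an explicit field or a dynamic field pattern. Only the "name" field comes from the slide; the other fields, the "*_s" pattern, and the copyField rule are hypothetical. Using fnmatch for the pattern match is a simplification, not how Solr itself resolves dynamic fields.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; only the "name" field is taken from the slide.
SCHEMA = """
<schema name="example" version="1.1">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="name" type="text" indexed="true" stored="true"/>
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  </fields>
  <copyField source="name" dest="text"/>
</schema>
"""


def load_schema(xml_text):
    """Return (explicit fields, dynamic field patterns, copy field rules)."""
    root = ET.fromstring(xml_text)
    fields = {f.get("name"): f.attrib for f in root.iter("field")}
    dynamic = [(f.get("name"), f.attrib) for f in root.iter("dynamicField")]
    copies = [(c.get("source"), c.get("dest")) for c in root.iter("copyField")]
    return fields, dynamic, copies


def resolve(field_name, fields, dynamic):
    """Return the definition that applies to field_name, or None if nothing matches."""
    if field_name in fields:
        return fields[field_name]
    for pattern, definition in dynamic:
        # Simplification: glob matching stands in for Solr's own pattern rules.
        if fnmatch.fnmatchcase(field_name, pattern):
            return definition
    return None


fields, dynamic, copies = load_schema(SCHEMA)
```

In this sketch, resolve("color_s", fields, dynamic) falls through to the "*_s" pattern, while a name that matches nothing returns None.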

Page 7: Getting started faster with LucidWorks for Solr

LucidWorks for Solr

Based on Apache Solr 1.3 plus

Installer for Linux and Windows

Specific patches from Solr: faceting improvements, other

30-day free “Get Started” program

Bundled:

JRE

Apache Tomcat

Optimized KStemmer implementation

Luke

Lucid Gaze for Solr

Page 8: Getting started faster with LucidWorks for Solr

Getting Started

1. Install LucidWorks

2. Model your domain

3. Index your content

4. Test

5. Deploy

Page 9: Getting started faster with LucidWorks for Solr

Install LucidWorks

Free certified distribution

Introduced to many new users

New users frequently use “Get Started”

Over 50% of the cases: “How to install”

Installer

Simple

Plugins and enhancements

Updateable

Support for Linux, Windows (Mac?)

UI and headless

Page 10: Getting started faster with LucidWorks for Solr

Installer Overview

Public repository

Beta: password protected, early adopters

Dev: internal

Solr installer client

Install/uninstall certified version

Check/install updates

Install/update components

Upgrade to platform

Solr installer service

Hosted on lucidimagination.com

Manages repositories

Page 11: Getting started faster with LucidWorks for Solr
Page 12: Getting started faster with LucidWorks for Solr

Starting LucidWorks

cd <INSTALL_PATH>/lucidworks

./lucidworks.sh start (*NIX)

.\lucidworks.bat start (Windows)

Point your browser at http://localhost:8983/solr/

Page 13: Getting started faster with LucidWorks for Solr

Master Your Domain with Solr

Get to know your content

Get to know your users

Model in Solr

Page 14: Getting started faster with LucidWorks for Solr

Modeling your Content

Collection/Aggregate

Examine collection level stats, like:

MIME Types

Number of Docs

Update rates

Languages present

Much, much more

Look for patterns and relationships

Identify helpful resources

Page 15: Getting started faster with LucidWorks for Solr

Modeling your Content

Randomly sample a set of your documents

Look for:

Common structures like titles, tables, columns, etc.

Important metadata

Tokenization issues

Try out in http://localhost:8983/solr/admin/analysis.jsp

Importance Indicators

May also look at paragraph, sentence, word and character issues

Often useful to run docs through the indexing process iteratively

Page 16: Getting started faster with LucidWorks for Solr

Understanding your Users

UI Expectations

Speed and Relevance

Search and Discovery

Search

Faceting

Did you mean?

Similar Pages (More Like This)

Highlighting

Document/Results Clustering

Page 17: Getting started faster with LucidWorks for Solr

Build your Application

Map your content into Documents and Fields via the Solr schema

Set up your Solr access patterns in solrconfig.xml

Index your content 

Search

Page 18: Getting started faster with LucidWorks for Solr

Indexing

Many Clients

Java, PHP, Ruby, etc.

See example/exampledocs

Pull from DB, others

Upload CSV, Solr XML

<add><doc>
  <field name="id">EN7800GTX/2DHTV/256M</field>
  <field name="manu">ASUS Computer Inc.</field>
  <field name="cat">electronics</field>
</doc></add>
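As an illustration of the Solr XML format above, the following Python sketch builds an <add> message from plain dictionaries using only the standard library. The helper name build_add_xml is hypothetical, and the sketch stops short of sending anything: posting the result to Solr's update handler and issuing a commit is left to the client you choose.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field: value} dicts as a Solr <add> message.

    A list or tuple value becomes one <field> element per item,
    which is how multi-valued fields appear in Solr XML.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")


# The document from the slide, expressed as a dict.
payload = build_add_xml([{
    "id": "EN7800GTX/2DHTV/256M",
    "manu": "ASUS Computer Inc.",
    "cat": "electronics",
}])
```

ElementTree escapes characters such as & and < in field values, which is easy to get wrong when concatenating XML strings by hand.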

Page 19: Getting started faster with LucidWorks for Solr

Search

Clients also support search through API calls

HTTP support by definition:

http://localhost:8983/solr/select/?q=*:*&fl=score,id

http://localhost:8983/solr/select/?q=name:iPod&fl=score,id
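The two URLs above can also be assembled programmatically. This is a minimal Python sketch, assuming the default host and port shown on the slide; select_url is a hypothetical helper name, and the rows parameter in the usage note is simply an example of an extra Solr parameter.

```python
from urllib.parse import urlencode

# Base URL as shown on the slide; adjust host and port for your install.
SOLR_SELECT = "http://localhost:8983/solr/select/"


def select_url(query, fields=("score", "id"), **extra):
    """Build a select URL with q, fl, and any extra parameters."""
    params = {"q": query, "fl": ",".join(fields)}
    params.update(extra)
    # safe= leaves ':', '*' and ',' unescaped so the URL reads like the slide's.
    return SOLR_SELECT + "?" + urlencode(params, safe=":*,")


match_all = select_url("*:*")
ipod = select_url("name:iPod")
```

Extra parameters pass through as keyword arguments, for example select_url("name:iPod", rows=5). The resulting string can be opened with any HTTP client, such as urllib.request.urlopen.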

Page 20: Getting started faster with LucidWorks for Solr

Load Testing

Solr scales quite well, but you should still load test to establish performance specs for your application

Apache JMeter can be a good start

Ideally, play back old logs at the rate they occurred

As with any Java application, keep an eye on JVM factors like heap size and garbage collection
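Playing back logs "at the rate they occurred" comes down to preserving the gaps between requests. The Python sketch below turns a list of logged queries into a replay schedule; the (ISO timestamp, query) log format and the replay_schedule name are assumptions for illustration, since real log formats vary by container and configuration.

```python
from datetime import datetime


def replay_schedule(entries, speedup=1.0):
    """Turn (iso_timestamp, query) pairs into (delay_seconds, query) pairs.

    Each delay is the wait before issuing that query, measured from the
    previous one. speedup > 1.0 compresses the gaps to simulate heavier load.
    """
    parsed = sorted((datetime.fromisoformat(ts), q) for ts, q in entries)
    schedule = []
    previous = None
    for when, query in parsed:
        if previous is None:
            delay = 0.0
        else:
            delay = (when - previous).total_seconds() / speedup
        schedule.append((delay, query))
        previous = when
    return schedule


# Hypothetical log excerpt.
log = [
    ("2009-08-06T10:00:00", "ipod"),
    ("2009-08-06T10:00:02", "solr"),
    ("2009-08-06T10:00:03", "lucene"),
]
schedule = replay_schedule(log)
```

A driver would sleep for each delay and then issue the query. A load-testing tool such as JMeter covers the same ground with far more reporting, so this is only meant to show the idea.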

Page 21: Getting started faster with LucidWorks for Solr

Improving Performance

Search

Avoid wildcards, or at least require prefix

Catch-all field for “generic” search

Choose proper faceting method for the situation

Replicate/Shard

Indexing

Minimal analysis to achieve results (speeds indexing)

Multi‐threaded, batch submission

Usual Suspects:  CPU, Memory, Disk, JVM

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/
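For the "multi-threaded, batch submission" item under Indexing above, here is a minimal Python sketch of the pattern: group documents into batches and hand the batches to a small thread pool. The send_batch callable is hypothetical. In practice it would post one batch to Solr and return the number of documents sent, and the batch size and worker count shown are placeholders to tune under load testing.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def index_all(docs, send_batch, batch_size=100, workers=4):
    """Send batches concurrently and return the total reported by send_batch.

    send_batch is a caller-supplied function (hypothetical here) that
    submits one batch and returns how many documents it sent.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_batch, batches(docs, batch_size)))
```

Passing len as send_batch gives a dry run that only counts documents, for example index_all(range(250), len).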

Page 22: Getting started faster with LucidWorks for Solr

Relevance Testing

Often overlooked until there is a problem; instead, plan for it upfront

Types:

Ad hoc

Log based/ QA driven

Standard Collections and Queries (TREC)

Best practice: take the top 50 or so queries by volume, plus ~20 random queries, and rate the top ten results as relevant, somewhat relevant, not relevant, or embarrassing
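The best practice above can be sketched in a few lines of Python: pick the most frequent queries from a log, add a random sample of the remainder, and tally the ratings judges assign. The function names, the lowercasing used to normalize queries, and the fixed random seed are all assumptions for illustration.

```python
import random
from collections import Counter

# The four rating levels named on the slide.
RATINGS = ("relevant", "somewhat relevant", "not relevant", "embarrassing")


def pick_test_queries(query_log, top_n=50, random_n=20, seed=0):
    """Return (top queries by volume, random sample of the remaining queries)."""
    counts = Counter(q.strip().lower() for q in query_log if q.strip())
    top = [q for q, _ in counts.most_common(top_n)]
    rest = sorted(set(counts) - set(top))
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    sampled = rng.sample(rest, min(random_n, len(rest)))
    return top, sampled


def summarize(judgments):
    """Count ratings across {query: [rating for each judged result]}."""
    totals = Counter()
    for query, ratings in judgments.items():
        for rating in ratings:
            if rating not in RATINGS:
                raise ValueError(f"unknown rating for {query!r}: {rating!r}")
            totals[rating] += 1
    return totals
```

Tracking the "embarrassing" count between releases is one simple way to spot regressions early.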

Page 23: Getting started faster with LucidWorks for Solr

Troubleshooting Relevance in LucidWorks for Solr

Add &debugQuery=true to any query: provides info on why a doc scored the way it did, plus other info about the query

http://localhost:8983/solr/select/?q=*:*&debugQuery=true

Solr's built-in LukeRequestHandler

Luke, the Lucene index browser

lucidworks/luke.(sh|bat)
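When debugging many queries, it can help to add the debugQuery parameter mechanically instead of editing URLs by hand. This Python sketch does so with the standard library; with_debug is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug(url):
    """Return url with debugQuery=true set exactly once."""
    parts = urlsplit(url)
    # Drop any existing debugQuery so repeated calls do not duplicate it.
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "debugQuery"]
    params.append(("debugQuery", "true"))
    # safe= leaves ':', '*' and ',' unescaped for readability.
    return urlunsplit(parts._replace(query=urlencode(params, safe=":*,")))


debug_url = with_debug("http://localhost:8983/solr/select/?q=*:*")
```

The response then carries the extra debugging section alongside the normal results.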

Page 24: Getting started faster with LucidWorks for Solr

Improving your Search

Common Techniques

Analysis:

Lowercase, stemming, synonyms, stopwords, compound analysis (e.g. STR-AV220 -> STR AV 220)

Boosts (query and index)

Faceting and other navigational aids

Spell Checking
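To show what the analysis techniques above do to text, here is a toy Python analyzer that lowercases, splits compounds such as STR-AV220, drops stopwords, maps synonyms, and applies a crude plural-stripping stem. It is a stand-in for illustration only: the stopword and synonym lists are made up, the stemmer is not KStem, and a real deployment would configure these steps as an analysis chain in schema.xml.

```python
import re

# Illustrative lists only; real ones come from your own content and users.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}
SYNONYMS = {"tv": "television"}


def crude_stem(token):
    """Strip a trailing plural 's'. A placeholder, not a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Lowercase, split on letter/digit boundaries, then filter and normalize."""
    # Runs of letters and runs of digits become separate tokens,
    # so "STR-AV220" yields STR, AV, 220.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+|\d+", text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return [crude_stem(t) for t in tokens]
```

Because the same function runs on documents and queries, analyze("STR-AV220") and analyze("str av 220") produce the same tokens, which is what lets either form find the other.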

Page 25: Getting started faster with LucidWorks for Solr

Improving your Queries

Disjunction Max Query (more in a minute)

Better stop word handling

Phrase Queries and other Position‐based Queries

“quick red fox”~3

Recency/Freshness

Invisible Queries

Relevance Feedback and “More Like This”

Fake Queries

Page 26: Getting started faster with LucidWorks for Solr

Disjunction Max Query

Useful when searching across multiple fields

Example (thanks to Chuck Williams)

Doc1: t: elephant, d: elephant

Doc2: t: elephant, d: albino

Query: t:elephant d:elephant t:albino d:albino

Each Doc scores the same for BooleanQuery

DisjunctionMaxQuery scores Doc2 higher
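The Doc1/Doc2 example can be worked through with a toy scoring model in Python. The big assumption here is that every matching clause scores exactly 1.0, which ignores term frequency, inverse document frequency, norms, and coordination. The sketch shows only the structural difference: the boolean form sums every matching clause, while the disjunction-max form takes, per term, the best field score plus an optional tie-breaker fraction of the others.

```python
FIELDS = ("t", "d")

DOCS = {
    "Doc1": {"t": "elephant", "d": "elephant"},
    "Doc2": {"t": "elephant", "d": "albino"},
}

# The four clauses on the slide are these two terms across both fields.
TERMS = ("elephant", "albino")


def match(doc, field, term):
    """Toy clause score: 1.0 if the term occurs in the field, else 0.0."""
    return 1.0 if term in doc[field].split() else 0.0


def boolean_score(doc, terms):
    """Sum every matching field:term clause."""
    return sum(match(doc, f, t) for t in terms for f in FIELDS)


def dismax_score(doc, terms, tie=0.0):
    """Per term: best field score plus tie * (the remaining field scores)."""
    total = 0.0
    for term in terms:
        scores = [match(doc, f, term) for f in FIELDS]
        best = max(scores)
        total += best + tie * (sum(scores) - best)
    return total
```

Under this model both documents get a boolean score of 2.0, but Doc1's disjunction-max score is 1.0 against 2.0 for Doc2, because matching "elephant" in a second field no longer counts as much as matching a second query term.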

Page 27: Getting started faster with LucidWorks for Solr

Advanced Techniques

Payloads

http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/

DelimitedPayloadTokenFilter (better name?)

Add payloads inline:  foo|2.3 bar|5.4

BoostingFunctionTermQuery (Lucene 2.9, Solr 1.4)

Natural Language Processing

Named Entity Extraction (OpenNLP, Stanford NER, Commercial)

Sentiment Analysis

Event Detection

Relationship Identification
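Returning to the inline payload syntax listed under Payloads above (foo|2.3 bar|5.4), this Python sketch shows the idea of splitting each token into a term and a numeric payload. It is a toy stand-in, not the Lucene filter: the function names are hypothetical, and averaging the payloads for a term is just one possible way to turn them into a boost.

```python
def parse_payload_tokens(text, delimiter="|"):
    """Split whitespace-separated tokens into (term, payload) pairs.

    Tokens without the delimiter get a payload of None.
    """
    tokens = []
    for raw in text.split():
        term, sep, payload = raw.partition(delimiter)
        tokens.append((term, float(payload) if sep else None))
    return tokens


def average_payload(tokens, term):
    """Average the payloads attached to `term`, or None if it has none."""
    values = [p for t, p in tokens if t == term and p is not None]
    return sum(values) / len(values) if values else None


tokens = parse_payload_tokens("foo|2.3 bar|5.4")
```

A payload-aware query can then weight a match on "bar" more heavily than one on "foo", using whatever function of the payloads suits the application.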

Page 28: Getting started faster with LucidWorks for Solr

Solr in Production

Hardware

Monitoring

Lucid Gaze for Solr

Nagios, Hyperic, Port monitoring

Troubleshooting

Solr Community: ad hoc support

Lucid Support: commercial support with SLAs

Growth

Query Volume

Index Size

Page 29: Getting started faster with LucidWorks for Solr

Lucid Gaze for Solr

Monitor Solr Request Handlers

Comes with LucidWorks for Solr

http://localhost:8983/gaze

Page 30: Getting started faster with LucidWorks for Solr

Page 31: Getting started faster with LucidWorks for Solr

Resources

Websites

http://www.lucidimagination.com

http://search.lucidimagination.com

http://lucene.apache.org/solr

Solr Support and Training

http://www.lucidimagination.com/How-We-Can-Help

SLAs, Public, Private and Online Training for Solr and Lucene

Mailing Lists

solr‐[email protected]