
Apache Solr: Beyond The Box

Chris Hostetter

2008-11-05

http://people.apache.org/~hossman/apachecon2008us/

http://lucene.apache.org/solr/


Why Are We Here?

Plugins!

● What, How, Where, When, Why?
● Solr Internals In A Nutshell
● Real World Examples
● Testing
● Questions


What, How, Where, Who, When, Why?


What Is Solr (To Users)

● Information Retrieval Application
● Index/Query Via HTTP
● Comprehensive HTML Administration Interfaces
● Scalability - Efficient Replication To Other Solr Search Servers
● Highly Configurable Caching
● Flexible And Adaptable With XML Configuration
    Customizable Request Handlers And Response Writers
    Data Schema With Dynamic Fields And Unique Keys
    Analyzers Created At Runtime From Tokenizers And TokenFilters

What Is Solr (To Developers)

● Information Retrieval Application
● Java5 WebApp (WAR) With A Web Services-ish API
● Extensible Plugin Architecture
● MVC-ish Framework Around The Java Lucene Search Library
● Allows Custom Business Logic And Text Analysis Rules To Live Close To The Data
● Abstracts Away The Tricky Stuff:
    Index Consistency
    Data Replication
    Cache Management

How It Started

When/Why To Write A Plugin

“X can be done more efficiently closer to the data.”

OR

“To force X for all clients.”


Solr Internals In A Nutshell


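50,000' View

[Diagram: requests enter either over HTTP through SolrDispatchFilter or from Java through EmbeddedSolrServer. A CoreContainer holds several SolrCores. Inside a SolrCore, a SolrRequestHandler works on the SolrQueryRequest / SolrQueryResponse pair, and a QueryResponseWriter writes out the response.]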

MVC-ish

● SolrRequestHandler ... A Controller
    handleRequest( SolrQueryRequest, SolrQueryResponse )
● SolrQueryRequest ... An Event (++)
    Input Parameters
    List of ContentStreams
    Maintains SolrCore & SolrIndexSearcher References
● SolrQueryResponse ... Model
    Tree of "Simple" Objects and DocLists
● ResponseWriter ... View
    write(Writer, SolrQueryRequest, SolrQueryResponse)

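Hello World

// Package names as of Solr 1.3.
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class HelloWorld extends RequestHandlerBase {
  public void handleRequestBody(SolrQueryRequest req,
                                SolrQueryResponse rsp) {
    // Read the request parameters; getInt returns null if "age" is absent.
    String name = req.getParams().get("name");
    Integer age = req.getParams().getInt("age");
    // Everything added to the response is rendered by the chosen response writer.
    rsp.add("greeting", "Hello " + name);
    rsp.add("yourage", age);
  }
  // Metadata shown in the admin interface.
  public String getVersion() { return "$Revision:$"; }
  public String getSource() { return "$Id:$"; }
  public String getSourceId() { return "$URL:$"; }
  public String getDescription() { return "Says Hello"; }
}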

Hello World Output

http://localhost:8983/solr/hello?name=Hoss&age=32&wt=xml

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
      </lst>
      <str name="greeting">Hello Hoss</str>
      <int name="yourage">32</int>
    </response>

http://localhost:8983/solr/hello?name=Hoss&age=32&wt=json

    { "responseHeader":{ "status":0, "QTime":1},
      "greeting":"Hello Hoss",
      "yourage":32
    }

Types Of Plugins

● SolrRequestHandler
    SearchComponent
    QParserPlugin
    ValueSourceParser
● SolrHighlighter
    SolrFragmenter
    SolrFormatter
● UpdateRequestProcessorFactory
● QueryResponseWriter
● Similarity(Factory)
● Analyzer
    TokenizerFactory
    TokenFilterFactory
● FieldType
● SolrCache
    CacheRegenerator
● SolrEventListener
● UpdateHandler

Italics: Only One Per SolrCore
Color: Likelihood Of Needing To Write Your Own


Real World Examples


Tibetan And Himalayan Digital Library Tools

Tsheg Analysis Factories

   // Wraps the custom tokenizer so it can be named in a field type.
   public class TshegBarTokenizerFactory
                extends BaseTokenizerFactory {
     public TokenStream create(Reader input) {
       return new TshegBarTokenizer(input);
     }
   }

   // Wraps the custom token filter the same way.
   public class EdgeTshegTrimmerFactory
                extends BaseTokenFilterFactory {
     public TokenStream create(TokenStream input) {
       return new EdgeTshegTrimmer(input);
     }
   }

DFLL

DFLL: Faceted Browsing

DFLL Category Metadata

● Category ID and Label: 3126 == “Tablet PCs”
● Category Query: tablet_form:[* TO *]
● Ordered List of Facets
    Facet ID and Label: 500016 == “OS Provided”
    Facet Display Info: Count vs. Alphabetical, etc...
    Ordered List of Constraints
    ● Constraint ID and Label: 111536 == “Apple OS X”
    ● Constraint Query: os:("OSX10.1" "OSX10.2" ...)

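DfllHandler Pseudo-Code

// Look up the category's metadata document and parse (or fetch cached) metadata.
Document catMetaDoc = searcher.getFirstMatch(catDocId)
Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
// Work on a copy so per-request counts are not written into the cached instance.
m = m.clone()
// One search yields both the ordered page of results and the full unordered set.
DocListAndSet results =
              searcher.getDocListAndSet(m.catQuery, ...)
response.add("products", results.docList)
// Count each constraint against the full result set.
foreach (Facet f : m) {
  foreach (Constraint c : f) {
    c.setCount(searcher.numDocs(c.query,
                                results.docSet))
  }
}
response.add("metadata", m.asSimpleObjects())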

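Conceptual Picture

[Diagram: getDocListAndSet(Query,Query[],Sort,offset,n) is called with the category query tablet_form:[* TO *] and the sort "price asc". It returns a DocList (section of ordered results) and a DocSet (unordered set of all results). numDocs() then combines each constraint query with the DocSet to produce a count. Constraint queries shown: os:("OSX10.1" "OSX10.2" ...), memory:[1GB TO *], proc_manu:Intel, proc_manu:AMD, price:[0 TO 500], price:[500 TO 1000], manu:Dell, manu:HP, manu:Lenovo. Counts shown: 594, 382, 247, 689, 104, 92, 75. Both outputs feed the Query Response.]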
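The counting step in the pseudo-code and the picture can be sketched in plain Java. This is only an illustration under stated assumptions: a DocSet is modelled as a java.util.BitSet of document ids, and FacetCountSketch, numDocs, countAll, docs and every id below are hypothetical stand-ins, not Solr classes or data from the slides.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: a DocSet is modelled as a BitSet of document ids.
class FacetCountSketch {

    // Number of documents that match the constraint AND are in the result set.
    static int numDocs(BitSet constraintMatches, BitSet resultSet) {
        BitSet both = (BitSet) constraintMatches.clone(); // leave the inputs untouched
        both.and(resultSet);
        return both.cardinality();
    }

    // Convenience builder for a set of document ids.
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    // One count per constraint, in the order the constraints were given.
    static Map<String, Integer> countAll(Map<String, BitSet> constraints, BitSet resultSet) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : constraints.entrySet()) {
            counts.put(e.getKey(), numDocs(e.getValue(), resultSet));
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet results = docs(1, 2, 3, 5, 8);            // made-up ids matching the category query
        Map<String, BitSet> constraints = new LinkedHashMap<>();
        constraints.put("manu:Dell", docs(1, 2, 9));     // made-up matches for each constraint
        constraints.put("manu:HP", docs(3, 4));
        System.out.println(countAll(constraints, results)); // prints {manu:Dell=2, manu:HP=1}
    }
}
```

The point of the design is that the main query runs once; every constraint count is then a cheap set intersection against the same unordered result set.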

DFLL Response

<result name="products" numFound="394" start="0">...</result>
<lst name="metadata">
 ...
 <lst name="500016">
   <int name="rankDir">0</int><int name="datatype">1</int>
   <int name="rating">88</int><str name="name">OS provided</str>
   <lst name="values">
     <lst name="111536">
       <int name="valueId">111536</int>
       <str name="label">Apple Mac OS X</str>
       <str name="rating">50</str>
       <int name="count">1</int>
     </lst>
     ...
   </lst>

DfllCacheRegenerator

SolrCore “auto-warms” all SolrCaches when new versions of the index are opened for searching (after a commit).

 public interface CacheRegenerator {
   public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                 SolrCache newCache,
                                 SolrCache oldCache,
                                 Object oldKey,
                                 Object oldVal)
          throws IOException;
 }
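The auto-warming idea can be sketched without Solr on the classpath. The code below is a simplified model, not the Solr API: Regenerator mirrors the shape of CacheRegenerator, the new searcher is reduced to a java.util.function.Function, caches are plain maps, and AutowarmSketch, warm and the sample data are all hypothetical. It assumes the boolean result means "keep warming".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of auto-warming; none of these types are Solr classes.
class AutowarmSketch {

    // Same shape as CacheRegenerator, with the searcher reduced to a lookup function.
    interface Regenerator<K, V> {
        boolean regenerateItem(Function<K, V> newSearcher,
                               Map<K, V> newCache,
                               Map<K, V> oldCache,
                               K oldKey,
                               V oldVal);
    }

    // Replays every entry of the old cache through the regenerator.
    static <K, V> Map<K, V> warm(Map<K, V> oldCache,
                                 Function<K, V> newSearcher,
                                 Regenerator<K, V> regenerator) {
        Map<K, V> newCache = new LinkedHashMap<>();
        for (Map.Entry<K, V> entry : oldCache.entrySet()) {
            boolean keepGoing = regenerator.regenerateItem(
                    newSearcher, newCache, oldCache, entry.getKey(), entry.getValue());
            if (!keepGoing) {
                break; // the regenerator asked to stop warming
            }
        }
        return newCache;
    }

    public static void main(String[] args) {
        Map<String, Integer> staleCounts = new LinkedHashMap<>();
        staleCounts.put("tablets", 3);   // made-up values cached against the old index
        staleCounts.put("laptops", 7);

        Map<String, Integer> freshIndex = new LinkedHashMap<>();
        freshIndex.put("tablets", 4);    // made-up values after the commit
        freshIndex.put("laptops", 7);

        // Ignore the stale value and recompute each key against the new "searcher".
        Regenerator<String, Integer> recount = (searcher, fresh, stale, key, staleVal) -> {
            fresh.put(key, searcher.apply(key));
            return true;
        };

        System.out.println(warm(staleCounts, freshIndex::get, recount)); // prints {tablets=4, laptops=7}
    }
}
```

A custom regenerator is useful when the cached values are application objects (such as parsed category metadata) that Solr cannot rebuild on its own.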


DataImportHandler


Builds and incrementally updates indexes based on configured SQL or XPath queries.

<entity name="item" pk="ID" query="select * from ITEM"
   deltaQuery="select ID ... where
               ITEMDATE > '${dataimporter.last_index_time}'">
 <field column="NAME" name="name" />
 ...
 <entity name="f" pk="ITEMID"
    query="select DESC from FEATURE where ITEMID='${item.ID}'"
    deltaQuery="select ITEMID from FEATURE where
                UPDATEDATE > '${dataimporter.last_index_time}'"
    parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
  <field name="features" column="DESC" />
  ...

DataImportHandler Plugins

● DataSource
    FileDataSource
    HttpDataSource
    JdbcDataSource
● EntityProcessor
    FileListEntityProcessor
    SqlEntityProcessor
    ● CachedSqlEntityProcessor
    XPathEntityProcessor
● Transformer
    DateFormatTransformer
    NumberFormatTransformer
    RegexTransformer
    ScriptTransformer
    TemplateTransformer


LocalSolr


LocalUpdateProcessorFactory

● Uses lat/lon fields to compute Cartesian Tier info
● Adds grid boxes of various sizes as new fields

 <updateRequestProcessorChain name="standard" default="true">
   <processor class="....LocalUpdateProcessorFactory">
      <str name="latField">lat</str>
      <str name="lngField">lng</str>
      <int name="startTier">9</int>
      <int name="endTier">17</int>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

LocalSolr Cartesian Tiers

LocalSolrQueryComponent

● Use in place of default QueryComponent
● Augments regular query with DistanceQuery and DistanceSortSource
● Can use a custom SolrCache for distances for commonly used points

  <searchComponent name="geoquery"
                   class="....LocalSolrQueryComponent" />
  <requestHandler name="geo" class="solr.SearchHandler">
     <arr name="components">
       <str>geoquery</str>
       ...
     </arr>
  </requestHandler>


GuardianComponent

GuardianComponent Goal

● When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than The Query
● Increase Precision At The Expense Of Recall

    q = Dance Party

  Dance Party (1995)
  Dance Party (2005) (V)
  Dance Party, USA (2006)
  Workout Party... Let's Dance! (2004) (V)
  Shrek in the Swamp Karaoke Dance Party (2001) (V)

Implementation

● SearchComponent
● Configured To Run After QueryComponent
● Post-Processes DocList
    Pick MAX_LEN Based On Number Of Query Clauses
    Re-analyze Stored “title” Field
    Eliminate Any Results With More Than MAX_LEN Tokens In “title”
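The post-processing step above can be sketched in plain Java. This is a rough model under loudly stated assumptions, not the GuardianComponent itself: whitespace splitting stands in for re-analyzing the stored title, the slides do not say how MAX_LEN is derived so "query clauses plus a slack value" is invented for illustration, and GuardianSketch, tokenCount, maxLen and filter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the length filter; no Solr classes are involved.
class GuardianSketch {

    // Assumption: whitespace splitting stands in for the real analyzer.
    static int tokenCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Assumption: the real rule is not given in the slides; this one is made up.
    static int maxLen(int queryClauses, int slack) {
        return queryClauses + slack;
    }

    // Keeps only the titles with at most MAX_LEN tokens, preserving result order.
    static List<String> filter(List<String> titles, String query, int slack) {
        int limit = maxLen(tokenCount(query), slack);
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (tokenCount(title) <= limit) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Dance Party (1995)",
                "Dance Party (2005) (V)",
                "Dance Party, USA (2006)",
                "Workout Party... Let's Dance! (2004) (V)",
                "Shrek in the Swamp Karaoke Dance Party (2001) (V)");
        // With a slack of 2 the limit is 4 tokens, so only the first three titles remain.
        for (String title : filter(titles, "Dance Party", 2)) {
            System.out.println(title);
        }
    }
}
```

Because the filter runs after the query, pagination and numFound no longer line up exactly with what the client sees, which is one motivation for the alternate approach on the next slide.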

Alternate Approach

● <copyField source="title" dest="titleLen"/>
● Write TokenCountingTokenFilter For titleLen
● Write MaxLenQParserPlugin
    Subclass Your Favorite QParser
    Pick MAX_LEN Based On Number Of Query Clauses From Super
    Add +titleLen:[* TO MAX_LEN] Clause To Query
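The last step of the alternate approach, adding the length clause, can be shown at the query-string level. A real QParser subclass would more likely add a clause to the parsed query object; this string version is only a sketch, and MaxLenClauseSketch and withMaxLen are hypothetical names.

```java
// Hypothetical string-level sketch of adding the titleLen clause.
class MaxLenClauseSketch {

    // Makes the user's query mandatory and adds a mandatory upper bound on titleLen.
    static String withMaxLen(String userQuery, int maxLen) {
        return "+(" + userQuery + ") +titleLen:[* TO " + maxLen + "]";
    }

    public static void main(String[] args) {
        // MAX_LEN of 4 is an arbitrary value chosen for the example.
        System.out.println(withMaxLen("Dance Party", 4)); // prints +(Dance Party) +titleLen:[* TO 4]
    }
}
```

Here the filtering happens inside the query itself, so the index does the work and the result counts stay consistent.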


Testing Your Plugins

AbstractSolrTestCase

import org.apache.solr.util.AbstractSolrTestCase;

public class YourTest extends AbstractSolrTestCase {
  ...
  public void testSomeStuff() throws Exception {
    // Index two documents and commit so they become searchable.
    assertU(adoc("id", "7",    "description", "Travel Guide",
                 "title", "Paris in 10 Days"));
    assertU(adoc("id", "42",   "description", "Cool Book",
                 "title", "Hitch Hiker's Guide to the Galaxy"));
    assertU(commit());
    // Both match "guide"; the title match is boosted, so doc 42 ranks first.
    assertQ("multi qf", req("q",  "guide",
                            "qt", "dismax",
                            "qf", "title^2 description^1")
            ,"//*[@numFound='2']"
            ,"//result/doc[1]/int[@name='id'][.='42']"
            ,"//result/doc[2]/int[@name='id'][.='7']"
            );
  }
}

Questions?

http://lucene.apache.org/solr/
