Scaling search to a million pages with Solr, Django and Python

Toby White, toby@timetric.com, @tow21

1,079,446!!!

Data store

Big Bad Web

Django

Key-Value Store
Filesystem, Berkeley DB
unstructured data

Foreign Key (RDBMS)
SQLite, MySQL, Postgres, Oracle...
related content through JOINs over structured data

Search Engines

Solr (Lucene), Xapian, (Whoosh)

Denormalized, Inverted Index over unstructured/semi-structured data

Other routes to full-text search

http://www.postgresql.org/docs/8.4/static/textsearch.html
http://code.google.com/p/djangosearch/
http://www.sphinxsearch.com/

Solr: HTTP interface to Lucene

Lucene written by Doug Cutting (also of Hadoop), first release 2001.

Solr in-house CNET project, open-sourced in 2006

Solr + Lucene merged in March 2010

Solr 1.4, Lucene 3.0 released November 2009

Next version - 1.5/3.1/4.0 - not for production use yet.

Solr: Index, composed of Documents

RDBMS: Table, composed of Rows

ALL DOCUMENTS HAVE THE SAME STRUCTURE

• Optional columns
• Denormalized data

[Diagram: relational model. Person (First name, Last name); Magazine (Publication Frequency); links to Person: Author (FK), Editor (FK), Contributor (M2M).]

Document

Fields: Identifier, Entity type, Title, Associated name, ISBN, Pub. Frequency, Default Search

Field options: multiValued, required, uniqueKey, default, copyField
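The step from the relational model to a single document shape can be sketched in Python. This is a minimal illustration, not the speaker's code: the class names, the sample values, and the document field names (`identifier`, `entity_type`, `title`, `associated_name`, `pub_frequency`) are assumptions loosely based on the labels on the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Person:
    first_name: str
    last_name: str

    @property
    def full_name(self) -> str:
        return f"{self.first_name} {self.last_name}"


@dataclass
class Magazine:
    pk: int
    title: str                      # assumed column, not shown on the diagram
    editor: Person                  # FK Person
    contributors: List[Person]      # M2M Person
    publication_frequency: str


def magazine_to_document(mag: Magazine) -> dict:
    """Flatten a Magazine and its related Persons into one flat document.

    The related rows are copied into the document (denormalized), so a
    search never needs a JOIN. Field names here are hypothetical.
    """
    names = [mag.editor.full_name] + [p.full_name for p in mag.contributors]
    return {
        "identifier": f"magazine-{mag.pk}",   # would be the uniqueKey
        "entity_type": "magazine",
        "title": mag.title,
        "associated_name": names,             # would be multiValued
        "pub_frequency": mag.publication_frequency,
    }


# Made-up sample data, purely for illustration.
example = Magazine(
    pk=7,
    title="Example Magazine",
    editor=Person("Example", "Editor"),
    contributors=[Person("Example", "Contributor")],
    publication_frequency="monthly",
)
document = magazine_to_document(example)
```

A book would be flattened into the same field set, leaving `pub_frequency` empty and filling an ISBN field instead, which is what "optional columns" amounts to.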

There is no update, only overwrite!!!

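Document (before): Title "Solar Enterprise Search Server"; Identifier; Pub. Freq.; Associated name "David Smiley, Eric Pugh"

Document (overwritten): Title "Solr 1.4 Enterprise Search Server"; Identifier; Pub. Freq.; Associated name "David Smiley, Eric Pugh"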

Solr can't overwrite without a uniqueKey
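The overwrite rule can be modelled with a toy in-memory index. This is a sketch of the semantics only, under the assumption that an add with a known key replaces the whole stored document; `ToyIndex` is a hypothetical class and does not talk to Solr.

```python
class ToyIndex:
    """Toy model of add semantics: whole-document replace, keyed on uniqueKey."""

    def __init__(self, unique_key=None):
        self.unique_key = unique_key
        self._docs = []

    def add(self, doc):
        if self.unique_key is not None:
            key = doc[self.unique_key]
            # Drop any earlier document with the same key: overwrite, not update.
            self._docs = [d for d in self._docs if d[self.unique_key] != key]
        # Without a unique key there is nothing to match on, so this just appends.
        self._docs.append(dict(doc))

    def all(self):
        return list(self._docs)


keyed = ToyIndex(unique_key="identifier")
keyed.add({
    "identifier": "book-1",
    "title": "Solar Enterprise Search Server",
    "associated_name": ["David Smiley", "Eric Pugh"],
})
# Correcting the title means re-sending the complete document.
# Any field left out of the second add is gone afterwards.
keyed.add({"identifier": "book-1", "title": "Solr 1.4 Enterprise Search Server"})

unkeyed = ToyIndex()
unkeyed.add({"title": "Solar Enterprise Search Server"})
unkeyed.add({"title": "Solr 1.4 Enterprise Search Server"})  # a second document
```

In the keyed index one document remains and it no longer has `associated_name`; in the unkeyed index both versions sit side by side.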

Schema design

What do you want to search on?

What do you want to do with results?

╳query

Field types: text, int, long, float, double, date

Ingest: <xml>, csv

Output: <xml>, {json}, exec. python

Query: URL-escaped Lucene query syntax (yuck)

HTTP HTTP

GET http://localhost:8983/solr/select/?q=searchterm

GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
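How a readable Lucene query turns into a URL like the ones above can be shown with the Python standard library. A minimal sketch: `build_select_url` is a hypothetical helper, and the query string is a shortened variant of the one on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, q, extra_params=()):
    """Return a select URL with the Lucene query and other parameters escaped.

    extra_params is a sequence of (name, value) pairs so that dotted names
    such as "facet.field" can be passed and their order is preserved.
    """
    pairs = [("q", q)] + list(extra_params)
    return base_url + "?" + urlencode(pairs)


url = build_select_url(
    "http://localhost:8983/solr/select/",
    'tags:"united kingdom" AND NOT is_mapreduce:true',
    [("fq", "private:false"), ("rows", 20), ("start", 0)],
)

# Decoding the query string recovers the readable form.
decoded = parse_qs(urlsplit(url).query)
```

`urlencode` escapes `:` as `%3A`, `"` as `%22` and spaces as `+`, which is where the noise in the raw URLs comes from.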

Need ORM equivalent (OIM?)

Haystack:
http://haystacksearch.org/
(cleaves close to Django, not schema-driven)

Sunburnt:
http://timetric.com/about/opensource/#sunburnt
http://github.com/tow/sunburnt

GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29

solr.query(tags="ons:dataseries-fullid=YBUKQA")\
    .query(tags="united kingdom")\
    .filter(private=False)\
    .boost_relevancy(100, is_index=True)\
    .facet_by("tags", mincount=1, limit=20)\
    .paginate(rows=20)
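To make the idea of a chainable query object concrete, here is a toy builder that compiles a chain of calls into request parameters. It is not sunburnt's implementation: `ToyQuery` is hypothetical, it supports only a subset of the chain (no relevancy boosting), it joins clauses with AND, and it does not escape quotes inside values.

```python
from urllib.parse import urlencode


class ToyQuery:
    """Immutable, chainable builder; every call returns a new ToyQuery."""

    def __init__(self, clauses=(), filters=(), extra=()):
        self._clauses = tuple(clauses)
        self._filters = tuple(filters)
        self._extra = tuple(extra)

    @staticmethod
    def _terms(kwargs):
        terms = []
        for field, value in kwargs.items():
            if isinstance(value, bool):
                terms.append(f"{field}:{str(value).lower()}")
            else:
                terms.append(f'{field}:"{value}"')
        return tuple(terms)

    def query(self, **kwargs):
        return ToyQuery(self._clauses + self._terms(kwargs), self._filters, self._extra)

    def filter(self, **kwargs):
        return ToyQuery(self._clauses, self._filters + self._terms(kwargs), self._extra)

    def facet_by(self, field, mincount=0, limit=100):
        extra = (
            ("facet", "true"),
            ("facet.field", field),
            (f"f.{field}.facet.mincount", mincount),
            (f"f.{field}.facet.limit", limit),
        )
        return ToyQuery(self._clauses, self._filters, self._extra + extra)

    def paginate(self, start=0, rows=10):
        extra = (("start", start), ("rows", rows))
        return ToyQuery(self._clauses, self._filters, self._extra + extra)

    def params(self):
        """Return the request parameters as an ordered list of pairs."""
        pairs = [("q", " AND ".join(self._clauses) or "*:*")]
        pairs += [("fq", f) for f in self._filters]
        pairs += list(self._extra)
        return pairs

    def url(self, base_url):
        return base_url + "?" + urlencode(self.params())


toy = (
    ToyQuery()
    .query(tags="ons:dataseries-fullid=YBUKQA")
    .query(tags="united kingdom")
    .filter(private=False)
    .facet_by("tags", mincount=1, limit=20)
    .paginate(rows=20)
)
toy_url = toy.url("http://localhost:8983/solr/select/")
```

Returning a new object from every call keeps partial queries reusable, so a base query can be shared and refined in different directions.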

Faceting, MoreLikeThis, Highlighting, Pagination, Sorting

http://wiki.apache.org/solr/FrontPage

http://packtpub.com/solr-1-4-enterprise-search-server

Scaling to a million pages ...

- talk to the Guardian (Content API)

Decouple read/write
Re-indexing/optimizing strategies
FieldType/Analyzer/Tokenizer tweaks

Decouple read/write

Separate processes - many readers, single write pipeline. Beware multiple writers!

Remember standard DB practice - write to master, read from slave.
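A single write pipeline fed by many producers can be sketched with a queue. This is an illustration under assumptions, not the speaker's setup: the "index" is a plain dict, a "commit" is a batched dict update under a lock, and all names are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel telling the writer to finish


def writer_loop(pending, index, lock, batch_size=50):
    """The only code path that modifies the index; applies writes in batches."""
    batch = []

    def commit():
        with lock:
            for doc in batch:
                index[doc["identifier"]] = doc
        batch.clear()

    while True:
        item = pending.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            commit()
    if batch:
        commit()


def index_documents(docs_by_producer, batch_size=50):
    """Run several producers that only enqueue, plus exactly one writer."""
    pending = queue.Queue()
    index, lock = {}, threading.Lock()
    writer = threading.Thread(
        target=writer_loop, args=(pending, index, lock, batch_size)
    )
    writer.start()

    def produce(docs):
        for doc in docs:
            pending.put(doc)

    producers = [
        threading.Thread(target=produce, args=(docs,)) for docs in docs_by_producer
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    pending.put(STOP)   # all producers are done, let the writer drain and exit
    writer.join()
    return index
```

Producers never touch the index directly, so there is only ever one writer however many processes generate documents.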

Index

Add documents

Commit

Optimize

Warm up facet cache
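The add, commit and optimize steps correspond to XML messages sent to Solr's update handler. The sketch below only builds those message bodies with the standard library; the helper names and the sample document are assumptions, and sending the messages over HTTP is left out.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build an <add> message; list values become repeated fields."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


def commit_message():
    # Makes documents added so far visible to searches.
    return "<commit/>"


def optimize_message():
    # Asks Solr to merge index segments; can be expensive on a large index.
    return "<optimize/>"


body = add_message([{"identifier": "1", "tags": ["a", "b"]}])
```

Batching many documents into one add message and committing once per batch fits the single-writer pipeline; committing after every document tends to be much slower.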

"UK crime: Betting, gaming and lotteries (year ending 5th April)"

Betting
Tokenizer
Analyzer (Porter stemmer)

Belgium, Unemployment rate by gender, Total (BE,T)

BE,T
Tokenizer (whitespace)
Tokenizer (character filter)
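Why the choice of tokenizer matters for a code such as "BE,T" can be seen in a few lines of Python. These functions are rough stand-ins written for illustration, not Solr's tokenizers or filters, and stemming is not modelled.

```python
import re


def word_tokenize(text):
    """Split on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


def whitespace_tokenize(text):
    """Split on whitespace only, keeping punctuation inside tokens."""
    return text.split()


def strip_chars(tokens, chars="()"):
    """Character-filter stand-in: delete the given characters from each token."""
    table = str.maketrans("", "", chars)
    cleaned = (token.translate(table) for token in tokens)
    return [token for token in cleaned if token]


title = "Belgium, Unemployment rate by gender, Total (BE,T)"

split_code = word_tokenize(title)[-2:]
kept_code = strip_chars(whitespace_tokenize(title))[-1]
```

The word tokenizer breaks the code into "BE" and "T", so a search for the exact code cannot match; whitespace tokenizing plus stripping the parentheses keeps "BE,T" as a single token.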

In the small

Understand Solr schemas - build one for your data.
how do you want to query?
how do you want to show results?

In the large

Understand Solr architecture - build around your data-flow.
how/when do you want to read/write?
what shape/characteristics does your corpus have?

Thanks for listening!

questions welcome ...

toby@timetric.com, @tow21
