Scaling search to a million pages with Solr, Django and Python

Toby White (@tow21)

Upload: tow21

Post on 15-Jan-2015


DESCRIPTION

A talk given to DJUGL on the 26th July 2010, describing and introducing Solr, and discussing how we use it at Timetric to drive navigation across over a million dataseries.

TRANSCRIPT

Page 1: Scaling search to a million pages with Solr, Python, and Django

Scaling search to a million pages with Solr, Django and Python

Toby White (@tow21)

Page 2: Scaling search to a million pages with Solr, Python, and Django

1,079,446!!!

Page 3: Scaling search to a million pages with Solr, Python, and Django
Page 4: Scaling search to a million pages with Solr, Python, and Django

Data store

Big Bad Web

Django

Page 5: Scaling search to a million pages with Solr, Python, and Django

Data store

Big Bad Web

Django

Page 6: Scaling search to a million pages with Solr, Python, and Django

Key-Value Store

Filesystem, Berkeley DB, MySQL

(diagram labels: unstructured, structured)

Page 7: Scaling search to a million pages with Solr, Python, and Django

Foreign Key (RDBMS)

SQLite, MySQL, Postgres, Oracle, ...

related content through JOINs

over structured data

Page 8: Scaling search to a million pages with Solr, Python, and Django

Search Engines

Solr (Lucene), Xapian, (Whoosh)

Denormalized, Inverted Index

over unstructured/semi-structured data
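To make "inverted index" concrete, here is a minimal sketch in plain Python. It is not how Lucene stores its index; the tokenizer (lowercase, split on whitespace) and the sample documents are assumptions chosen only to show the idea of mapping each term to the documents that contain it.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, *terms):
    """Return the ids of documents containing every term (an AND query)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents, purely for illustration.
documents = {
    1: "UK unemployment rate",
    2: "UK inflation rate",
    3: "US unemployment claims",
}
index = build_inverted_index(documents)
matches = search(index, "uk", "rate")  # documents 1 and 2
```

A lookup touches only the posting sets for the query terms, which is why this structure suits search over unstructured text better than scanning rows.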

Page 9: Scaling search to a million pages with Solr, Python, and Django

Other routes to full-text search

http://www.postgresql.org/docs/8.4/static/textsearch.html

http://code.google.com/p/djangosearch/

http://www.sphinxsearch.com/

Page 10: Scaling search to a million pages with Solr, Python, and Django

Solr: HTTP interface to Lucene

Lucene written by Doug Cutting (Hadoop), first release 2001.

Solr in-house CNET project, open-sourced in 2006

Solr + Lucene merged in March 2010

Solr 1.4, Lucene 3.0 released November 2009

Next version - 1.5/3.1/4.0 - not for production use yet.

Page 11: Scaling search to a million pages with Solr, Python, and Django

Solr: Index, composed of Documents

RDBMS: Table, composed of Rows

ALL DOCUMENTS HAVE THE SAME STRUCTURE

Page 12: Scaling search to a million pages with Solr, Python, and Django

• Optional columns
• Denormalized data

Relational models (diagram):

Book: Title, ISBN, Author (FK Person), Contributor (M2M Person)

Magazine: Title, ISSN, Publication Frequency, Editor (FK Person)

Person: First name, Last name

Solr Document (diagram):

Identifier, Entity type, Title, Pub. Frequency, Associated name

Field options: uniqueKey, required, multiValued, default, copyField

Title and Associated Name: copyField into Default Search
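The flattening on this slide can be sketched as two small functions that turn a relational record into one flat document. This is only an illustration: the lowercase field names, the identifier format, and the input dictionaries are assumptions, and the ISBN and magazine values are placeholders rather than real data.

```python
def book_to_document(book):
    """Flatten a book and its related people into a single document."""
    return {
        "identifier": "book-%s" % book["isbn"],
        "entity_type": "book",
        "title": book["title"],
        # Author and contributors collapse into one multi-valued field.
        "associated_name": [book["author"]] + list(book.get("contributors", [])),
    }


def magazine_to_document(magazine):
    """Flatten a magazine; pub_frequency is an optional column books lack."""
    return {
        "identifier": "magazine-%s" % magazine["issn"],
        "entity_type": "magazine",
        "title": magazine["title"],
        "pub_frequency": magazine["publication_frequency"],
        "associated_name": [magazine["editor"]],
    }


book_doc = book_to_document({
    "isbn": "0000000000",  # placeholder, not the real ISBN
    "title": "Solr 1.4 Enterprise Search Server",
    "author": "David Smiley",
    "contributors": ["Eric Pugh"],
})
magazine_doc = magazine_to_document({
    "issn": "0000-0000",  # placeholder
    "title": "Example Monthly",  # hypothetical magazine
    "publication_frequency": "monthly",
    "editor": "A. N. Editor",  # hypothetical person
})
```

Both documents share one structure; a book simply leaves the optional pub_frequency field empty, and the JOINs to Person disappear because the names are copied into each document.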

Page 13: Scaling search to a million pages with Solr, Python, and Django

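There is no update, only overwrite!!!

Document, first version:
Entity type: Book; Identifier; Pub. Freq.; Title: Solar Enterprise Search Server; Associated name: David Smiley, Eric Pugh

Document, overwritten:
Entity type: Book; Identifier; Pub. Freq.; Title: Solr 1.4 Enterprise Search Server; Associated name: David Smiley, Eric Pugh

Solr can't overwrite without a uniqueKey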
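The overwrite rule can be mimicked with a toy in-memory index. This is a sketch of the semantics only, not Solr code: `ToyIndex` and the field names are hypothetical, and the title typo is the one shown on the slide.

```python
class ToyIndex:
    """In-memory stand-in for an index keyed on a uniqueKey field."""

    def __init__(self, unique_key="identifier"):
        self.unique_key = unique_key
        self.documents = {}

    def add(self, document):
        # Without the uniqueKey there is nothing to match on (KeyError here).
        key = document[self.unique_key]
        # The whole stored document is replaced, never merged.
        self.documents[key] = dict(document)


index = ToyIndex()
index.add({
    "identifier": "book-1",
    "title": "Solar Enterprise Search Server",  # typo to be corrected
    "associated_name": ["David Smiley", "Eric Pugh"],
})
# Fixing the title means re-sending every field, not just the changed one.
index.add({
    "identifier": "book-1",
    "title": "Solr 1.4 Enterprise Search Server",
    "associated_name": ["David Smiley", "Eric Pugh"],
})
```

Had the second call sent only the identifier and title, the associated names would have been lost, which is why the write pipeline needs access to the full source record.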

Page 14: Scaling search to a million pages with Solr, Python, and Django

<field name="title" type="text" indexed="true" stored="true" required="true" multiValued="false"/>

Schema design

What do you want to search on?

What do you want to do with results?

╳query

Field types: text, int, long, float, double, date
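One way to read the two schema questions is that "search on" maps to indexed and "do with results" maps to stored. The helper below is a hedged sketch of that reading using only the standard library; `make_field` and its parameter names are hypothetical, while the attribute names come from the field definition on the slide.

```python
import xml.etree.ElementTree as ET


def make_field(name, field_type, searchable, displayed,
               required=False, multi_valued=False):
    """Build a schema <field> element from the two design questions."""
    def flag(value):
        return "true" if value else "false"

    return ET.Element("field", {
        "name": name,
        "type": field_type,
        "indexed": flag(searchable),  # do you want to search on it?
        "stored": flag(displayed),    # do you want it back in results?
        "required": flag(required),
        "multiValued": flag(multi_valued),
    })


title = make_field("title", "text", searchable=True, displayed=True,
                   required=True)
title_xml = ET.tostring(title, encoding="unicode")
```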

Page 15: Scaling search to a million pages with Solr, Python, and Django

Solr

Ingest (HTTP): <xml>, csv

Output (HTTP): <xml>, {json}, exec. python

Query: URL-escaped Lucene query syntax (yuck)

Page 16: Scaling search to a million pages with Solr, Python, and Django

GET http://localhost:8983/solr/select/?q=searchterm

GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
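URLs like the ones above need not be escaped by hand. A small sketch with the standard library's `urlencode` follows; `select_url` is a hypothetical helper, and the query string is a shortened version of the one on the slide.

```python
from urllib.parse import urlencode


def select_url(base_url, query, **params):
    """Return a select URL with the Lucene query and parameters escaped."""
    # q goes first; the remaining parameters are sorted for a stable URL.
    pairs = [("q", query)] + sorted(params.items())
    return base_url + "?" + urlencode(pairs)


url = select_url(
    "http://localhost:8983/solr/select/",
    'tags:"united kingdom" AND NOT is_mapreduce:true',
    fq="private:false",
    rows=20,
)
```

The colon, quotes and spaces in the Lucene syntax become %3A, %22 and +, which is what makes the raw URLs so hard to read.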

Page 17: Scaling search to a million pages with Solr, Python, and Django

Need ORM equivalent (OIM?)

Haystack:

http://haystacksearch.org/

(cleaves close to Django, not schema-driven)

Sunburnt:

http://timetric.com/about/opensource/#sunburnt

http://github.com/tow/sunburnt

Page 18: Scaling search to a million pages with Solr, Python, and Django

GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29

    solr.query(tags="ons:dataseries-fullid=YBUKQA")\
        .query(tags="united kingdom")\
        .filter(private=False)\
        .boost_relevancy(100, is_index=True)\
        .facet_by("tags", mincount=1, limit=20)\
        .paginate(rows=20)
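To show why a chained interface is easier to live with than the raw URL, here is a deliberately tiny builder that compiles method calls into request parameters. It is not sunburnt's implementation or API: the `Query` class, its quoting rules, and the `params` method are all assumptions made for illustration, and it covers only query, filter and paginate.

```python
class Query:
    """Hypothetical immutable, chainable query builder (illustration only)."""

    def __init__(self, clauses=(), filters=(), rows=None):
        self.clauses = tuple(clauses)
        self.filters = tuple(filters)
        self.rows = rows

    @staticmethod
    def _term(field, value):
        # Booleans become true/false; other values are quoted as phrases.
        # Embedded quotes are not escaped in this sketch.
        if isinstance(value, bool):
            return "%s:%s" % (field, str(value).lower())
        return '%s:"%s"' % (field, value)

    def query(self, **terms):
        new = tuple(self._term(f, v) for f, v in sorted(terms.items()))
        return Query(self.clauses + new, self.filters, self.rows)

    def filter(self, **terms):
        new = tuple(self._term(f, v) for f, v in sorted(terms.items()))
        return Query(self.clauses, self.filters + new, self.rows)

    def paginate(self, rows):
        return Query(self.clauses, self.filters, rows)

    def params(self):
        """Return (name, value) pairs ready for URL encoding."""
        params = [("q", " AND ".join(self.clauses) or "*:*")]
        params += [("fq", f) for f in self.filters]
        if self.rows is not None:
            params.append(("rows", self.rows))
        return params


q = (Query()
     .query(tags="ons:dataseries-fullid=YBUKQA")
     .query(tags="united kingdom")
     .filter(private=False)
     .paginate(rows=20))
```

Each call returns a new object, so a base query can be reused and extended per view without mutating shared state.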

Page 19: Scaling search to a million pages with Solr, Python, and Django
Page 20: Scaling search to a million pages with Solr, Python, and Django

Faceting
MoreLikeThis
Highlighting
Pagination
Sorting

http://wiki.apache.org/solr/FrontPage

http://packtpub.com/solr-1-4-enterprise-search-server
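Of the features listed above, faceting is the one that drives navigation, so a small sketch of what a facet count is may help. This is plain Python over a list of dictionaries, not Solr's algorithm; the `mincount` and `limit` names mirror the facet parameters seen in the earlier URLs, and the sample documents are hypothetical.

```python
from collections import Counter


def facet_counts(documents, field, mincount=1, limit=20):
    """Count values of a multi-valued field across matching documents."""
    counts = Counter()
    for document in documents:
        counts.update(document.get(field, []))
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(value, n) for value, n in ranked if n >= mincount][:limit]


# Hypothetical search results, each with a list of tags.
results = [
    {"tags": ["uk", "gdp"]},
    {"tags": ["uk"]},
    {"tags": ["us"]},
]
facets = facet_counts(results, "tags")
```

Each (value, count) pair can then be rendered as a link that adds a filter on that value, narrowing the result set.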

Page 21: Scaling search to a million pages with Solr, Python, and Django

Scaling to a million pages ...

- talk to the Guardian (Content API)

Decouple read/write
Re-indexing/optimizing strategies
FieldType/Analyzer/Tokenizer tweaks

Page 22: Scaling search to a million pages with Solr, Python, and Django

Decouple read/write

Separate processes - many readers, single write pipeline. Beware multiple writers!

Remember standard DB practice: write to master, read from slave.
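The "single write pipeline" idea can be sketched with a queue and one worker thread: any number of producers submit documents, but only one thread ever touches the index. This is an illustration under assumptions, not the pipeline used at Timetric; `SingleWriter` is hypothetical and a plain dictionary stands in for the index.

```python
import queue
import threading


class SingleWriter:
    """Funnel all writes through one thread so writers never compete."""

    def __init__(self, index):
        self.index = index
        self.pending = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def submit(self, document):
        """Safe to call from any thread; the write happens later."""
        self.pending.put(document)

    def _run(self):
        while True:
            document = self.pending.get()
            if document is None:  # sentinel: stop after draining the queue
                break
            self.index[document["identifier"]] = document

    def close(self):
        """Flush everything submitted so far and stop the writer thread."""
        self.pending.put(None)
        self.thread.join()


index = {}
writer = SingleWriter(index)
for n in range(3):
    writer.submit({"identifier": "series-%d" % n})
writer.close()
```

Readers would query replicas of the index while this one process owns all writes, which is the master/slave split described above.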

Page 23: Scaling search to a million pages with Solr, Python, and Django

Index lifecycle (diagram): Add documents -> Commit -> Optimize -> Fast index -> Warm up facet cache
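The add, commit and optimize steps correspond to XML messages posted to Solr's update handler. The sketch below builds those messages with the standard library; the update URL is an assumption based on the localhost address used in the earlier slides, `reindex` is a hypothetical helper, and cache warming is left out.

```python
import urllib.request
import xml.etree.ElementTree as ET

UPDATE_URL = "http://localhost:8983/solr/update"  # assumed location


def add_message(documents):
    """Build an <add> message; list values become repeated fields."""
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", {"name": name})
                field.text = str(item)
    return ET.tostring(add, encoding="utf-8")


def post(body, url=UPDATE_URL):
    """POST one XML message to the update handler."""
    request = urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"})
    with urllib.request.urlopen(request) as response:
        return response.status


def reindex(batches, send=post):
    """Add every batch, then commit once and optimize once at the end."""
    for batch in batches:
        send(add_message(batch))
    send(b"<commit/>")
    send(b"<optimize/>")
```

Passing a different `send` function (for example a list's `append`) lets the sequence of messages be inspected without a running Solr instance. Committing once per run rather than per document keeps the expensive steps off the hot path.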

Page 24: Scaling search to a million pages with Solr, Python, and Django
Page 25: Scaling search to a million pages with Solr, Python, and Django
Page 27: Scaling search to a million pages with Solr, Python, and Django

In the small

Understand Solr schemas - build one for your data.
how do you want to query?
how do you want to show results?

In the large

Understand Solr architecture - build around your data-flow.
how/when do you want to read/write?
what shape/characteristics does your corpus have?

Page 28: Scaling search to a million pages with Solr, Python, and Django

Thanks for listening!

questions welcome ...

@tow21