BigTable Search Presentation to Austin PUG


Upload: percy-wegmann

Post on 16-Apr-2017


Full Text Search on Google App Engine with BigTable Search

at Austin PUG February 10, 2010

Percy Wegmann presenting

The Problem

You want to be able to do full-text search (you know, like on Google.com)

Against data stored in a Python Google App Engine application

Without using an external server/service

Full-text Search Basic Features

Let's say that you want to search a repository of 2 documents containing the following text:

swan lake performed live at the Met

swans are crowding ducks out of the local lake

A basic search engine should respond to queries as follows:

"swan lake" - returns both documents (inexact matching)
"swan dive" - returns both documents (boolean OR matching)
"swan lake duck" - returns document 2 first (ranking)
"crowds" - returns document 2 (stemming)
"of the" - returns neither (stopword removal)

And it should do all of this quickly
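
The stemming and stopword behaviour listed above can be sketched in a few lines. This is only an illustration: the stopword list and the suffix-stripping `naive_stem` function are assumptions made up for this example, standing in for a real stemmer such as Porter2.

```python
import re

# Hypothetical, tiny stopword list chosen just for this example.
STOPWORDS = {"a", "are", "at", "of", "out", "the"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter2."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Lowercase, split into words, drop stopwords, stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]

doc2 = "swans are crowding ducks out of the local lake"
print(analyze(doc2))      # ['swan', 'crowd', 'duck', 'local', 'lake']
print(analyze("crowds"))  # ['crowd']
print(analyze("of the"))  # []
```

Because "crowds" and "crowding" both reduce to "crowd", the query matches document 2, while "of the" reduces to no terms at all and so matches nothing.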

More Advanced Features

Starts-with matching (for type-ahead completion)

Indexing of non-text fields (numeric, datetime, references, etc.)

Term weighting (e.g. rank matches on title higher than on body)

Faceted Search (like Amazon or Cnet.com)

Background indexing (to speed up inserts)

Thesaurus (mallard would match duck)

Phrase matching (exact phrases rank higher than disjointed combinations of words)
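
As an illustration of the first feature in this list, starts-with matching can be done as a range lookup over sorted terms. This is a sketch under assumptions: the `starts_with` helper and the sample terms are invented for this example, and the `"\uffff"` upper bound assumes terms contain no character at or above that code point.

```python
import bisect

def starts_with(sorted_terms, prefix):
    """Return all terms in a sorted list that begin with prefix."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # Everything starting with prefix sorts below prefix + a very high character.
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

terms = sorted(["duck", "lake", "live", "local", "swan"])
print(starts_with(terms, "l"))   # ['lake', 'live', 'local']
print(starts_with(terms, "li"))  # ['live']
```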

The Contenders

                        Stemming & Stopword Removal   Boolean OR   Ranking
Datastore Query         -                             -            -
SearchableModel         stopword removal only         -            -
Bill Katz' Searchable   x                             -            -
gae-search              x                             -            -
BigTable Search         x                             x            x

What The Others Are Missing

Boolean OR/Ranking: without these, multi-term queries are almost pointless

Faceted Search: users are accustomed to this from sites like Amazon

Scalability: none of the others uses an inverted index!

Introducing BigTable Search

Switch to demo

How it Works: Inverted Index

Index is organized by search term. This is how the big boys (Lucene, Sphinx, etc.) do it.

Example from Wikipedia

Documents
1: it is what it is
2: what is it
3: it is a banana

Index (stores pointers to documents)
a: {3}
banana: {3}
is: {1, 2, 3}
it: {1, 2, 3}
what: {1, 2}

To search for "it", we only have to grab a single row from the index, yielding {1, 2, 3}

To search for "what or banana" we grab two rows and take the union, yielding {1, 2, 3}

To search for "what and banana" we grab two rows and take the intersection, yielding {}

To rank a search for "banana or it" we take the union and count how many query terms each document matches, yielding the order 3, 1, 2 (document 3 matches both terms)
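
The lookups above can be sketched with Python sets. This is a minimal illustration using the Wikipedia example; the function names (`build_index`, `search_or`, `search_and`, `search_ranked`) are made up for this sketch and are not BigTable Search's API, and the in-memory dictionary stands in for index rows in the datastore.

```python
from collections import Counter, defaultdict

docs = {
    1: "it is what it is",
    2: "what is it",
    3: "it is a banana",
}

def build_index(documents):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def search_or(index, terms):
    """Union of the index rows for the given terms."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

def search_and(index, terms):
    """Intersection of the index rows for the given terms."""
    rows = [index.get(term, set()) for term in terms]
    return set.intersection(*rows) if rows else set()

def search_ranked(index, terms):
    """Documents ordered by number of matching terms, ties broken by id."""
    counts = Counter()
    for term in terms:
        counts.update(index.get(term, set()))
    return sorted(counts, key=lambda doc_id: (-counts[doc_id], doc_id))

index = build_index(docs)
print(sorted(index["it"]))                     # [1, 2, 3]
print(search_or(index, ["what", "banana"]))    # {1, 2, 3}
print(search_and(index, ["what", "banana"]))   # set()
print(search_ranked(index, ["banana", "it"]))  # [3, 1, 2]
```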

The Pain of Updating

Remember our documents:

Documents
1: it is what it is
2: what is it
3: it is a banana

To add the first document, we have to update 3 index entries (it, is, what). The bigger the documents get, the worse it gets.

Worse, multiple documents are represented in a single index entry, so concurrency becomes a problem too: try locking on the index entry for "the", and your entire system becomes effectively single-threaded!

The Solution to Updating

Asynchronous Updates

[Diagram: asynchronous indexing flow]
1.1: put doc into the DataStore
1.2: request indexing (calc queue)
2: queue terms (merge queues)
3: merge to inverted index
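
The flow in the diagram can be simulated in plain Python. This is a rough sketch, not BigTable Search's code and not the App Engine Task Queue API: the in-memory `deque` objects, the single merge queue (the diagram shows several), and all function names are assumptions made for illustration.

```python
from collections import defaultdict, deque

datastore = {}                      # stand-in for the DataStore
calc_queue = deque()                # doc ids waiting to be analyzed
merge_queue = deque()               # (term, doc_id) pairs waiting to be merged
inverted_index = defaultdict(set)   # term -> ids of documents containing it

def put_document(doc_id, text):
    """Steps 1.1 and 1.2: store the document, then request indexing."""
    datastore[doc_id] = text
    calc_queue.append(doc_id)

def run_calc_worker():
    """Step 2: turn each queued document into queued terms."""
    while calc_queue:
        doc_id = calc_queue.popleft()
        for term in set(datastore[doc_id].split()):
            merge_queue.append((term, doc_id))

def run_merge_worker():
    """Step 3: merge queued terms into the inverted index."""
    batches = defaultdict(set)
    while merge_queue:
        term, doc_id = merge_queue.popleft()
        batches[term].add(doc_id)
    for term, doc_ids in batches.items():
        inverted_index[term] |= doc_ids  # one update per index row

put_document(1, "it is what it is")
put_document(2, "what is it")
print(dict(inverted_index))         # {} (nothing is indexed until the workers run)

run_calc_worker()
run_merge_worker()
print(sorted(inverted_index["what"]))  # [1, 2]
```

The point of the design is that `put_document` returns without touching any index row, and the merge step batches all pending changes for a term into a single update of that row.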

Code (at a Glance)

Data Model

Queues

Code

The Better Answer?

BigTable Search suffers from some significant limitations:

- Fast search engines use custom file storage formats for performance; BigTable Search does not have this option and is consequently not fast
- No phrase matching
- No synonym or semantic matching

Google is working on a full-text search solution (feature 217 on Issues List, In Progress, no ETA, session scheduled for Google I/O in May)

Resources

pyporter2 (used by BigTable Search and others for stemming)
http://github.com/mdirolf/pyporter2

SearchableModel
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/search/__init__.py

Bill Katz' Simple Full-text Search for App Engine
http://www.billkatz.com/2009/6/Simple-Full-Text-Search-for-App-Engine

gae-search
http://gae-full-text-search.appspot.com/

BigTable Search
http://code.google.com/p/bigtablesearch/

Google's Upcoming Full-text Search (feature 217)
http://code.google.com/p/googleappengine/issues/detail?id=217