practical full text search

Practical full-text search in MySQL

Bill Karwin

MySQL University • 2009-12-3

Me

• 20+ years experience

• Application/SDK developer

• Support, Training, Proj Mgmt

• C, Java, Perl, PHP

• SQL maven

• MySQL, PostgreSQL, InterBase

• Zend Framework

• Oracle, SQL Server, IBM DB2, SQLite

• Community contributor

Full Text Search

In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.

http://www.flickr.com/photos/tryingyouth/

http://creativecommons.org/licenses/by/2.5/





Test Data

• StackOverflow.com data dump, exported October 2009

• 1.5 million tuples

• ~1 Gigabyte

[Figure: StackOverflow ER diagram, with a callout marking the searchable text]

Naive Searching

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

- Jamie Zawinski

Accuracy issue

• Irrelevant or false matches: a search for 'one' also matches 'money', 'prone', etc.:

body LIKE '%one%'

• Regular expressions in MySQL support escapes for word boundaries:

body RLIKE '[[:<:]]one[[:>:]]'
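The difference between the two predicates can be tried outside the database. The Java sketch below is only an analogy: `String.contains` stands in for `LIKE '%one%'`, and `java.util.regex` with `\b` stands in for MySQL's `[[:<:]]` and `[[:>:]]` markers. The class name, method names, and sample sentences are made up for illustration.

```java
import java.util.regex.Pattern;

public class WordMatchDemo {
    // \b marks a word boundary in java.util.regex; matching is case-insensitive.
    private static final Pattern WHOLE_WORD_ONE =
        Pattern.compile("\\bone\\b", Pattern.CASE_INSENSITIVE);

    // Analogous to body LIKE '%one%': true for any occurrence of the substring.
    public static boolean substringMatch(String body) {
        return body.toLowerCase().contains("one");
    }

    // Analogous to body RLIKE '[[:<:]]one[[:>:]]': true only for the whole word.
    public static boolean wholeWordMatch(String body) {
        return WHOLE_WORD_ONE.matcher(body).find();
    }

    public static void main(String[] args) {
        String[] bodies = {
            "Pick one answer",
            "Money is no object",
            "Prone to errors"
        };
        for (String body : bodies) {
            System.out.println(body
                + " | substring=" + substringMatch(body)
                + " | wholeWord=" + wholeWordMatch(body));
        }
    }
}
```

All three sample sentences pass the substring test, but only the first contains 'one' as a word of its own.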

Performance issue

• LIKE with wildcards:

SELECT * FROM Posts
WHERE body LIKE '%performance%'

time: 22 sec

• POSIX regular expressions:

SELECT * FROM Posts
WHERE body RLIKE 'performance'

time: 108 sec

Why so slow?

CREATE TABLE telephone_book (
  full_name VARCHAR(50)
);

CREATE INDEX name_idx ON telephone_book (full_name);

INSERT INTO telephone_book VALUES
  ('Riddle, Thomas'),
  ('Thomas, Dean');

Why so slow?

• Search for all with last name “Thomas”

SELECT * FROM telephone_book
WHERE full_name LIKE 'Thomas%'

uses index

• Search for all with first name “Thomas”

SELECT * FROM telephone_book
WHERE full_name LIKE '%Thomas'

doesn’t use index

☞ Indexes don’t help searching for substrings
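One way to picture why: an ordinary index keeps values in sorted order, so a prefix pattern maps to one contiguous range, while a pattern with a leading wildcard can match anywhere in that order. The Java sketch below uses a `TreeSet` as a stand-in for the index on `full_name`. It is an illustration under that assumption, not a description of MySQL internals, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixIndexDemo {
    // Sorted set standing in for the index on telephone_book(full_name).
    public static TreeSet<String> buildIndex() {
        TreeSet<String> index = new TreeSet<>();
        index.add("Riddle, Thomas");
        index.add("Thomas, Dean");
        return index;
    }

    // LIKE 'prefix%': matches are adjacent in sorted order, so jump to the
    // first candidate and stop at the first entry that no longer matches.
    public static List<String> prefixSearch(TreeSet<String> index, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String name : index.tailSet(prefix, true)) {
            if (!name.startsWith(prefix)) {
                break;
            }
            matches.add(name);
        }
        return matches;
    }

    // LIKE '%suffix': sort order says nothing about how a value ends,
    // so every entry has to be examined.
    public static List<String> suffixSearch(TreeSet<String> index, String suffix) {
        List<String> matches = new ArrayList<>();
        for (String name : index) {
            if (name.endsWith(suffix)) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> index = buildIndex();
        System.out.println("last name Thomas:  " + prefixSearch(index, "Thomas"));
        System.out.println("first name Thomas: " + suffixSearch(index, "Thomas"));
    }
}
```

With two rows the difference is invisible, but the prefix search touches only the matching range, whereas the suffix search always visits every entry.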

Solutions

1. Full-Text Indexing in SQL

2. Sphinx Search

3. Apache Lucene

4. Inverted Index

5. Search Engine Service

MySQL FULLTEXT Index

MySQL FULLTEXT Index

• Special index type for MyISAM

• Integrated with SQL queries

• Balances features vs. speed vs. space

MySQL FULLTEXT:

Indexing

CREATE FULLTEXT INDEX PostText ON Posts(title, body, tags);

time: 15 min 6 sec

MySQL FULLTEXT:

Index Caching

SET GLOBAL key_buffer_size = 600*1024*1024;

LOAD INDEX INTO CACHE Posts INDEX(PostText);

time: 11 sec

MySQL FULLTEXT:

Querying

SELECT * FROM Posts
WHERE MATCH( column(s) )
  AGAINST( 'query pattern' );

MATCH() must include all columns of the index, in the order defined

MySQL FULLTEXT:

Natural Language Mode

Searches concepts with free text queries:

SELECT * FROM Posts
WHERE MATCH( title, body, tags )
  AGAINST('improving mysql performance' IN NATURAL LANGUAGE MODE)
LIMIT 100;

time with index: 80 milliseconds

MySQL FULLTEXT:

Boolean Mode

Searches words using mini-language:

SELECT * FROM Posts
WHERE MATCH( title, body, tags )
  AGAINST('+mysql +performance' IN BOOLEAN MODE);

time with index: 50 milliseconds
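To make the mini-language concrete, here is a deliberately simplified Java sketch of how the `+` (must be present) and `-` (must not be present) prefixes decide whether a document matches. It is an assumption-laden illustration, not MySQL's implementation: it ignores ranking, stopwords, minimum word length, and the other boolean-mode operators, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BooleanModeSketch {
    // Returns true if the document satisfies the +/- terms of the query.
    public static boolean matches(String query, String document) {
        // Crude tokenizer: lowercase, split on anything that is not a letter, digit or '_'.
        Set<String> words = new HashSet<>(
            Arrays.asList(document.toLowerCase().split("[^a-z0-9_]+")));

        boolean hasRequired = false;
        boolean optionalFound = false;
        for (String term : query.toLowerCase().trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (term.startsWith("+")) {
                // +word: the word must be present.
                hasRequired = true;
                if (!words.contains(term.substring(1))) {
                    return false;
                }
            } else if (term.startsWith("-")) {
                // -word: the word must not be present.
                if (words.contains(term.substring(1))) {
                    return false;
                }
            } else if (words.contains(term)) {
                // word with no prefix: optional.
                optionalFound = true;
            }
        }
        // Without any +word, at least one optional word has to be found.
        return hasRequired || optionalFound;
    }

    public static void main(String[] args) {
        String query = "+mysql +performance";
        System.out.println(matches(query, "Improving MySQL performance"));
        System.out.println(matches(query, "MySQL replication"));
    }
}
```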

Lucene

Lucene

• Apache Project since 2001

• Apache License

• Java implementation

• Ports exist for other languages:

• Lucy (C)

• Lucene.NET (C#)

• Zend_Search_Lucene (PHP)

• PyLucene (Python)

• Plucene (Perl)

• Ferret (Ruby)

Lucene:

How to use

1. Add documents to index

2. Parse query

3. Execute query

Lucene:

Creating an index

• Programmatic solution in Java...

time: 6 minutes, 50 seconds

Lucene:

Indexing

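import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

String url = "jdbc:mysql://localhost/stackoverflow?"
    + "user=myappuser&password=xxxx";
Class.forName("com.mysql.jdbc.Driver");
// credentials are already part of the URL
Connection con = DriverManager.getConnection(url);

// any SQL query
String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
com.mysql.jdbc.Statement stmt =
    (com.mysql.jdbc.Statement) con.createStatement();
stmt.enableStreamingResults();  // fetch rows one at a time instead of buffering them all
ResultSet rs = stmt.executeQuery(sql);

// open Lucene index writer
IndexWriter writer = new IndexWriter(
    FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT),
    true,
    IndexWriter.MaxFieldLength.LIMITED);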

Lucene:

Indexing

// loop over SQL result
while (rs.next()) {
    // each row is a Document
    Document doc = new Document();

    // with four Fields
    doc.add(new Field("PostId", rs.getString("PostId"),
        Field.Store.YES, Field.Index.NO));
    doc.add(new Field("Title", rs.getString("Title"),
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("Body", rs.getString("Body"),
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("Tags", rs.getString("Tags"),
        Field.Store.YES, Field.Index.ANALYZED));

    writer.addDocument(doc);
}

// finish and close index
writer.optimize();
writer.close();

Lucene:

Querying

• Parse a Lucene query

// define fields
String[] fields = new String[3];
fields[0] = "Title";
fields[1] = "Body";
fields[2] = "Tags";

// parse search query
Query q = new MultiFieldQueryParser(fields, new StandardAnalyzer())
    .parse("performance");

• Execute the query

Searcher s = new IndexSearcher(indexDirectory, true);
Hits h = s.search(q);

time: 120 milliseconds

Sphinx Search

Sphinx Search

• Started in 2001

• GPLv2 license

• Good database integration: SphinxSE storage engine for MySQL

Sphinx Search:

How to use

1. Edit configuration file

2. Index the data

3. Query the index

4. Issues

Sphinx Search:

sphinx.conf

source stackoverflowsrc
{
  type = mysql
  sql_host = localhost
  sql_user = myappuser
  sql_pass = xxxx
  sql_db = stackoverflow
  sql_query = SELECT PostId, Title, Body, Tags FROM Posts
  sql_query_info = SELECT * FROM Posts WHERE PostId=$id
}

Sphinx Search:

sphinx.conf

index stackoverflow
{
  source = stackoverflowsrc
  path = /opt/local/var/db/sphinx/stackoverflow
}

Sphinx Search:

Building index

indexer -c sphinx.conf stackoverflow

collected 1517638 docs, 1021.3 MB
sorted 171.5 Mhits, 100.0% done
total 1517638 docs, 1021342525 bytes
total 147.060 sec, 6945093.00 bytes/sec, 10319.88 docs/sec

time: 2 min 27 sec

Sphinx Search:

Querying index

search -c sphinx.conf -i stackoverflow -b "sql & performance"

time: 12 milliseconds

Sphinx Search:

Issues

Cost to update index = cost to build index

• Build a “main” index plus a “delta” index for recent changes

• Merge indexes periodically (much less costly)

• But not all data fits into this model; e.g. it is good for a forum, where changes are mostly new posts, but bad for a wiki, where old pages keep being edited

Inverted Index

Inverted index

[Figure: Posts, PostTags, Tags. PostTags is a many-to-many relationship for Posts and words; Tags holds the searchable words]
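The idea behind the diagram can be sketched in memory: instead of scanning each post for a word, keep a map from each word to the posts that contain it. The Java class below is a minimal illustration with hypothetical names; the approach in these slides stores the same mapping in SQL tables instead.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexDemo {
    // word -> ids of the posts that carry that word
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Registers every given word as belonging to the post.
    public void add(int postId, String... words) {
        for (String word : words) {
            postings.computeIfAbsent(word, w -> new TreeSet<Integer>()).add(postId);
        }
    }

    // One map lookup instead of a scan over every post.
    public Set<Integer> search(String word) {
        return postings.getOrDefault(word, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexDemo index = new InvertedIndexDemo();
        index.add(1, "mysql", "performance");
        index.add(2, "php");
        index.add(3, "performance");
        System.out.println(index.search("performance"));
    }
}
```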

Inverted index:

Updated ER Diagram

new tables

Inverted index:

Data definition

CREATE TABLE Tags (
  TagId SERIAL PRIMARY KEY,
  Tag VARCHAR(50) NOT NULL,
  UNIQUE KEY (Tag)
);

CREATE TABLE PostTags (
  PostId INT NOT NULL,
  TagId INT NOT NULL,
  PRIMARY KEY (PostId, TagId),
  FOREIGN KEY (PostId) REFERENCES Posts (PostId),
  FOREIGN KEY (TagId) REFERENCES Tags (TagId)
);

Inverted index:

Indexing

1. Query all Posts.Tags strings: “<mysql><search><performance>”

2. Loop over tag strings

3. Dump two CSV files:

• Tags.csv

• PostTags.csv

4. Load CSV files with mysqlimport

time: 23.5 seconds

time: 5.2 seconds
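The looping step can be sketched in Java as follows. This is a minimal sketch under assumptions, not the script behind the timings above: the class and method names are hypothetical, tag ids are assigned in order of first appearance, and the CSV rows are built in memory rather than written to Tags.csv and PostTags.csv.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagIndexer {
    // One tag is the text between a '<' and the next '>'.
    private static final Pattern TAG = Pattern.compile("<([^<>]+)>");

    // Tag text -> TagId, kept in order of first appearance.
    private final Map<String, Integer> tagIds = new LinkedHashMap<>();
    // One "PostId,TagId" row per (post, tag) pair.
    private final List<String> postTagCsv = new ArrayList<>();

    // Splits a string like "<mysql><search><performance>" into its tags.
    public static List<String> parseTags(String tagString) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(tagString);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    // Records one post: assigns ids to unseen tags and emits PostTags rows.
    // The LinkedHashSet drops repeated tags, since (PostId, TagId) is a primary key.
    public void addPost(int postId, String tagString) {
        for (String tag : new LinkedHashSet<String>(parseTags(tagString))) {
            Integer tagId = tagIds.get(tag);
            if (tagId == null) {
                tagId = tagIds.size() + 1;
                tagIds.put(tag, tagId);
            }
            postTagCsv.add(postId + "," + tagId);
        }
    }

    // Rows for Tags.csv in the form "TagId,Tag".
    public List<String> tagRows() {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tagIds.entrySet()) {
            rows.add(e.getValue() + "," + e.getKey());
        }
        return rows;
    }

    // Rows for PostTags.csv in the form "PostId,TagId".
    public List<String> postTagRows() {
        return new ArrayList<>(postTagCsv);
    }

    public static void main(String[] args) {
        TagIndexer indexer = new TagIndexer();
        indexer.addPost(1, "<mysql><search><performance>");
        indexer.addPost(2, "<performance><php>");
        System.out.println("Tags.csv:     " + indexer.tagRows());
        System.out.println("PostTags.csv: " + indexer.postTagRows());
    }
}
```

In a real run the post ids and tag strings would come from the query in step 1, and the rows would be written to the two files before loading them.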

Inverted index:

Querying

SELECT p.* FROM Posts p
JOIN PostTags pt USING (PostId)
JOIN Tags t USING (TagId)
WHERE t.Tag = 'performance';

time: 250 milliseconds

Inverted Index:

Is it right for you?

• Best for searching selected words

• Simple, portable, standard SQL

• Not as fast as specialized technology, but far better than using LIKE

Search Engine Services

Search engine services:

Google Custom Search Engine

• http://www.google.com/cse/

• DEMO ➪ http://www.karwin.com/demo/gcse-demo.html

even big web sites use this solution





Search engine services:

Is it right for you?

• Your site is public and allows external index

• Search is a non-critical feature for you

• Search results are satisfactory

• You need to offload search processing

Comparison: Time to Build Index

LIKE expression    none
MySQL FULLTEXT     15 min
Apache Lucene      6 min 50 sec
Sphinx Search      2 min 27 sec
Inverted index     28 sec
Google / Yahoo!    offline

Comparison: Index Storage

LIKE expression    none
MySQL FULLTEXT     466 MB
Apache Lucene      1323 MB
Sphinx Search      933 MB
Inverted index     48 MB
Google / Yahoo!    offline

Comparison: Query Speed

LIKE expression    22 seconds
MySQL FULLTEXT     50-80 ms
Apache Lucene      120 ms
Sphinx Search      12 ms
Inverted index     250 ms
Google / Yahoo!    *

Comparison: Bottom-Line

                   indexing   storage   query   solution
LIKE expression    none       none      2000x   SQL
MySQL FULLTEXT     32x        10x       6x      RDBMS
Apache Lucene      15x        27x       10x     3rd party
Sphinx Search      5x         20x       1x      3rd party
Inverted index     1x         1x        20x     SQL
Google / Yahoo!    offline    offline   *       Service
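The multipliers in the bottom-line comparison appear to be each measurement divided by the best result in its column. That reading is an inference from the numbers, not something the slides state, and several of the slide's figures are rounded more loosely than this arithmetic would give. A small Java sketch, with hypothetical names, that reproduces a few of them from the measurements on the earlier comparison slides:

```java
public class ComparisonRatios {
    // Ratio of a measurement to the best (smallest) one in its column,
    // rounded to the nearest whole number.
    public static long timesWorse(double value, double best) {
        return Math.round(value / best);
    }

    public static void main(String[] args) {
        // Index build times in seconds; the inverted index (28 sec) is the best.
        double fulltextBuild = 15 * 60 + 6;   // 15 min 6 sec
        double luceneBuild = 6 * 60 + 50;     // 6 min 50 sec
        double invertedBuild = 28;

        // Index storage in MB; the inverted index (48 MB) is the best.
        double fulltextStorage = 466;
        double invertedStorage = 48;

        // Query times in milliseconds; Sphinx Search (12 ms) is the best.
        double luceneQuery = 120;
        double sphinxQuery = 12;

        System.out.println("FULLTEXT indexing: "
            + timesWorse(fulltextBuild, invertedBuild) + "x");
        System.out.println("Lucene indexing:   "
            + timesWorse(luceneBuild, invertedBuild) + "x");
        System.out.println("FULLTEXT storage:  "
            + timesWorse(fulltextStorage, invertedStorage) + "x");
        System.out.println("Lucene query:      "
            + timesWorse(luceneQuery, sphinxQuery) + "x");
    }
}
```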

Copyright 2009 Bill Karwin

www.slideshare.net/billkarwin

Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/

You are free to share - to copy, distribute and transmit this work, under the following conditions:

Attribution. You must attribute this work to Bill Karwin.

Noncommercial. You may not use this work for commercial purposes.

No Derivative Works. You may not alter, transform, or build upon this work.

http://www.karwin.com

