Practical full-text search in MySQL
Bill Karwin
MySQL University • 2009-12-3
Me
• 20+ years experience
  • Application/SDK developer
  • Support, Training, Proj Mgmt
  • C, Java, Perl, PHP
• SQL maven
  • MySQL, PostgreSQL, InterBase
  • Zend Framework
  • Oracle, SQL Server, IBM DB2, SQLite
• Community contributor
Full Text Search
In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.
http://www.flickr.com/photos/tryingyouth/
Test Data
• StackOverflow.com data dump, exported October 2009
• 1.5 million tuples
• ~1 Gigabyte
StackOverflow ER diagram
[Diagram: StackOverflow ER diagram; a callout marks the searchable text]
Naive Searching
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
- Jamie Zawinski
Accuracy issue
• LIKE matches substrings, so searching for ‘one’ also returns irrelevant words such as ‘money’ and ‘prone’:
  body LIKE '%one%'
• Regular expressions in MySQL support escapes for word boundaries:
  body RLIKE '[[:<:]]one[[:>:]]'
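The difference between the two predicates above can be tried outside the database. This is a minimal Python sketch, not MySQL itself: `like_contains` and `rlike_whole_word` are hypothetical helper names, the sample posts are made up, and Python's `\b` is only an approximation of MySQL's `[[:<:]]` and `[[:>:]]` markers.

```python
import re

# Made-up sample rows standing in for Posts.body
posts = [
    "This one is about indexes",
    "How much money does a DBA earn?",
    "Queries are prone to full table scans",
]

def like_contains(text, word):
    """Rough stand-in for: body LIKE '%word%' (case-insensitive substring test)."""
    return word.lower() in text.lower()

def rlike_whole_word(text, word):
    r"""Rough stand-in for: body RLIKE '[[:<:]]word[[:>:]]', using Python's \b."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

# Substring matching also hits "money" and "prone"
substring_hits = [p for p in posts if like_contains(p, "one")]

# Word-boundary matching keeps only the post that contains the word "one"
whole_word_hits = [p for p in posts if rlike_whole_word(p, "one")]
```

Here `substring_hits` holds all three posts, while `whole_word_hits` holds only the first.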
Performance issue
• LIKE with wildcards:
  SELECT * FROM Posts
  WHERE body LIKE '%performance%'
  time: 22 sec
• POSIX regular expressions:
  SELECT * FROM Posts
  WHERE body RLIKE 'performance'
  time: 108 sec
Why so slow?
CREATE TABLE telephone_book (
  full_name VARCHAR(50)
);
CREATE INDEX name_idx ON telephone_book (full_name);
INSERT INTO telephone_book VALUES
  ('Riddle, Thomas'),
  ('Thomas, Dean');
Why so slow?
• Search for all with last name “Thomas”
  SELECT * FROM telephone_book
  WHERE full_name LIKE 'Thomas%'
  uses index
• Search for all with first name “Thomas”
  SELECT * FROM telephone_book
  WHERE full_name LIKE '%Thomas'
  doesn’t use index
☞ Indexes don’t help searching for substrings
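One way to see why the leading wildcard defeats the index: an ordinary index keeps values in sorted order, so a known prefix can be located by binary search, while a known suffix can be anywhere. The sketch below uses a sorted Python list as a toy stand-in for `name_idx`; it is not how MySQL stores a B-tree. The function names and the two extra names are made up for illustration.

```python
from bisect import bisect_left

# Toy stand-in for name_idx: index entries kept in sorted order.
# 'Granger, Hermione' and 'Thomas, Alice' are extra illustrative rows.
name_idx = sorted([
    "Riddle, Thomas",
    "Thomas, Dean",
    "Granger, Hermione",
    "Thomas, Alice",
])

def prefix_search(index, prefix):
    """LIKE 'prefix%': binary-search to the first candidate, then read neighbours."""
    matches, examined = [], 0
    i = bisect_left(index, prefix)
    while i < len(index):
        examined += 1
        if not index[i].startswith(prefix):
            break  # sorted order guarantees no later entry can match
        matches.append(index[i])
        i += 1
    return matches, examined

def suffix_search(index, suffix):
    """LIKE '%suffix': sort order says nothing about endings, so check every entry."""
    matches = [name for name in index if name.endswith(suffix)]
    return matches, len(index)
```

With these four rows, `prefix_search(name_idx, "Thomas")` examines only the two matching entries, while `suffix_search(name_idx, "Thomas")` has to examine all four to find 'Riddle, Thomas'. The gap grows with table size.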
Solutions
1. Full-Text Indexing in SQL
2. Sphinx Search
3. Apache Lucene
4. Inverted Index
5. Search Engine Service
MySQL FULLTEXT Index
MySQL FULLTEXT Index
• Special index type for MyISAM
• Integrated with SQL queries
• Balances features vs. speed vs. space
MySQL FULLTEXT:
Indexing
CREATE FULLTEXT INDEX PostText ON Posts(title, body, tags);
time: 15 min 6 sec
MySQL FULLTEXT:
Index Caching
SET GLOBAL key_buffer_size = 600*1024*1024;
LOAD INDEX INTO CACHE Posts INDEX(PostText);
time: 11 sec
MySQL FULLTEXT:
Querying
SELECT * FROM Posts
WHERE MATCH( column(s) )
  AGAINST( 'query pattern' );
MATCH() must include all columns of the index, in the order defined
MySQL FULLTEXT:
Natural Language Mode
Searches concepts with free text queries:
SELECT * FROM Posts
WHERE MATCH( title, body, tags )
  AGAINST('improving mysql performance' IN NATURAL LANGUAGE MODE)
LIMIT 100;
time with index: 80 milliseconds
MySQL FULLTEXT:
Boolean Mode
Searches words using mini-language:
SELECT * FROM Posts
WHERE MATCH( title, body, tags )
  AGAINST('+mysql +performance' IN BOOLEAN MODE);
time with index: 50 milliseconds
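To make the mini-language concrete, here is a toy Python evaluator for the two most common operators: `+word` (must be present) and `-word` (must be absent), with bare words treated as optional. This is only a sketch of the idea, not MySQL's implementation. It ignores relevance ranking, stopwords, minimum word length, quoted phrases and the other boolean-mode operators, and `tokenize` and `boolean_match` are hypothetical names.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def boolean_match(query, text):
    """Toy evaluation of '+must -must_not optional' against one document."""
    words = tokenize(text)
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            excluded.append(term[1:])
        else:
            optional.append(term)
    if any(w in words for w in excluded):
        return False          # an excluded word is present
    if required:
        return all(w in words for w in required)
    return any(w in words for w in optional)  # no '+' terms: any word is enough
```

For example, `boolean_match("+mysql +performance", "Improving MySQL performance")` is true, while the same query against "MySQL backups" is false because 'performance' is missing.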
Lucene
Lucene
• Apache Project since 2001
• Apache License
• Java implementation
• Ports exist for other languages:
  • Lucy (C)
  • Lucene.NET (C#)
  • Zend_Search_Lucene (PHP)
  • PyLucene (Python)
  • Plucene (Perl)
  • Ferret (Ruby)
Lucene:
How to use
1. Add documents to index
2. Parse query
3. Execute query
Lucene:
Creating an index
• Programmatic solution in Java...
time: 6 minutes, 50 seconds
Lucene:
Indexing
// connect to MySQL (user and password are passed in the URL)
String url = "jdbc:mysql://localhost/stackoverflow?"
    + "user=myappuser&password=xxxx";
Class.forName("com.mysql.jdbc.Driver");
Connection con = DriverManager.getConnection(url);

// any SQL query
String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
com.mysql.jdbc.Statement stmt =
    (com.mysql.jdbc.Statement) con.createStatement();
stmt.enableStreamingResults();
ResultSet rs = stmt.executeQuery(sql);

// open Lucene index writer
IndexWriter writer = new IndexWriter(
    FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT),
    true, IndexWriter.MaxFieldLength.LIMITED);
Lucene:
Indexing
// loop over SQL result
while (rs.next()) {
  // each row is a Document
  Document doc = new Document();

  // with four Fields
  doc.add(new Field("PostId", rs.getString("PostId"),
      Field.Store.YES, Field.Index.NO));
  doc.add(new Field("Title", rs.getString("Title"),
      Field.Store.YES, Field.Index.ANALYZED));
  doc.add(new Field("Body", rs.getString("Body"),
      Field.Store.YES, Field.Index.ANALYZED));
  doc.add(new Field("Tags", rs.getString("Tags"),
      Field.Store.YES, Field.Index.ANALYZED));

  writer.addDocument(doc);
}

// finish and close index
writer.optimize();
writer.close();
Lucene:
Querying
• Parse a Lucene query
  // define fields
  String[] fields = new String[3];
  fields[0] = "Title"; fields[1] = "Body"; fields[2] = "Tags";
  // parse search query
  Query q = new MultiFieldQueryParser(fields,
      new StandardAnalyzer()).parse("performance");
• Execute the query
  Searcher s = new IndexSearcher(indexDirectory, true);
  Hits h = s.search(q);
time: 120 milliseconds
Sphinx Search
Sphinx Search
• Started in 2001
• GPLv2 license
• Good database integration: SphinxSE storage engine for MySQL
Sphinx Search:
How to use
1. Edit configuration file
2. Index the data
3. Query the index
4. Issues
Sphinx Search:
sphinx.conf
source stackoverflowsrc
{
    type = mysql
    sql_host = localhost
    sql_user = myappuser
    sql_pass = xxxx
    sql_db = stackoverflow
    sql_query = SELECT PostId, Title, Body, Tags FROM Posts
    sql_query_info = SELECT * FROM Posts WHERE PostId=$id
}
Sphinx Search:
sphinx.conf
index stackoverflow
{
    source = stackoverflowsrc
    path = /opt/local/var/db/sphinx/stackoverflow
}
Sphinx Search:
Building index
indexer -c sphinx.conf stackoverflow
collected 1517638 docs, 1021.3 MB
sorted 171.5 Mhits, 100.0% done
total 1517638 docs, 1021342525 bytes
total 147.060 sec, 6945093.00 bytes/sec, 10319.88 docs/sec
time: 2 min 27 sec
Sphinx Search:
Querying index
search -c sphinx.conf -i stackoverflow -b "sql & performance"
time: 12 milliseconds
Sphinx Search:
Issues
Cost to update index = cost to build index
• Build a “main” index plus a “delta” index for recent changes
• Merge indexes periodically (much less costly)
• But not all data fits this model; e.g. it works for a forum, where most changes are new posts, but not for a wiki, where old pages are edited often
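The main-plus-delta idea can be sketched with two small in-memory indexes. This is a Python illustration of the general scheme only, not Sphinx's index format or API. `build_index`, `search` and `merge` are hypothetical names, and the documents are made up.

```python
from collections import defaultdict

def build_index(docs):
    """Build a word -> set(doc_id) index from a {doc_id: text} mapping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(word, main, delta):
    """Query both indexes and combine the hits."""
    return main.get(word, set()) | delta.get(word, set())

def merge(main, delta):
    """Fold the delta into the main index; the delta can then start empty again."""
    for word, ids in delta.items():
        main[word] |= ids
    return main

# Large "main" index, built rarely (expensive)
main = build_index({1: "mysql performance tuning", 2: "sphinx search setup"})
# Small "delta" index, rebuilt often with only the recent rows (cheap)
delta = build_index({3: "sphinx performance tips"})

hits = search("performance", main, delta)
merged = merge(main, delta)
```

In this toy version only additions are handled. If an old document were edited, its old words would still match in `main` until the next full rebuild, which is the kind of workload the last bullet warns about.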
Inverted Index
Inverted index
[Diagram: Posts, PostTags, Tags. PostTags is a many-to-many relationship between Posts and searchable words (Tags).]
Inverted index:
Updated ER Diagram
new tables
Inverted index:
Data definition
CREATE TABLE Tags (
  TagId SERIAL PRIMARY KEY,
  Tag VARCHAR(50) NOT NULL,
  UNIQUE KEY (Tag)
);
CREATE TABLE PostTags (
  PostId INT NOT NULL,
  TagId BIGINT UNSIGNED NOT NULL,  -- same type as Tags.TagId (SERIAL)
  PRIMARY KEY (PostId, TagId),
  FOREIGN KEY (PostId) REFERENCES Posts (PostId),
  FOREIGN KEY (TagId) REFERENCES Tags (TagId)
);
Inverted index:
Indexing
1. Query all Posts.Tags strings: “<mysql><search><performance>”
2. Loop over tag strings
3. Dump two CSV files:
  • Tags.csv
  • PostTags.csv
4. Load CSV files with mysqlimport
time: 23.5 seconds
time: 5.2 seconds
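Steps 2 and 3 above might look something like the following. This is a hedged Python sketch, not the script used to produce the timings. `build_tag_rows` and `to_csv` are hypothetical names, the sample posts are made up, and it assumes each PostId appears once in the input. It builds the CSV text in memory; a real run would write Tags.csv and PostTags.csv to disk before loading them with mysqlimport.

```python
import csv
import io
import re

def build_tag_rows(posts):
    """posts: iterable of (post_id, tags_string), e.g. (1, '<mysql><search>')."""
    tag_ids = {}     # tag text -> TagId, numbered in order of first appearance
    post_tags = []   # (PostId, TagId) pairs for PostTags.csv
    for post_id, tags in posts:
        # dict.fromkeys drops repeated tags within one post, keeping their order
        for tag in dict.fromkeys(re.findall(r"<([^<>]+)>", tags or "")):
            tag_id = tag_ids.setdefault(tag, len(tag_ids) + 1)
            post_tags.append((post_id, tag_id))
    return tag_ids, post_tags

def to_csv(rows):
    """Render rows as CSV text (stands in for writing a .csv file)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

# Made-up sample of (PostId, Tags) rows; a post without tags has None
posts = [(1, "<mysql><search><performance>"), (2, "<mysql><indexing>"), (3, None)]

tag_ids, post_tags = build_tag_rows(posts)
tags_csv = to_csv((tag_id, tag) for tag, tag_id in tag_ids.items())
post_tags_csv = to_csv(post_tags)
```

Assigning TagId values in the script, rather than relying on auto-increment at load time, is one way to let both files be generated in a single pass over the posts.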
Inverted index:
Querying
SELECT p.* FROM Posts p
JOIN PostTags pt USING (PostId)
JOIN Tags t USING (TagId)
WHERE t.Tag = 'performance';
time: 250 milliseconds
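The same join can be tried end to end on a tiny data set. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in for MySQL, so the column types differ from the data definition slide. The three sample posts and two tags are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Posts (PostId INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Tags (TagId INTEGER PRIMARY KEY, Tag TEXT NOT NULL UNIQUE);
    CREATE TABLE PostTags (
        PostId INTEGER NOT NULL REFERENCES Posts (PostId),
        TagId  INTEGER NOT NULL REFERENCES Tags (TagId),
        PRIMARY KEY (PostId, TagId)
    );
    INSERT INTO Posts VALUES
        (1, 'Slow LIKE queries'), (2, 'Backup strategy'), (3, 'Index tuning');
    INSERT INTO Tags VALUES (1, 'performance'), (2, 'backup');
    INSERT INTO PostTags VALUES (1, 1), (2, 2), (3, 1);
""")

# The tag lookup is an equality match on an indexed column (Tags.Tag is UNIQUE),
# and the joins follow primary keys, so no text column is scanned.
rows = con.execute("""
    SELECT p.PostId, p.Title
    FROM Posts p
    JOIN PostTags pt USING (PostId)
    JOIN Tags t USING (TagId)
    WHERE t.Tag = ?
    ORDER BY p.PostId
""", ("performance",)).fetchall()
```

Here `rows` contains posts 1 and 3, the two posts tagged 'performance'.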
Inverted Index:
Is it right for you?
• Best for searching selected words
• Simple, portable, standard SQL
• Not as fast as specialized technology, but far better than using LIKE
Search Engine Services
Search engine services:
Google Custom Search Engine
• http://www.google.com/cse/
• DEMO ➪ http://www.karwin.com/demo/gcse-demo.html
even big web sites use this solution
Search engine services:
Is it right for you?
• Your site is public and allows external index
• Search is a non-critical feature for you
• Search results are satisfactory
• You need to offload search processing
Comparison: Time to Build Index
LIKE expression none
MySQL FULLTEXT 15 min
Apache Lucene 6 min 50 sec
Sphinx Search 2 min 27 sec
Inverted index 28 sec
Google / Yahoo! offline
Comparison: Index Storage
LIKE expression none
MySQL FULLTEXT 466 MB
Apache Lucene 1323 MB
Sphinx Search 933 MB
Inverted index 48 MB
Google / Yahoo! offline
Comparison: Query Speed
LIKE expression 22 seconds
MySQL FULLTEXT 50-80 ms
Apache Lucene 120 ms
Sphinx Search 12 ms
Inverted index 250 ms
Google / Yahoo! *
Comparison: Bottom-Line
                  indexing   storage   query   solution
LIKE expression   none       none      2000x   SQL
MySQL FULLTEXT    32x        10x       6x      RDBMS
Apache Lucene     15x        27x       10x     3rd party
Sphinx Search     5x         20x       1x      3rd party
Inverted index    1x         1x        20x     SQL
Google / Yahoo!   offline    offline   *       Service
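The multipliers in this table appear to be each measurement divided by the best value in its column. The Python sketch below redoes that arithmetic from the three comparison slides. It rests on assumptions: the 80 ms end of the 50-80 ms MySQL FULLTEXT range is used, and `relative_to_best` is a hypothetical helper name. The computed ratios come out close to, but not exactly equal to, the rounded figures on the slide.

```python
# Measurements copied from the comparison slides
build_seconds = {
    "MySQL FULLTEXT": 15 * 60 + 6,   # 15 min 6 sec
    "Apache Lucene": 6 * 60 + 50,    # 6 min 50 sec
    "Sphinx Search": 2 * 60 + 27,    # 2 min 27 sec
    "Inverted index": 28,
}
storage_mb = {
    "MySQL FULLTEXT": 466,
    "Apache Lucene": 1323,
    "Sphinx Search": 933,
    "Inverted index": 48,
}
query_ms = {
    "LIKE expression": 22_000,
    "MySQL FULLTEXT": 80,            # assumption: upper end of the 50-80 ms range
    "Apache Lucene": 120,
    "Sphinx Search": 12,
    "Inverted index": 250,
}

def relative_to_best(measurements):
    """Express every value as a multiple of the smallest (best) one."""
    best = min(measurements.values())
    return {name: value / best for name, value in measurements.items()}

indexing_ratio = relative_to_best(build_seconds)
storage_ratio = relative_to_best(storage_mb)
query_ratio = relative_to_best(query_ms)

for name, ratio in query_ratio.items():
    print(f"{name:16s} {ratio:8.1f}x")
```

For instance, the LIKE query works out to roughly 1833 times the Sphinx query time, which the slide rounds to 2000x.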
Copyright 2009 Bill Karwin
www.slideshare.net/billkarwin
Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/
You are free to share - to copy, distribute and transmit this work, under the following conditions:
Attribution. You must attribute this work to Bill Karwin.
Noncommercial. You may not use this work for commercial purposes.
No Derivative Works. You may not alter, transform, or build upon this work.