FOSDEM (Feb 2011) - A Real-Time Search Engine with Lucene and S4
![Page 1: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/1.jpg)
A Real-Time Search Engine with Lucene and S4
Yahoo! S4 applied to Information Retrieval
2/5/2011 Michaël Figuière
![Page 2: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/2.jpg)
Speaker
Michaël Figuière
@mfiguiere
blog.xebia.fr
Search Engines, NoSQL, Distributed Architectures
![Page 3: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/3.jpg)
Our case study
A Search Engine to keep track of activities within an enterprise
![Page 4: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/4.jpg)
The Problem
![Page 5: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/5.jpg)
A Search Engine
Search
![Page 6: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/6.jpg)
A Search Engine
Search: MyCustomer
![Page 7: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/7.jpg)
A Search Engine
Search: MyCustomer
[Document] Non Disclosure Agreement, 12 days ago: ... MyCustomer agrees not to disclose any part of ...
[Document] 2010 Sales Report, 1 month ago: ... MyCustomer: 12 M€ with 3 deals ...
[Phone Call] Phone Call, 2 days ago: Customer: MyCustomer; Time: 9:55am; Duration: 13min; Description: Invoice not received for order #2354E
![Page 8: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/8.jpg)
Indexing Pipeline
[Diagram: indexing pipeline. Components: Text Extractor (Tika), PhoneCall input, two Analyzers and a Search Index (Lucene).]
![Page 9: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/9.jpg)
A more complex Search Engine
Search: MyCustomer
[Document] 2010 Sales Report, 1 month ago: ... MyCustomer: 12 M€ with 3 deals ...
[Phone Call] Phone Call, 2 days ago: Customer: MyCustomer; Time: 9:55am; Duration: 13min; Description: Invoice not received for order #2354E
Categories: Sales, Legal, Accounting
![Page 10: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/10.jpg)
Indexing Pipeline
[Diagram: indexing pipeline with classification. Components: Text Extractor (Tika), PhoneCall input, two Classifiers (Mahout), two Analyzers and a Search Index (Lucene).]
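The stages named on the indexing pipeline slides can be sketched in a few lines. This is a toy illustration in plain Python, not the speaker's code: `extract_text`, `classify`, `analyze` and the keyword table are hypothetical stand-ins for Tika, Mahout and a Lucene analyzer, and the in-memory dictionary stands in for a Lucene index.

```python
import re
from collections import defaultdict

# Hypothetical keyword table standing in for a trained Mahout classifier.
CATEGORY_KEYWORDS = {
    "Sales": {"deal", "deals", "sales"},
    "Accounting": {"invoice", "order"},
}

def extract_text(raw):
    """Stand-in for Tika: here the 'document' is already UTF-8 text."""
    return raw.decode("utf-8")

def analyze(text):
    """Stand-in for a Lucene analyzer: lowercase, split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def classify(tokens):
    """Return every category whose keyword set intersects the tokens."""
    found = set(tokens)
    return sorted(c for c, kw in CATEGORY_KEYWORDS.items() if kw & found)

class ToyIndex:
    """In-memory inverted index: term -> set of document ids."""
    def __init__(self):
        self.postings = defaultdict(set)
        self.categories = {}

    def add(self, doc_id, tokens, categories):
        for t in tokens:
            self.postings[t].add(doc_id)
        self.categories[doc_id] = categories

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))

def index_document(index, doc_id, raw):
    # Text extraction -> analysis -> classification -> index, as on the slide.
    tokens = analyze(extract_text(raw))
    index.add(doc_id, tokens, classify(tokens))

index = ToyIndex()
index_document(index, "report-2010", b"MyCustomer: 12 M EUR with 3 deals")
index_document(index, "call-1", b"Customer: MyCustomer. Invoice not received for order #2354E")
print(index.search("MyCustomer"))   # ['call-1', 'report-2010']
print(index.categories)
```

Every stage here is stateless except the index itself, which is why the later slides can move the pre-processing elsewhere without touching index storage.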
![Page 11: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/11.jpg)
More complex ...
• Entity Recognition
  Recognizes an entity written in any way
• Language Recognition
  To index each language separately
• Fetching linked URLs
  Enhances document context by also indexing linked URLs
• ...
![Page 12: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/12.jpg)
A Real-Time Search Engine
Search: MyCustomer
[Document] 2010 Sales Report, 1 month ago: ... MyCustomer: 12 M€ with 3 deals ...
[Phone Call] Phone Call, 3 seconds ago: Customer: MyCustomer; Time: 9:55am; Duration: 13min; Description: Invoice not received for order #2354E
Categories: Sales, Legal, Accounting
![Page 14: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/14.jpg)
Indexing Pipeline
[Diagram: indexing pipeline. Components: Text Extractor, PhoneCall input, two "Some Pre-Processing" stages, two Analyzers, Near Real-Time Search Index (since Lucene 2.9).]
![Page 15: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/15.jpg)
But...
[Diagram: indexing pipeline. Components: Text Extractor, PhoneCall input, two "Some Pre-Processing" stages, two Analyzers, Near Real-Time Search Index.]
What if it takes one second per document on a single box?
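To put the question in numbers: at one second per document, a single box tops out at 3,600 documents per hour. The helper below is a back-of-the-envelope sketch that assumes each box processes one document at a time with no coordination overhead; the arrival rates are made-up figures for illustration.

```python
import math

def boxes_needed(docs_per_second, seconds_per_doc=1.0):
    """Minimum number of single-threaded boxes that keep up with the arrival rate."""
    return math.ceil(docs_per_second * seconds_per_doc)

def max_docs_per_day(boxes, seconds_per_doc=1.0):
    """Upper bound on documents processed per day by `boxes` workers."""
    return int(boxes * 86_400 / seconds_per_doc)

print(max_docs_per_day(1))   # 86400
print(boxes_needed(25))      # 25
print(boxes_needed(2.5))     # 3
```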
![Page 16: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/16.jpg)
Let’s distribute it
[Diagram: Server 1, Server 2, Server 3, ... Server N, each hosting Pre-Processing and a Search Index.]
Processing logic and index structure distributed together
![Page 17: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/17.jpg)
That’s a problem...
• Processing and index storage may have different scaling needs
  Depending on the search traffic, the processing overhead, ...
• Scaling index storage up and down is slow and complex
  Whereas stateless processing is simple to scale up/down
• Expensive pre-processing may make searches slower
  And indexing in real-time shouldn’t make searches slower!
![Page 18: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/18.jpg)
Let’s move it to Hadoop
[Diagram: the same pipeline moved to Hadoop MapReduce. Components: Text Extractor, PhoneCall input, two "Some Pre-Processing" stages, two Analyzers, Near Real-Time Search Index.]
![Page 19: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/19.jpg)
But...
• Hadoop can only deal with chunks of data
  Data must be available somewhere on HDFS
• An unbounded stream of data can’t fit into Hadoop MapReduce
  Hadoop is designed and optimized for batch processing
• Manually bounding the stream won’t be efficient
  It would result in a lot of regular, inefficient batches
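The cost of manually bounding a stream can be estimated with simple arithmetic. The sketch below assumes events arrive uniformly, that a batch is processed only once its window closes, and that every job pays a fixed start-up overhead. The 60-second window, 30-second overhead and 10-second processing time are illustrative numbers, not measurements from the talk.

```python
def batch_latency(window_s, job_overhead_s, processing_s):
    """Average and worst-case delay between an event arriving and being searchable."""
    average = window_s / 2 + job_overhead_s + processing_s
    worst = window_s + job_overhead_s + processing_s
    return average, worst

def overhead_share(window_s, job_overhead_s):
    """Job start-up time relative to the window length when one job runs per window."""
    return job_overhead_s / window_s

avg, worst = batch_latency(window_s=60, job_overhead_s=30, processing_s=10)
print(avg, worst)               # 70.0 100
print(overhead_share(60, 30))   # 0.5
```

Under these assumptions, shrinking the window lowers latency but raises the share of time spent starting jobs, which is the inefficiency the slide points at.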
![Page 20: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/20.jpg)
S4
![Page 21: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/21.jpg)
S4
• A distributed, fault-tolerant, stream processing system
• Elastic
• Project started in November 2010, still experimental
Based on ZooKeeper
But things are moving fast!
![Page 22: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/22.jpg)
Where does S4 come from?
• Open Source project created by Yahoo!
• Initially built for relevant ad selection and clever positioning on webpages
  But designed to be generic enough
![Page 23: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/23.jpg)
Processing Element
[Diagram: Events Input → Processing Element → Events Output]
Your business logic goes here
![Page 24: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/24.jpg)
Processing Node
[Diagram: a Processing Node hosting Processing Element 1, Processing Element 2, ... Processing Element N]
![Page 25: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/25.jpg)
S4 Cluster
[Diagram: an Events Stream feeding Processing Node 1, Processing Node 2, ... Processing Node N, with ZooKeeper providing Cluster Management]
![Page 26: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/26.jpg)
Programming model
PhoneCallPE
Accepts events with:
  Type=PhoneCall
  KeyTuple: Id=15497
Event (input)
  Type: PhoneCall
  KeyTuple: «Id=15497»
  Value: <serialized object>
Event (output)
  Type: EnrichedPhoneCall
  KeyTuple: «Id=15497»
  Value: <serialized object>
A new ProcessingElement instance is created for each value of «Id»
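The one-instance-per-key rule can be illustrated with a small dispatcher. This is a plain-Python sketch of the idea, not the S4 API: `Event`, `PhoneCallPE`, `Dispatcher` and their methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Event:
    type: str
    key: tuple       # e.g. ("Id", 15497)
    value: object    # payload; the slide shows a serialized object here

class PhoneCallPE:
    """Hypothetical PE: one instance handles all events sharing a key value."""
    def __init__(self, key):
        self.key = key
        self.seen = 0

    def process(self, event):
        self.seen += 1
        # Business logic goes here; emit an enriched event with the same key.
        enriched = dict(event.value, enriched=True)
        return Event("EnrichedPhoneCall", event.key, enriched)

class Dispatcher:
    """Filters events by type and lazily creates one PE instance per key value."""
    def __init__(self, accepted_type, pe_class):
        self.accepted_type = accepted_type
        self.pe_class = pe_class
        self.instances = {}

    def dispatch(self, event):
        if event.type != self.accepted_type:
            return None                      # event type not accepted by this PE
        pe = self.instances.get(event.key)
        if pe is None:
            pe = self.instances[event.key] = self.pe_class(event.key)
        return pe.process(event)

d = Dispatcher("PhoneCall", PhoneCallPE)
out = d.dispatch(Event("PhoneCall", ("Id", 15497), {"duration_min": 13}))
d.dispatch(Event("PhoneCall", ("Id", 15497), {"duration_min": 2}))
d.dispatch(Event("PhoneCall", ("Id", 15498), {"duration_min": 5}))
print(out.type, len(d.instances))   # EnrichedPhoneCall 2
```

Because each instance only ever sees one key, it can keep per-key state (here the `seen` counter) without any locking across keys.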
![Page 27: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/27.jpg)
An indexing pipeline with S4
[Diagram: ReRoutingPE → TextExtractionPE (×2) → ReRoutingPE → ClassificationPE (×2) → MergingPE]
Handles incoming events and load-balances them according to partitioning
![Page 28: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/28.jpg)
An indexing pipeline with S4
[Diagram: ReRoutingPE → TextExtractionPE (×2) → ReRoutingPE → ClassificationPE (×2) → MergingPE]
Handles result events and load-balances between Processing Nodes
![Page 29: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/29.jpg)
An indexing pipeline with S4
[Diagram: ReRoutingPE → TextExtractionPE (×2) → ReRoutingPE → ClassificationPE (×2) → MergingPE]
Handles final result events and pushes them to the Indexer
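The three roles described on these slides (rerouting by partition, per-stage workers, and a final merge that feeds the indexer) can be mimicked in a single process. The sketch below is an assumption-laden illustration, not S4 code: `Stage`, `partition`, `text_extraction`, `classification` and `run_pipeline` are hypothetical names, the stages run sequentially rather than on separate nodes, and the CRC32-based partitioning is just one possible choice.

```python
import zlib

def partition(key, n):
    """Stable partition number for a key (CRC32 modulo the worker count)."""
    return zlib.crc32(key.encode("utf-8")) % n

def text_extraction(event):
    # Stand-in for TextExtractionPE: trims the raw payload.
    return dict(event, text=event["raw"].strip())

def classification(event):
    # Stand-in for ClassificationPE: a single hard-coded rule.
    category = "Accounting" if "invoice" in event["text"].lower() else "Other"
    return dict(event, category=category)

class Stage:
    """A rerouting step in front of `n` identical workers."""
    def __init__(self, worker, n):
        self.worker = worker
        self.n = n
        self.load = [0] * n      # events handled per worker, for inspection

    def handle(self, event):
        self.load[partition(event["id"], self.n)] += 1
        return self.worker(event)

def run_pipeline(events, indexer):
    extract = Stage(text_extraction, 2)
    classify = Stage(classification, 2)
    for event in events:
        # Merging step: final result events are pushed to the indexer.
        indexer.append(classify.handle(extract.handle(event)))
    return extract.load, classify.load

indexed = []
loads = run_pipeline(
    [{"id": "15497", "raw": " Invoice not received "}, {"id": "15498", "raw": "Hello"}],
    indexed,
)
print([e["category"] for e in indexed])   # ['Accounting', 'Other']
```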
![Page 30: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/30.jpg)
Some drawbacks
• The system is lossy
  Events may be lost when nodes are overloaded or during failure
• A workaround is to increase the incoming queue size of nodes
  But still, events may be lost during failure
• Still experimental
  But very promising
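The effect of the queue-size workaround can be seen in a small simulation. This sketch assumes a node drops newly arriving events once its bounded incoming queue is full; that drop policy, the `LossyQueue` class and all the numbers are assumptions for illustration, not details confirmed by the talk.

```python
from collections import deque

class LossyQueue:
    """Bounded incoming queue that drops new events when full (assumed policy)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def offer(self, event):
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(event)
        return True

    def poll(self):
        return self.items.popleft() if self.items else None

def simulate(capacity, burst, drained_per_tick, ticks):
    """Each tick, `burst` events arrive, then the node processes `drained_per_tick`."""
    q = LossyQueue(capacity)
    for t in range(ticks):
        for i in range(burst):
            q.offer((t, i))
        for _ in range(drained_per_tick):
            q.poll()
    return q.dropped

print(simulate(capacity=10, burst=8, drained_per_tick=5, ticks=10))    # 25
print(simulate(capacity=100, burst=8, drained_per_tick=5, ticks=10))   # 0
```

A larger queue absorbs a temporary overload, but events still sitting in the queue are gone if the node fails, and a sustained overload eventually fills any finite queue.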
![Page 31: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/31.jpg)
More: Real-Time Inverted Search
Search: MyCustomer
[Document] 2010 Sales Report, 1 month ago: ... MyCustomer: 12 M€ with 3 deals ...
[Phone Call] Phone Call, 3 seconds ago: Customer: MyCustomer; Time: 9:55am; Duration: 13min; Description: Invoice not received for order #2354E
20 new results...
Categories: Sales, Legal, Accounting
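The slide does not say how inverted search is implemented. One common reading is that queries are stored and each incoming document is matched against them, so an open results page can be told that new results exist. The sketch below follows that assumption; `InvertedSearch`, `tokenize` and the all-terms-must-match rule are hypothetical choices for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

class InvertedSearch:
    """Stores queries and matches each incoming document against them."""
    def __init__(self):
        self.queries = {}                      # query id -> set of required terms
        self.new_results = defaultdict(int)    # query id -> number of new matches

    def register(self, query_id, text):
        self.queries[query_id] = set(tokenize(text))

    def on_document(self, text):
        terms = set(tokenize(text))
        # A query matches when all of its terms occur in the document.
        matched = [q for q, required in self.queries.items() if required <= terms]
        for q in matched:
            self.new_results[q] += 1
        return matched

s = InvertedSearch()
s.register("q1", "MyCustomer")
s.register("q2", "MyCustomer invoice")
s.on_document("Customer: MyCustomer Description: Invoice not received")
s.on_document("MyCustomer: 12 M EUR with 3 deals")
print(dict(s.new_results))   # {'q1': 2, 'q2': 1}
```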
![Page 32: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/32.jpg)
Summary
• S4 is a nice processing system for real-time search
• Not only at indexing time, but also at query time!
  As S4 ensures low latency, query-time processing is possible
• A promising roadmap...
  Better failure handling, client API in major languages, initial processing with Hadoop, ...
![Page 33: FOSDEM (feb 2011) - A real-time search engine with Lucene and S4](https://reader034.vdocuments.net/reader034/viewer/2022052619/55635f2ad8b42a734b8b4dac/html5/thumbnails/33.jpg)
Questions / Answers
@mfiguiere
blog.xebia.fr