Developing a Big Data search engine: Where we have gone, where we are going
Post on 14-Jul-2015
© Cloudera, Inc. All rights reserved.
Mark Miller
Developing a Big Data search engine: Where we have gone, where we are going.
I’m Mark Miller
• I’m a Lucene junkie (2006)
• I’m a Lucene committer (2008)
• And a Solr committer (2009)
• And the current Lucene PMC Chair (2014)
• And a member of the ASF (2011)
• I co-created SolrCloud (????)
A Quick Tour Through History
• First there was Lucene.
• It took a little while, but soon it was ‘good enough’ to replace most enterprise search engines. And faster. And more efficient.
• Lots of search engines were built on Lucene (I made one!)
• Then there was Solr.
• And then there were others.
What Search Engines Matter?
• Lucene search engines lead the pack.
• How can you tell?
• I like to look at db-engines.org
• Also, there is plenty of anecdotal evidence that others are using Lucene for the core.
Two Lucene-based search engines in the top 15. No other search engines.
“It is hopeless to talk to both of you, you don't understand virtual memory.”
Uwe Schindler @thetaph1 @uwesays
What is the future of Search?
• More NoSQL
• More Realtime Analytics
• More System of Record
• More Scale
• Search will eat away at the stack.
• Search focuses on preprocessing and efficient in-memory data structures for fast responses.
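As an illustration of the "preprocessing plus in-memory data structures" point, here is a minimal sketch of an inverted index. It is a toy and not how Lucene stores its index; the function names `build_index` and `search` and the sample documents are made up for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Preprocess documents once into an in-memory inverted index:
    term -> set of ids of the documents that contain the term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Look up each term's posting set; unknown terms match nothing.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample data, for illustration only.
docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server built on Lucene",
    3: "Hadoop stores data at scale",
}
index = build_index(docs)
matches = search(index, "lucene search")
```

The work is done up front at index time, so a query only needs a few dictionary lookups and a set intersection instead of a scan over every document.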
The Solr Start - Single node, then DIY distributed
• Solr started as a single node solution, followed by master->slave replication, followed by simple distributed search.
• This was ‘good enough’ for a long time.
• Classic ‘innovator’s dilemma’ problem.
• Scaling out was super important, but not as soon as some thought, and sooner than others thought.
SolrCloud Meets Hadoop
• First-class integrations with:
  • HDFS
  • MapReduce
  • Spark
  • Flume
  • HBase
  • Etc.
Now it’s all about scale and correctness.
• The search features for the big data world are here and rapidly advancing.
• The next step is being able to handle Hadoop scale in the ‘general’ case.
• And to be able to handle that correctly ‘enough’ of the time.
“In my opinion the whole code is a bug by itself.”
Uwe Schindler @thetaph1 @uwesays
The Call Me Maybe Tests
• https://aphyr.com/tags/jepsen
• Some basic testing around how systems live up to their CAP promises. Heavy focus on partitions.
• Most systems fail pretty badly. ZooKeeper rocked it. SolrCloud did pretty darn well*.
Call Me … Maybe??
• Passing is actually like a very minimum bar. It doesn’t at all mean your system is correct.
• Your system could be complete crap and still pass.
• In fact, in the general case, all the current best search engines are still flakey at scale.
Search at Scale is still Flakey?
• Yes, yes it is. Most systems at scale are still flakey. Most systems don’t deliver on their promises.
• How does search in particular get away with it?
• Users are already used to not considering it the system of record.
• It’s easier to scale specialized than general - the project scales general, massive users scale specialized.
• We want the project to easily scale generally - no expertise needed. You can already scale pretty large, but it takes a ‘vertical’ and expertise.
Search In Particular is HARD.
• The search engine is a many-faceted beast.
• There is a lot of surface area.
• And you want it all to work, and all to work realtime, and all to integrate well together.
"Lucene is maybe the world's most tested open source project."
Uwe Schindler @thetaph1 #bbuzz 2014
Lucene Testing Framework
• Lucene regularly finds bugs in new Java releases.
• Seriously. Regularly.
• Many of those bugs are fixed and fixed quickly. Many are not.
• Randomized testing, reproducible master seeds.
• “Test Beasting” and SETI@home type resource requirements.
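The idea behind randomized testing with reproducible master seeds, and behind "beasting" a test by running it many times, can be sketched roughly as follows. This is a toy in Python, not the Lucene test framework or the Carrot Search randomizedtesting API; `derive_seed`, `beast`, `check_sort`, and the deliberately buggy `buggy_sort` are hypothetical names invented for the example.

```python
import random

def derive_seed(master_seed, test_name, iteration):
    """Derive a per-run seed deterministically from the master seed,
    so one number is enough to replay an entire test session."""
    return random.Random(f"{master_seed}:{test_name}:{iteration}").getrandbits(64)

def buggy_sort(values):
    """Toy code under test: it silently drops duplicates, which is wrong for a sort."""
    return sorted(set(values))

def check_sort(rng):
    """One randomized test case. All randomness comes from rng, never from global state."""
    values = [rng.randint(0, 50) for _ in range(rng.randint(0, 8))]
    assert buggy_sort(values) == sorted(values), f"input={values}"

def beast(test, master_seed, iterations):
    """Run a test many times with derived seeds and collect the seeds that fail."""
    failing = []
    for i in range(iterations):
        seed = derive_seed(master_seed, test.__name__, i)
        try:
            test(random.Random(seed))
        except AssertionError:
            failing.append(seed)
    return failing

# The master seed 42 is arbitrary; rerunning with the same value replays the same inputs.
failing_seeds = beast(check_sort, master_seed=42, iterations=200)
```

A bug that only shows up for some inputs may pass on any single run, which is why repeating the test matters. Each failing seed can then be fed back into `check_sort(random.Random(seed))` to reproduce that exact failure.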
Lucene Testing Framework
• Code checkers and build enforcers galore, as well as test level checkers and enforcers.
• Who is policing the policeman?
• You need a vibrant community that gives a damn.
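To make the "code checkers and build enforcers" bullet concrete, here is a rough sketch of a checker that fails a build when source code calls something on a deny-list. It is only an analogy written in Python: the forbidden-apis tool linked at the end of this deck works on Java code, and nothing here reflects its implementation. The deny-list entries and the names `FORBIDDEN`, `dotted_name`, and `find_forbidden_calls` are assumptions for the example.

```python
import ast

# Hypothetical deny-list for illustration: dotted call name -> why it is banned.
FORBIDDEN = {
    "time.time": "use an injected clock so tests stay reproducible",
    "random.random": "use a seeded random.Random instance",
}

def dotted_name(node):
    """Return 'a.b.c' for a plain Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None

def find_forbidden_calls(source):
    """Return sorted (line, name, reason) tuples for every call to a deny-listed name."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in FORBIDDEN:
                violations.append((node.lineno, name, FORBIDDEN[name]))
    return sorted(violations)

# Sample source to check; the last line uses a seeded instance and is allowed.
sample = """import time, random
start = time.time()
x = random.random()
y = random.Random(7).random()
"""
violations = find_forbidden_calls(sample)
```

A build step could fail whenever `violations` is non-empty, which turns a coding convention into something the build enforces rather than something reviewers have to remember.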
“The stack trace is only impossible if you look at the code.”
Uwe Schindler @thetaph1 @uwesays
Testing is the Key and Answer
• Just because your tests don't normally fail doesn't mean they are great. You probably just don’t normally see the problems.
• Our test framework exposes the problems - quickly.
• This has pluses and minuses, but the pluses greatly outweigh the minuses!
More on Testing
• Integration and unit tests are equally important.
• Integration tests are a little more important.
• Testing, testing, and more testing is your best friend.
• Communities grow, communities change, one or two can’t hold the code together.
Regular Large Scale Testing will be a challenge!
[Figure: 1000 nodes with SolrCloud Radial View]
The race for scalable search is on!
• My approach will be to leverage Hadoop as much as possible!
• Many companies are focused on Solr - there will be many approaches!
• It’s still early in the game.
Leverage Hadoop?
• A distributed filesystem is a beautiful crutch to lean on!
• Loading data at scale by itself is not Solr’s strength.
• Hadoop will push Solr to its limits and beyond.
"As a good policeman I have all open source ‘guns’ for code checking available."
Uwe Schindler @thetaph1 @uwesays
https://code.google.com/p/forbidden-apis/
http://labs.carrotsearch.com/randomizedtesting.html