Building Enterprise Search Engines Using Open Source Technologies

www.anant.us | [email protected] | 202.905.2818 | 1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007

Upload: anant-corporation

Post on 15-Apr-2017


Page 1: Building Enterprise Search Engines using Open Source Technologies


Large Scale Search with Open Source Technologies

Building Search Engines

Page 2: Building Enterprise Search Engines using Open Source Technologies

What do we do?

Streamline, Organize & Unify

Business Information

Page 3: Building Enterprise Search Engines using Open Source Technologies

Agenda

• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/Solr
• Custom Parser - Written in Scala

Page 4: Building Enterprise Search Engines using Open Source Technologies

Challenge – Why does this matter?

[Diagram: Knowledge, showing knowledge categories, the systems that hold them, and the Appleseed Framework (Portal, Base, Search)]

• Knowledge categories: Project Information, Client Service Information, Corporate Guides, Collaborative Documents, Assets & Files, Corporate Resources
• Systems shown: G Drive, G Drive Delta, Dropbox, Nutshell, Freshbooks, G Sites (KB), Workflowy, Evernote, OwnCloud, Pocket, Leaves, AIC (WP), Anant (WP)

Page 5: Building Enterprise Search Engines using Open Source Technologies

Search Engine – 30 Thousand Foot View

The search index is only as good as your processed data. If you put everything you find in your index, you are going to spend a lot of time telling people how to search. 

Page 6: Building Enterprise Search Engines using Open Source Technologies

Lucene – More than meets the eye

WhoNext?

Think of it like a “NoSQL” database that has great indexing, everywhere.
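To make the "great indexing" point concrete, here is a minimal sketch of an inverted index, the general data structure behind full-text search. It is plain Java and does not use the Lucene API; the class name `TinyInvertedIndex`, its methods, and the sample documents are hypothetical and exist only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): maps each lower-cased term to the ids of documents containing it. */
final class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Splits the text on non-alphanumeric characters and records each term against docId. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents that contain every query term (boolean AND). */
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            SortedSet<Integer> docs = postings.getOrDefault(term, Collections.emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its posting list
            } else {
                result.retainAll(docs);         // later terms: intersect
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Project information for client services");
        index.add(2, "Corporate guides and client documents");
        System.out.println(index.search("client"));        // prints [1, 2]
        System.out.println(index.search("client guides")); // prints [2]
    }
}
```

A lookup only touches the posting lists of the query terms, which is why a query stays cheap even as the number of indexed documents grows.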

Page 7: Building Enterprise Search Engines using Open Source Technologies

Cassandra – NoSQL With Structure

WhoNext?

Think of it like a “NoSQL” database that has structure. Using CQL, you can INSERT INTO and SELECT FROM, just not JOIN.

Page 8: Building Enterprise Search Engines using Open Source Technologies

Spark – Way Better MapReduce

WhoNext?

Think of it like MapReduce, if MapReduce had been created with Scala instead of Java, and with streams. It can also be up to 100 times faster than Hadoop MapReduce for in-memory workloads.
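The "MapReduce with streams" analogy can be illustrated without Spark. The sketch below is a word count written with plain Java streams on a single JVM: the `flatMap` step plays the role of the map phase and the grouping collector plays the role of the reduce phase. It does not use the Spark API and is not distributed; the class name `StreamWordCount` and the sample lines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Word count in the map/reduce style using java.util.stream (illustration only, not Spark). */
final class StreamWordCount {

    /** Maps each line to its words, then reduces by key to a sorted word-to-count map. */
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // "map" phase: emit one lower-cased word per whitespace-separated token
                .flatMap(line -> Arrays.stream(line.toLowerCase(Locale.ROOT).split("\\s+")))
                .filter(word -> !word.isEmpty())
                // "reduce" phase: group identical words and count them
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("search engines index data", "Search data fast");
        System.out.println(countWords(lines)); // prints {data=2, engines=1, fast=1, index=1, search=2}
    }
}
```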

Page 9: Building Enterprise Search Engines using Open Source Technologies

Configuring - Solr - 1/3

Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.

• Configuration - Schema
  – Data Types
  – Pre-Processing
  – Collection Definitions
  – Managed vs. Unmanaged

• Configuration - ZooKeeper
  – Synchronize Configurations
  – Distribute Shards
  – Manage Replicas
  – Elect Leaders

• Configuration - SolrConfig
  – Handlers
  – Components
  – Indexing Configurations
  – Memory / Cache
  – File System

• Lessons Learned
  – Try to use out of the box
  – Try to configure your way
  – Make sure to upgrade
  – Not everything can be configured

Page 10: Building Enterprise Search Engines using Open Source Technologies

Configuring - Solr - 2/3

• Before Docker
  – Set up ZooKeeper
    • Customize zoo.cfg
    • Set up ZooKeeper Servers
  – Set up Solr Distro
    • Download Solr
    • Clean up Solr
    • Customize Schema.xml
    • Customize SolrConfig.xml
    • Set up Different Solr Servers
  – Start the Cloud
    • Custom Start Scripts

• Today w/ Docker

    docker run --name zookeeper \
      -p 127.0.0.1:2181:2181 \
      -p 127.0.0.1:2888:2888 \
      -p 127.0.0.1:3888:3888 \
      jplock/zookeeper

    docker run --link zookeeper:ZK -i \
      -p 127.0.0.1:8983:8983 \
      -t dockerimages/docker-solr \
      /bin/bash -c '\
      cd /opt/solr/example; \
      java -jar \
      -Dbootstrap_confdir=./solr/collection1/conf \
      -Dcollection.configName=myconf \
      -DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT \
      -DnumShards=2 \
      start.jar';

https://hub.docker.com/r/dockerimages/docker-solr/

https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud

https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production

Page 11: Building Enterprise Search Engines using Open Source Technologies

Configuring - SolR - 3/3

• SolrConfig - Example
• Schema - Example

https://cwiki.apache.org/confluence/display/solr/Configuring+solrconfig.xml

https://wiki.apache.org/solr/SchemaXml

Page 12: Building Enterprise Search Engines using Open Source Technologies

SolrCloud / ZooKeeper

Page 13: Building Enterprise Search Engines using Open Source Technologies

User Interface - Super Advanced

Page 14: Building Enterprise Search Engines using Open Source Technologies

Customizing - Solr - 1/3

Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.

• Customization - Parsing
  – Need Specialized Syntax?
  – Java or Scala Based
  – Open Plugin Structure
  – Several Examples

• Customization - Highlighting
  – Need Special Coloring?
  – Specialized Syntax Aware
  – Open Plugin Structure
  – Several Examples

• Customization - Term Counts
  – Need Specific Information?
  – Create the Logic
  – Register the Component
  – Complicated Examples

• Lessons Learned
  – Major version upgrades = pain
  – Newer classes can be extended better
  – Long term investment

Page 15: Building Enterprise Search Engines using Open Source Technologies

Customizing - SolR - 2/3

• Custom Component in Scala or Java
• Installing the Component

http://wiki.apache.org/solr/SolrPlugins

http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html

Page 16: Building Enterprise Search Engines using Open Source Technologies

Customizing - SolR - 3/3

Page 17: Building Enterprise Search Engines using Open Source Technologies

Creating a Custom Parser with Scala

Building a parser in Scala wasn’t my first choice, but creating it in Scala made me see how much better the language is.

• Why a Specialized Syntax?
  – Legacy Syntax
  – Boolean AND Proximity Queries
  – Specialized Fielded Expressions
  – Ranges / Classifications

• Why not ANTLR or JavaCC?
  – Old Parser was in Parboiled(1)
  – Parboiled2 was in Scala
  – No need to learn a separate Syntax for Creating Syntax

• Lessons Learned
  – Parboiled2 Documentation = bad
  – Understand the syntax
  – Interactive REPL in Scala = good
  – Write tons of unit tests
  – Long term investment

• Customizing Solr with Scala
  – Found a good Virtual Mentor
  – Learned Scala (not for Spark)
  – Started from the ground up
  – Reduced from ~1k to 400 LOC
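As an illustration of what a parser for boolean and proximity queries has to do, here is a minimal hand-written recursive-descent sketch in plain Java. It is not the parser from the talk and does not use Parboiled2, JavaCC, or the Solr plugin API. The syntax (`AND`, `OR`, `NEAR/<n>`, parentheses) and the class name `TinyQueryParser` are hypothetical, chosen only to show how operator precedence maps onto parsing methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Recursive-descent parser for a hypothetical query syntax; returns the tree in prefix notation. */
final class TinyQueryParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private TinyQueryParser(String input) {
        // Tokens are "(", ")" or any run of characters without whitespace or parentheses.
        Matcher m = Pattern.compile("\\(|\\)|[^\\s()]+").matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
    }

    /** Parses the whole query, e.g. "a AND b" becomes "AND(a, b)". */
    static String parse(String input) {
        TinyQueryParser p = new TinyQueryParser(input);
        String tree = p.or();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.tokens.get(p.pos));
        }
        return tree;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    // or := and ("OR" and)*   (lowest precedence)
    private String or() {
        String left = and();
        while ("OR".equals(peek())) {
            pos++;
            left = "OR(" + left + ", " + and() + ")";
        }
        return left;
    }

    // and := near ("AND" near)*
    private String and() {
        String left = near();
        while ("AND".equals(peek())) {
            pos++;
            left = "AND(" + left + ", " + near() + ")";
        }
        return left;
    }

    // near := atom ("NEAR/<n>" atom)*   (proximity binds tightest)
    private String near() {
        String left = atom();
        while (peek() != null && peek().matches("NEAR/\\d+")) {
            String distance = tokens.get(pos++).substring("NEAR/".length());
            left = "NEAR" + distance + "(" + left + ", " + atom() + ")";
        }
        return left;
    }

    // atom := WORD | "(" or ")"
    private String atom() {
        String token = peek();
        if (token == null) {
            throw new IllegalArgumentException("Unexpected end of query");
        }
        pos++;
        if (token.equals("(")) {
            String inner = or();
            if (!")".equals(peek())) {
                throw new IllegalArgumentException("Missing ')'");
            }
            pos++;
            return inner;
        }
        if (token.equals(")") || token.equals("AND") || token.equals("OR") || token.matches("NEAR/\\d+")) {
            throw new IllegalArgumentException("Expected a term but found: " + token);
        }
        return token.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints AND(NEAR3(search, engine), OR(solr, lucene))
        System.out.println(parse("search NEAR/3 engine AND (solr OR lucene)"));
    }
}
```

Each grammar rule becomes one method, so adding fielded expressions or ranges would mean adding rules at the appropriate precedence level. A parser-combinator library such as Parboiled2 expresses the same rules declaratively, which is one plausible reason for the smaller line count reported above.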

Page 18: Building Enterprise Search Engines using Open Source Technologies

JavaCC vs. parboiled2 (Scala)

• JavaCC - SurroundQuery.jj
• Scala based Parboiled2

Page 19: Building Enterprise Search Engines using Open Source Technologies

Questions & Contact


@anantcorp

facebook.com/anantCorp

linkedin.com/company/anant

[email protected]/in/xingh

Rahul Singh, CEO & Founder

Questions & Contact

• Brown Bag Session or Meetup?
• Modern Enterprise
• Mastering Services in the Service of Others
• Hybrid Agile Project Management
• Building Search Engines
• CICD / DevOps
• Connecting Internet Software

Page 20: Building Enterprise Search Engines using Open Source Technologies


Streamlined Data
Integration / Data Pipelines

Organized Knowledge
Search / Data Warehouses

Unified Interfaces
Portals / Dashboards / Mobile