Download - SolrCloud on Hadoop
![Page 1: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/1.jpg)
1
SolrCloud on Hadoop Cleveland Big Data and Hadoop User Group, January 2014 Alex Moundalexis [email protected] @technmsg
![Page 2: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/2.jpg)
Disclaimer
• Technologies, not products • Cloudera builds things soHware
• most donated to Apache • some closed-‐source
• I will likely menMon “Cloudera Something” • Cloudera “products” I reference are open source
• Apache Licensed • Source code is on GitHub
• hRps://github.com/cloudera
2
![Page 3: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/3.jpg)
What This Talk Isn’t About
• Deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor
• Sizing & Tuning • Depends heavily on data and workload
• Coding • Unless you count XML or CSV
• Algorithms
3
![Page 4: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/4.jpg)
4
Quick and dirty, more Mme for use cases.
The Apache Hadoop Ecosystem
![Page 5: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/5.jpg)
Why “Ecosystem?”
• In the beginning, just Hadoop • HDFS • MapReduce
• Today, dozens of interrelated components • I/O • Processing • Specialty ApplicaMons • ConfiguraMon • Workflow
5
![Page 6: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/6.jpg)
ParMal Ecosystem
6
Hadoop
external system
RDBMS / DWH
web server
device logs
API access
log collecMon
DB table import
batch processing
machine learning
external system
API access
user
RDBMS / DWH
DB table export
BI tool + JDBC/ODBC
Search
SQL
![Page 7: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/7.jpg)
HDFS
• Distributed, highly fault-‐tolerant filesystem • OpMmized for large streaming access to data • Based on Google File System
• hRp://research.google.com/archive/gfs.html
7
![Page 8: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/8.jpg)
Lots of Commodity Machines
8
Image:Yahoo! Hadoop cluster [ OSCON ’07 ] Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
Image:Yahoo! Hadoop cluster [ OSCON ’07 ] Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
Image:Yahoo! Hadoop cluster [ OSCON ’07 ] Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
![Page 9: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/9.jpg)
MapReduce (MR)
• Programming paradigm • Batch oriented, not realMme • Works well with distributed compuMng • Lots of Java, but other languages supported • Based on Google’s paper
• hRp://research.google.com/archive/mapreduce.html
9
![Page 10: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/10.jpg)
Under the Covers
![Page 11: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/11.jpg)
You specify map() and reduce() functions. ���
���The framework does the
rest. 60
![Page 12: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/12.jpg)
Apache HBase
• Random, realMme read/write access • Key/value columnar store • (b|tr)illions of rows/columns • Based on Google BigTable
• hRp://research.google.com/archive/bigtable.html
12
![Page 13: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/13.jpg)
Cloudera Hue
• Hadoop User Experience • Hadoop is largely command line • Hue provides a UI for end-‐users • SDK to build your own apps on top
13
![Page 14: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/14.jpg)
Apache Tika
• Content analysis toolkit • Simply put, a lot of parsers • Detect/extract metadata/text from documents
• HTML • XML • Office • PDF • mbox • More…
14
![Page 15: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/15.jpg)
Apache ZooKeeper
• Distributed systems are HARD • Everyone was trying to implement the same subsystems • Bugs leads to race condiMons, other bad things
• ZK: Highly reliable distributed coordinaMon services • ConfiguraMon • Naming • SynchronizaMon • Group Services
15
![Page 16: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/16.jpg)
Cloudera Morphlines
• In-‐memory transformaMons • Load, parse, transform, process • Records as name-‐value pairs w/ opMonal blob/pojo objects
• Java library, embedded in your codebase • Used to ETL data from Flume and MR into Solr
• Was part of CDK, now part of Kite • hRp://kitesdk.org
16
![Page 17: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/17.jpg)
Apache Lucene
• Java-‐based index and search • ranked or sorted results • hits streamed through QP
• mem(results) < mem(collecMon)
• rich/extensible query operators • bool, phrase, range, span, spaMal
• Features • spellchecking • hit highlighMng • tokenizaMon
17
![Page 18: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/18.jpg)
Apache Solr
• Enterprise search plaporm • Based on Apache Lucene
• Full-‐text search • FaceMng • NRT indexing • UI
18
![Page 19: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/19.jpg)
Apache Solr – Simple Indexing via CLI
$ java -‐jar post.jar solr.xml money.xml SimplePostTool: version 1.4 SimplePostTool: POSTing files to http://localhost:8983/solr/update.. SimplePostTool: POSTing file solr.xml SimplePostTool: POSTing file money.xml SimplePostTool: COMMITting Solr index changes.. $ post.sh *.xml
19
![Page 20: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/20.jpg)
Apache Solr – Document money.xml
<add> <doc> <field name="id">USD</field> <field name="name">One Dollar</field> <field name="manu">Bank of America</field> <field name="manu_id_s">boa</field> <field name="cat">currency</field> <field name="features">Coins and notes</field> <field name="price_c">1,USD</field> <field name="inStock">true</field> </doc> <doc> <field name="id">EUR</field> <field name="name">One Euro</field>
20
![Page 21: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/21.jpg)
Apache Solr – More Advanced Indexing
• From DB, using Data Import Handler (DIH) • Load a CSV file • POST JSON documents • Index binary documents (uses Tika) • SolrJ for programmaMc document creaMon
21
![Page 22: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/22.jpg)
Apache Solr – Querying
• HTTP GET • hRp://solr:8983/solr/collecMon1/select/
• Examples • ?q=Mmestamp:[* TO NOW] • ?q=-‐instock:false • ?q={!lucene q.op=AND df=text}myfield:foo +bar -‐bat
22
![Page 23: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/23.jpg)
Apache Solr – Querying
• HTTP GET • hRp://solr:8983/solr/collecMon1/select/?q=video
• Examples • &fl=name,id (return only name and id fields) • &fl=name,id,score (return relevancy score as well) • &fl=*,score (return all fields + relevancy score) • &sort=price desc&fl=name,id,price (sort by price desc) • &wt=json (return response in JSON format)
23
![Page 24: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/24.jpg)
What the Heck is FaceMng?
• Generate counts for properMes or categories • Links allow drill-‐down or refine search results
What?
24
![Page 25: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/25.jpg)
Facets on Amazon.com
25
![Page 26: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/26.jpg)
Apache Solr – Facets at Query Time
• HTTP GET • hRp://solr:8983/solr/collecMon1/select/?q=video • All docs, count by category q=*:*&facet=true&facet.field=cat
• All docs, count by category and in-‐stock status q=*:*&facet=true&facet.field=cat&facet.field=inStock
• Docs matching “ipod”, count by price (above/below $100) q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]
26
![Page 27: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/27.jpg)
Apache Solr – Querying via UI
27
![Page 28: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/28.jpg)
Apache SolrCloud
• IntegraMon of Solr + ZooKeeper • Provides for shard failover
28
![Page 29: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/29.jpg)
Cloudera Search
• Based on Apache Solr (incl Lucene and SolrCloud) • Fault-‐tolerance: collecMons backed by HDFS or HBase • IntegraMon galore:
• HBase/Flume/MapReduce w/ Lucene • Hue w/ Solr • Avro w/ Tika • HDFS w/ Solr/Lucene • Sentry w/ Solr
29
![Page 30: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/30.jpg)
Cloudera Search + Hue
30
![Page 31: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/31.jpg)
Cloudera Search + Hue
31
![Page 32: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/32.jpg)
32
Apologies, I swiped some preRy slides from markeMng…
Why Search?
![Page 33: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/33.jpg)
Search Design Strategy
33
One pool of data
One security framework
One set of system resources
One management interface
An Integrated Part of the Hadoop System
Storage
Integra5on
Resource Management
Metad
ata
Batch Processing MAPREDUCE, HIVE & PIG
…
HDFS HBase
TEXT, RCFILE, PARQUET, AVRO, ETC. RECORDS
Engines
InteracMve SQL
CLOUDERA IMPALA
InteracMve Search CLOUDERA SEARCH
Machine Learning MAHOUT
Math & Sta5s5cs
SAS, R
![Page 34: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/34.jpg)
Benefits of Search IntegraMon
34
Improved Big Data ROI § An interacMve experience without technical knowledge § Single data set for mulMple compuMng frameworks
Faster Time to Insight § Exploratory analysis, esp. unstructured data § Broad range of indexing opMons to accommodate needs
Cost Efficiency § Single scalable plaporm; no incremental investment § No need for separate systems, storage
Solid Founda5ons & Reliability § Solr in producMon environments for years § Hadoop-‐powered reliability and scalability
![Page 35: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/35.jpg)
35
Some quick examples.
Search Use Cases
![Page 36: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/36.jpg)
Search Use Cases
36
Offer easy access to non-‐technical resources
Explore data prior to processing and modeling
Gain immediate access and find correlaMons in mission-‐criMcal data
Powerful, proven search capabili5es that let organiza5ons:
![Page 37: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/37.jpg)
Monsanto
37
Scalable, efficient image search for analysis and research
Track plant characterisMcs throughout their lifecycle
Before: Manual aRribute extracMon and search queries within database
Now: Parse and index images at acquisiMon and on demand, index archived images in batch
![Page 38: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/38.jpg)
38
Cloudera: Internal Field Portal
Custom Aggregated Search
![Page 39: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/39.jpg)
Cloudera – Internal Field Portal
• Single stop for field engineers • Mailing lists: public, private • Tickets: support, development, public ASF • Customer data: accounts, clusters, KB arMcles • Customer Clusters: configs, audits, logs, events • Books and papers • Discussion forums
• Dogfooding, yes • Makes my life easier
39
![Page 40: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/40.jpg)
Cloudera – Internal Field Portal
40
![Page 41: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/41.jpg)
Cloudera – Internal Field Portal
• Varied fetchers/observers for web/API content • Content is retrieved via Flume, Sqoop
• Search indexes and replicates into HBase • Each collecMon has collecMon-‐specific filters/fields • Provides Mtle, content snippet, link to original
• Morphlines extracts books and papers using Tika • Impala for analyMcs
• Future: Use MapReduce to ingest logs
41
![Page 42: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/42.jpg)
42
ParMng thoughts… in no parMcular order.
Summary
![Page 43: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/43.jpg)
Search Simplifies InteracMon
43
Explore
Navigate
Correlate Experts know MapReduce. Savvy people know SQL.
Everyone knows Search.
![Page 44: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/44.jpg)
Summary
• With Hadoop, it depends. • The tools are out there. • Open source soHware, hooray!
• Many interconnected pieces • Many unexplored opportuniMes • A thriving community awaits you…
• Data can make a difference. • Search allows everyone to interact with data.
• This is a Big Deal.
44
![Page 45: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/45.jpg)
What’s Next?
• Search examples • hRp://blog.cloudera.com/blog/category/search/
• Cloudera provides pre-‐loaded VMs • hRp://Mny.cloudera.com/quickstartvm
• Clone our repos! • hRps://github.com/cloudera
45
![Page 46: SolrCloud on Hadoop](https://reader035.vdocuments.net/reader035/viewer/2022081400/554a0caeb4c905825d8b4648/html5/thumbnails/46.jpg)
46
Preferably related to the talk…
QuesMons?