Hortonworks Technical Workshop - HDP Search
TRANSCRIPT
Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP Search Workshop Hortonworks. We do Hadoop.
1/29/2013
Page 2
Agenda
• Hortonworks Data Platform 2.2
• Apache Solr
• Query & Ingest Documents with Apache Solr
• Solr & Hadoop
• Index on HDFS
• MapReduce, Hive & Pig
• Solr Cloud
• Sizing
• Demo
Page 3
HDP 2.2
Page 4
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
(Architecture diagram: YARN, the Data Operating System for cluster resource management, sits at the center of HDP on top of HDFS. Batch, interactive and real-time engines run on YARN: Pig (script), Hive on Tez (SQL), Cascading (Java/Scala) on Tez, Storm (stream), Solr (search), HBase and Accumulo (NoSQL) via Slider, Spark (in-memory), and ISV engines. Surrounding the core: data workflow, lifecycle and governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); operations (Ambari for provision/manage/monitor, ZooKeeper, Oozie for scheduling); and security (authentication, authorization, accounting, data protection; Knox and Ranger at the cluster level; HDFS for storage, YARN for resources, Hive and others for access, Falcon for pipelines). Deployment choice: Linux, Windows, on-premises or cloud.)
• YARN is the architectural center of HDP
• Enables batch, interactive and real-time workloads
• Provides comprehensive enterprise capabilities
• The widest range of deployment options
• Delivered completely in the open
Page 5
HDP 2.2: Reliable, Consistent & Current
HDP is Apache Hadoop, not “based on” Hadoop
Page 6
HDP Search
HDP 2.2 contains support for:
• Apache Solr 4.10 with Lucene
• Banana (time-series visualization)
• Lucidworks Hadoop connector
Page 7
Apache Solr
Page 8
What is Apache Solr
• A system built to search text
• A specialized type of database management system
• A platform to build search applications on
• Customizable, open source software
Page 9
Why Apache Solr
Specialized tools do the job better!
• For text search, Solr performs much better than a relational database
• Solr knows about languages
» E.g. lowercasing ὈΔΥΣΕΎΣ produces ὀδυσεύς
• Solr has features specific to text search
» E.g. highlighting search results
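The lowercasing example above can be reproduced with the JDK's locale-aware case mapping, which handles the same Greek special cases (the trailing capital sigma becomes the word-final form ς). This is a plain-JDK sketch of the idea, not Solr analyzer code:

```java
import java.util.Locale;

public class GreekLowercase {
    public static void main(String[] args) {
        // Full Unicode case mapping: the trailing capital sigma (Σ)
        // lowercases to the word-final form (ς), not the medial form (σ).
        String odysseus = "ὈΔΥΣΕΎΣ";
        System.out.println(odysseus.toLowerCase(Locale.ROOT)); // ὀδυσεύς
    }
}
```

Solr's language-specific analyzers layer tokenization, stemming and stop words on top of this kind of case handling.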
Page 10
Where does Apache Solr fit?
Page 11
Apache Solr’s Architecture
Page 12
Apache Solr’s inner Architecture
Page 13
Basics Of Inverted Index
Documents:

Doc ID | Content
1      | I like dog
2      | I like cat
3      | I like dog and cat

Inverted index:

Term | Doc IDs
I    | 1, 2, 3
like | 1, 2, 3
dog  | 1, 3
cat  | 2, 3

Question: find the documents containing both dog and cat.
Answer: intersect the postings lists for dog and cat: (1,3) ∩ (2,3) = (3).
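The build-and-intersect steps above can be sketched in a few lines. This is a minimal illustration of the data structure, not Solr/Lucene code, using the three example documents:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexDemo {

    // Build the inverted index: term -> sorted set of doc IDs
    static Map<String, TreeSet<Integer>> buildIndex(Map<Integer, String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            for (String term : e.getValue().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "I like dog");
        docs.put(2, "I like cat");
        docs.put(3, "I like dog and cat");

        Map<String, TreeSet<Integer>> index = buildIndex(docs);

        // "dog AND cat": intersect the two postings lists
        TreeSet<Integer> hits = new TreeSet<>(index.get("dog")); // {1, 3}
        hits.retainAll(index.get("cat"));                        // {2, 3}
        System.out.println(hits);                                // prints [3]
    }
}
```

Real engines keep postings lists sorted on disk and intersect them with skip lists rather than in-memory sets, but the principle is the same.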
Page 14
Solr Indexing
• Define the document structure using schema.xml
• Convert documents from their source format to a format supported by Solr (XML, JSON, CSV)
• Add the documents to Solr

Sample document in XML format:

<doc>
  <field name="id">1</field>
  <field name="screen_name">@thelabdude</field>
  <field name="cat">post</field>
</doc>
Page 15
Solr’s schema & fields
Before adding documents to Solr, you need to specify the schema, represented in a file called schema.xml. The schema declares:
• Fields
• The field used as the unique/primary key
• Field types
• How to index and search each field

Field types: in Solr, every field has a type, e.g. float, long, double, date, text.

Defining a field:
<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
Page 16
Dynamic Fields
Dynamic fields allow Solr to index fields that you did not explicitly define in your schema. A dynamic field is like a regular field, except its name contains a wildcard:

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>

For more field details see: http://wiki.apache.org/solr/SchemaXml
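Putting fields, field types, the unique key and a dynamic field together, a minimal schema.xml sketch in the Solr 4.x format might look as follows (the field names and analyzer chain here are illustrative, not taken from the workshop's collection):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema name="example" version="1.5">
  <fields>
    <!-- Unique key: single-valued, stored -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <!-- Any field ending in _i is indexed as an integer -->
    <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0"/>
    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
```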
Page 17
Query & Index Documents with Solr
Page 18
Adding & Deleting From Solr

Solr offers a REST-like interface for indexing and searching.

Add to index:
curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

Delete from index:
curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

CSV, JSON and XML are handled directly. Solr leverages Apache Tika for complex document types (PDF, Word, etc.).
Page 19
How To Query
Solr offers a REST-like interface for indexing and searching:
Query http://localhost:8983/solr/select?q=name:monsters&wt=json&indent=true
Page 20
Solr Java API (SolrJ)
Index:
SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("name", "doc1", 1.0f);
server.add(doc1); // You can also stream docs in a single HTTP request by providing an Iterator to add()
server.commit();

Query:
SolrQuery solrQuery = new SolrQuery().setQuery("ipod");
QueryResponse rsp = server.query(solrQuery);
Iterator<SolrDocument> iter = rsp.getResults().iterator();
while (iter.hasNext()) {
  SolrDocument resultDoc = iter.next();
  String content = (String) resultDoc.getFieldValue("content");
}
Page 21
Solr & Hadoop
Page 22
HDP Search: Deployment Options
Configuration 1: Solr deployed in an independent cluster
• Advantages: scales independently; scales easily for increased query volume; no need to carefully orchestrate resource allocations among workloads, indexing, and querying
• Disadvantages: multiple clusters to administer and manage

Configuration 2: Solr index deployed on HDFS nodes
• Advantages: a single cluster to administer and manage; leverages the advantages of the Hadoop file system
• Disadvantages: not supported for a Kerberized cluster
Page 23
How to store Solr’s index on HDFS?
Update the core's solrconfig.xml:

1. Set the directory factory:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>

2. Set the lock type:
<lockType>hdfs</lockType>
Page 24
Scalable Indexing in HDFS Using the Lucidworks Hadoop Connector
• MapReduce job; supported input formats include:
– CSV
– Microsoft Office files
– Grok (log data)
– Zip
– Solr XML
– Sequence files
– WARC
• Apache Pig & Hive
– Write your own Pig/Hive scripts to index content
– Use Hive/Pig for preprocessing and joining
– Output the resulting datasets to Solr

(Diagram: raw documents in HDFS are processed by a MapReduce or Pig job, which writes Lucene indexes to Solr.)
Page 25
Ingest CSV Files Using MapReduce
Scenario I: ingest CSV data stored on local disk

java -classpath "/usr/hdp/2.2.0.0-2041/hadoop-yarn/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*:/opt/solr/lw/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/opt/solr/lucidworks-hadoop-lws-job-1.3.0.jar:/usr/hdp/2.2.0.0-2041/hadoop/*" \
  com.lucidworks.hadoop.ingest.IngestJob \
  -DcsvFieldMapping=0=id,1=location,2=event_timestamp,3=deviceid,4=heartrate,5=user \
  -DcsvDelimiter="|" \
  -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
  -c hr \
  -i ./csv \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -s http://172.16.227.204:8983/solr
Page 26
Query Solr Index in Hive
Scenario II: query index data via Hive

CREATE EXTERNAL TABLE solr (id string, location string, event_timestamp string, deviceid string, heartrate bigint, user string)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.server.url' = 'http://172.16.227.204:8983/solr',
              'solr.collection' = 'hr',
              'solr.query' = '*:*');

SELECT user, heartrate FROM solr;
Page 27
Index existing Hive data
Scenario III: Index data stored in Hive (copy legacydata.csv to hdfs:///user/guest/legacy)
CREATE TABLE legacyhr (id string, location string, time string, hr int, user string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
LOAD DATA INPATH '/user/guest/legacy' INTO TABLE LEGACYHR;
INSERT INTO TABLE solr SELECT id, location, time, 'nodevice', hr, user FROM legacyhr;
Page 28
Transform & Index Documents with Pig
Scenario IV: transform and index data stored on HDFS

data.pig:
REGISTER '/opt/hadoop-lws-job-2.0.1-0-0-hadoop2.jar';
set solr.collection '$collection';
A = load '/user/guest/pigdata' using PigStorage(';') as (id_s:chararray, location_s:chararray, event_timestamp_s:chararray, deviceid_s:chararray, heartrate_l:long, user_s:chararray);
-- ID comes first, then field name, value
B = FOREACH A GENERATE $0, 'location', '29.4238889,-98.4933333', 'event_timestamp', $2, 'deviceid', $3, 'heartrate', $4, 'user', $5;
ok = store B into '$solrUrl' using com.lucidworks.hadoop.pig.SolrStoreFunc();

Run:
pig -p solrUrl=http://172.16.227.204:8983/solr -p collection=hr data.pig
Page 29
Solr Cloud
Page 30
SolrCloud
SolrCloud provides fault tolerance and high availability:
• Distributed indexing
• Distributed search
• Central configuration for the entire cluster
• Automatic load balancing and fail-over for queries
• ZooKeeper integration for cluster coordination and configuration.
Page 31
Sizing
Page 32
Sizing Guidelines (Handle With Care)
• 100-250 million docs per Solr server
• 4 Solr servers per physical machine (physical machine: 20 cores, 128 GB RAM)
• Up to ~30 queries/second with double-digit-millisecond response times
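The guidelines above imply a rough capacity estimate. A back-of-the-envelope sketch using the slide's numbers, where the 1-billion-document corpus is an assumed workload (not from the slides) and 150 million docs per server is the midpoint of the 100-250M range:

```java
public class SolrSizing {
    public static void main(String[] args) {
        long totalDocs = 1_000_000_000L;   // assumed corpus size (hypothetical)
        long docsPerServer = 150_000_000L; // midpoint of the 100-250M guideline
        int serversPerMachine = 4;         // from the slide

        // Ceiling division in both steps: partial servers/machines still count
        long servers = (totalDocs + docsPerServer - 1) / docsPerServer;
        long machines = (servers + serversPerMachine - 1) / serversPerMachine;

        System.out.println(servers + " Solr servers on " + machines + " machines");
        // prints: 7 Solr servers on 2 machines
    }
}
```

As the slide title says, handle with care: real sizing also depends on document size, query mix, replication factor and RAM available for caching.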
Page 33
Demo