Hortonworks Technical Workshop - HDP Search
TRANSCRIPT
Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP Search Workshop Hortonworks. We do Hadoop.
1/29/2013
Page 2
Agenda
• Hortonworks Data Platform 2.2
• Apache Solr
• Query & Ingest Documents with Apache Solr
• Solr & Hadoop
• Index on HDFS
• MapReduce, Hive & Pig
• Solr Cloud
• Sizing
• Demo
Page 3
HDP 2.2
Page 4
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
(Architecture diagram: YARN, the Data Operating System for cluster resource management, sits at the center of HDP on top of HDFS. Batch, interactive and real-time engines run on YARN: Pig (script), Hive on Tez (SQL), Cascading (Java/Scala) on Tez, Storm (stream), Solr (search), HBase and Accumulo (NoSQL) via Slider, Spark (in-memory), and ISV engines. Surrounding the core: data workflow, lifecycle and governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); operations (Ambari for provision/manage/monitor, ZooKeeper, Oozie for scheduling); and security (authentication, authorization, accounting, data protection; Knox and Ranger at the cluster level; HDFS for storage, YARN for resources, Hive and others for access, Falcon for pipelines). Deployment choice: Linux, Windows, on-premises or cloud.)
• YARN is the architectural center of HDP
• Enables batch, interactive and real-time workloads
• Provides comprehensive enterprise capabilities
• The widest range of deployment options
• Delivered completely in the open
Page 5
HDP 2.2: Reliable, Consistent & Current
HDP is Apache Hadoop, not “based on” Hadoop
Page 6
HDP Search
HDP 2.2 contains support for:
• Apache Solr 4.10 with Lucene
• Banana (time-series visualization)
• Lucidworks Hadoop connector
Page 7
Apache Solr
Page 8
What is Apache Solr
• A system built to search text
• A specialized type of database management system
• A platform to build search applications on
• Customizable, open source software
Page 9
Why Apache Solr
Specialized tools do the job better!
• For text search, Solr performs much better than a relational database
• Solr knows about languages
» E.g. lowercasing ὈΔΥΣΕΎΣ produces ὀδυσεύς
• Solr has features specific to text search
» E.g. highlighting search results
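The lowercasing example above can be reproduced with the JDK's locale-aware case mapping, which handles the same Greek special cases (the trailing capital sigma becomes the word-final form ς). This is a plain-JDK sketch of the idea, not Solr analyzer code:

```java
import java.util.Locale;

public class GreekLowercase {
    public static void main(String[] args) {
        // Full Unicode case mapping: the trailing capital sigma (Σ)
        // lowercases to the word-final form (ς), not the medial form (σ).
        String odysseus = "ὈΔΥΣΕΎΣ";
        System.out.println(odysseus.toLowerCase(Locale.ROOT)); // ὀδυσεύς
    }
}
```

Solr's language-specific analyzers layer tokenization, stemming and stop words on top of this kind of case handling.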
Page 10
Where does Apache Solr fit?
Page 11
Apache Solr’s Architecture
Page 12
Apache Solr’s inner Architecture
Page 13
Basics Of Inverted Index
Documents:

Doc ID | Content
1      | I like dog
2      | I like cat
3      | I like dog and cat

Inverted index:

Term | Doc IDs
I    | 1, 2, 3
like | 1, 2, 3
dog  | 1, 3
cat  | 2, 3

Question: find the documents containing both dog and cat.
Answer: intersect the postings lists for dog and cat: (1,3) ∩ (2,3) = (3).
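The build-and-intersect steps above can be sketched in a few lines. This is a minimal illustration of the data structure, not Solr/Lucene code, using the three example documents:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexDemo {

    // Build the inverted index: term -> sorted set of doc IDs
    static Map<String, TreeSet<Integer>> buildIndex(Map<Integer, String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            for (String term : e.getValue().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "I like dog");
        docs.put(2, "I like cat");
        docs.put(3, "I like dog and cat");

        Map<String, TreeSet<Integer>> index = buildIndex(docs);

        // "dog AND cat": intersect the two postings lists
        TreeSet<Integer> hits = new TreeSet<>(index.get("dog")); // {1, 3}
        hits.retainAll(index.get("cat"));                        // {2, 3}
        System.out.println(hits);                                // prints [3]
    }
}
```

Real engines keep postings lists sorted on disk and intersect them with skip lists rather than in-memory sets, but the principle is the same.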
Page 14
Solr Indexing
• Define the document structure using schema.xml
• Convert documents from their source format to a format supported by Solr (XML, JSON, CSV)
• Add the documents to Solr

Sample document in XML format:

<doc>
  <field name="id">1</field>
  <field name="screen_name">@thelabdude</field>
  <field name="cat">post</field>
</doc>
Page 15
Solr’s schema & fields
Before adding documents to Solr, you need to specify the schema, represented in a file called schema.xml. The schema declares:
• Fields
• The field used as the unique/primary key
• Field types
• How to index and search each field

Field types: in Solr, every field has a type, e.g. float, long, double, date, text.

Defining a field:
<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
Page 16
Dynamic Fields
Dynamic fields allow Solr to index fields that you did not explicitly define in your schema. A dynamic field is like a regular field, except its name contains a wildcard:

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>

For more field details see: http://wiki.apache.org/solr/SchemaXml
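Putting fields, field types, the unique key and a dynamic field together, a minimal schema.xml sketch in the Solr 4.x format might look as follows (the field names and analyzer chain here are illustrative, not taken from the workshop's collection):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema name="example" version="1.5">
  <fields>
    <!-- Unique key: single-valued, stored -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <!-- Any field ending in _i is indexed as an integer -->
    <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0"/>
    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
```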
Page 17
Query & Index Documents with Solr
Page 18
Adding & Deleting From Solr

Solr offers a REST-like interface for indexing and searching.

Add to index:
curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

Delete from index:
curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

CSV, JSON and XML are handled directly. Solr leverages Apache Tika for complex document types (PDF, Word, etc.).
Page 19
How To Query
Solr offers a REST-like interface for indexing and searching:
Query http://localhost:8983/solr/select?q=name:monsters&wt=json&indent=true
Page 20
Solr Java API (SolrJ)
Index:
SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("name", "doc1", 1.0f);
server.add(doc1); // You can also stream docs in a single HTTP request by providing an Iterator to add()
server.commit();

Query:
SolrQuery solrQuery = new SolrQuery().setQuery("ipod");
QueryResponse rsp = server.query(solrQuery);
Iterator<SolrDocument> iter = rsp.getResults().iterator();
while (iter.hasNext()) {
  SolrDocument resultDoc = iter.next();
  String content = (String) resultDoc.getFieldValue("content");
}
Page 21
Solr & Hadoop
Page 22
HDP Search: Deployment Options
Configuration 1: Solr deployed in an independent cluster
• Advantages: scales independently; scales easily for increased query volume; no need to carefully orchestrate resource allocations among workloads, indexing, and querying
• Disadvantages: multiple clusters to administer and manage

Configuration 2: Solr index deployed on HDFS nodes
• Advantages: a single cluster to administer and manage; leverages the advantages of the Hadoop file system
• Disadvantages: not supported for a Kerberized cluster
Page 23
How to store Solr’s index on HDFS?
Update the core's solrconfig.xml:

1. Set the directory factory:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>

2. Set the lock type:
<lockType>hdfs</lockType>
Page 24
Scalable Indexing in HDFS Using the Lucidworks Hadoop Connector
• MapReduce job; supported input formats include:
– CSV
– Microsoft Office files
– Grok (log data)
– Zip
– Solr XML
– Sequence files
– WARC
• Apache Pig & Hive
– Write your own Pig/Hive scripts to index content
– Use Hive/Pig for preprocessing and joining
– Output the resulting datasets to Solr

(Diagram: raw documents in HDFS are processed by a MapReduce or Pig job, which writes Lucene indexes to Solr.)
Page 25
Ingest CSV Files Using MapReduce
Scenario I: ingest CSV data stored on local disk

java -classpath "/usr/hdp/2.2.0.0-2041/hadoop-yarn/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*:/opt/solr/lw/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/opt/solr/lucidworks-hadoop-lws-job-1.3.0.jar:/usr/hdp/2.2.0.0-2041/hadoop/*" \
  com.lucidworks.hadoop.ingest.IngestJob \
  -DcsvFieldMapping=0=id,1=location,2=event_timestamp,3=deviceid,4=heartrate,5=user \
  -DcsvDelimiter="|" \
  -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
  -c hr \
  -i ./csv \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -s http://172.16.227.204:8983/solr
Page 26
Query Solr Index in Hive
Scenario II: query index data via Hive

CREATE EXTERNAL TABLE solr (id string, location string, event_timestamp string, deviceid string, heartrate bigint, user string)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.server.url' = 'http://172.16.227.204:8983/solr',
              'solr.collection' = 'hr',
              'solr.query' = '*:*');

SELECT user, heartrate FROM solr;
Page 27
Index existing Hive data
Scenario III: Index data stored in Hive (copy legacydata.csv to hdfs:///user/guest/legacy)
CREATE TABLE legacyhr (id string, location string, time string, hr int, user string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
LOAD DATA INPATH '/user/guest/legacy' INTO TABLE LEGACYHR;
INSERT INTO TABLE solr SELECT id, location, time, 'nodevice', hr, user FROM legacyhr;
Page 28
Transform & Index Documents with Pig
Scenario IV: transform and index data stored on HDFS

data.pig:
REGISTER '/opt/hadoop-lws-job-2.0.1-0-0-hadoop2.jar';
set solr.collection '$collection';
A = load '/user/guest/pigdata' using PigStorage(';') as (id_s:chararray, location_s:chararray, event_timestamp_s:chararray, deviceid_s:chararray, heartrate_l:long, user_s:chararray);
-- ID comes first, then field name, value
B = FOREACH A GENERATE $0, 'location', '29.4238889,-98.4933333', 'event_timestamp', $2, 'deviceid', $3, 'heartrate', $4, 'user', $5;
ok = store B into '$solrUrl' using com.lucidworks.hadoop.pig.SolrStoreFunc();

Run:
pig -p solrUrl=http://172.16.227.204:8983/solr -p collection=hr data.pig
Page 29
Solr Cloud
Page 30
SolrCloud
SolrCloud provides fault tolerance and high availability:
• Distributed indexing
• Distributed search
• Central configuration for the entire cluster
• Automatic load balancing and fail-over for queries
• ZooKeeper integration for cluster coordination and configuration.
Page 31
Sizing
Page 32
Sizing Guidelines (Handle With Care)
• 100-250 million docs per Solr server
• 4 Solr servers per physical machine (physical machine: 20 cores, 128 GB RAM)
• Up to ~30 queries/second with double-digit-millisecond response times
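The guidelines above imply a rough capacity estimate. A back-of-the-envelope sketch using the slide's numbers, where the 1-billion-document corpus is an assumed workload (not from the slides) and 150 million docs per server is the midpoint of the 100-250M range:

```java
public class SolrSizing {
    public static void main(String[] args) {
        long totalDocs = 1_000_000_000L;   // assumed corpus size (hypothetical)
        long docsPerServer = 150_000_000L; // midpoint of the 100-250M guideline
        int serversPerMachine = 4;         // from the slide

        // Ceiling division in both steps: partial servers/machines still count
        long servers = (totalDocs + docsPerServer - 1) / docsPerServer;
        long machines = (servers + serversPerMachine - 1) / serversPerMachine;

        System.out.println(servers + " Solr servers on " + machines + " machines");
        // prints: 7 Solr servers on 2 machines
    }
}
```

As the slide title says, handle with care: real sizing also depends on document size, query mix, replication factor and RAM available for caching.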
Page 33
Demo