Hortonworks Technical Workshop - HDP Search


Page 1: Hortonworks Technical Workshop - HDP Search

HDP Search Workshop

Hortonworks. We do Hadoop.

1/29/2013

Page 2: Hortonworks Technical Workshop - HDP Search

Agenda

•  Hortonworks Data Platform 2.2

•  Apache Solr

•  Query & Ingest Documents with Apache Solr

•  Solr & Hadoop

•  Index on HDFS

•  MapReduce, Hive & Pig

•  Solr Cloud

•  Sizing

•  Demo

Page 3: Hortonworks Technical Workshop - HDP Search

HDP 2.2

Page 4: Hortonworks Technical Workshop - HDP Search

HDP delivers a comprehensive data management platform

Hortonworks Data Platform 2.2

[Architecture diagram: the HDP 2.2 stack. YARN, the Data Operating System (cluster resource management), sits at the center above HDFS (Hadoop Distributed File System) and hosts the batch, interactive and real-time data access engines: Script (Pig), SQL (Hive on Tez), Java/Scala (Cascading on Tez), Stream (Storm), Search (Solr), NoSQL (HBase and Accumulo on Slider), In-Memory (Spark), and other ISV engines. Around the core: Security (authentication, authorization, accounting and data protection, enforced across storage (HDFS), resources (YARN), access (Hive, …), pipeline (Falcon) and cluster (Knox, Ranger)); Operations (Ambari for provision, manage and monitor; Zookeeper; Oozie for scheduling); and Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS). Deployment choice: Linux or Windows, on-premises or cloud.]

•  YARN is the architectural center of HDP

•  Enables batch, interactive and real-time workloads

•  Provides comprehensive enterprise capabilities

•  The widest range of deployment options

•  Delivered Completely in the OPEN

Page 5: Hortonworks Technical Workshop - HDP Search

HDP 2.2: Reliable, Consistent & Current

HDP is Apache Hadoop, not “based on” Hadoop

Page 6: Hortonworks Technical Workshop - HDP Search

HDP Search

HDP 2.2 contains support for:

•  Apache Solr 4.10 with Lucene

•  Banana (time-series visualization)

•  Lucidworks Hadoop connector

Page 7: Hortonworks Technical Workshop - HDP Search

Apache Solr

Page 8: Hortonworks Technical Workshop - HDP Search

What is Apache Solr?

•  A system built to search text

•  A specialized type of database management system

•  A platform to build search applications on

•  Customizable, open source software

Page 9: Hortonworks Technical Workshop - HDP Search

Why Apache Solr?

Specialized tools do the job better!

•  For text search, Solr performs far better than a relational database

•  Solr knows about languages

» E.g. lowercasing ὈΔΥΣΕΎΣ produces ὀδυσεύς

•  Solr has features specific to text search

» E.g. highlighting search results

Page 10: Hortonworks Technical Workshop - HDP Search

Where does Apache Solr fit?

Page 11: Hortonworks Technical Workshop - HDP Search

Apache Solr’s Architecture

Page 12: Hortonworks Technical Workshop - HDP Search

Apache Solr’s Inner Architecture

Page 13: Hortonworks Technical Workshop - HDP Search

Basics of an Inverted Index

Documents:

Doc ID   Content
1        I like dog
2        I like cat
3        I like dog and cat

Inverted index:

Term   Doc IDs
I      1, 2, 3
like   1, 2, 3
dog    1, 3
cat    2, 3

Question: find the documents that contain both dog and cat.

Answer: intersect the postings lists for dog and cat: (1, 3) ∩ (2, 3) = 3.
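To make the lookup mechanics concrete, here is a minimal Java sketch (not from the slides; class and variable names are illustrative) that builds this inverted index and answers the dog-and-cat query by intersecting postings lists:

import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        Map<Integer, String> docs = new HashMap<>();
        docs.put(1, "I like dog");
        docs.put(2, "I like cat");
        docs.put(3, "I like dog and cat");

        // Build the inverted index: term -> sorted set of doc IDs (a postings list).
        Map<String, SortedSet<Integer>> index = new HashMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            for (String term : e.getValue().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());
            }
        }

        // AND query: intersect the postings lists for "dog" and "cat".
        SortedSet<Integer> hits = new TreeSet<>(index.get("dog"));
        hits.retainAll(index.get("cat"));
        System.out.println(hits); // prints [3]
    }
}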

Page 14: Hortonworks Technical Workshop - HDP Search

Solr Indexing

•  Define the document structure using schema.xml

•  Convert documents from the source format to a format Solr supports directly (XML, JSON, CSV)

•  Add the documents to Solr

Sample document in XML format:

<doc>
  <field name="id">1</field>
  <field name="screen_name">@thelabdude</field>
  <field name="cat">post</field>
</doc>
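The same document expressed in JSON, another format Solr accepts directly (a sketch; Solr's JSON update format takes an array of documents):

[
  {"id": "1", "screen_name": "@thelabdude", "cat": "post"}
]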

Page 15: Hortonworks Technical Workshop - HDP Search

Solr’s schema & fields

Before adding documents to Solr, you need to specify the schema, represented in a file called schema.xml. The schema declares:

-  fields
-  the field used as the unique/primary key
-  field types
-  how to index and search each field

Field types: in Solr, every field has a type, e.g. float, long, double, date, text.

Defining a field:

<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
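Assembled into a file, a minimal schema.xml for the tweet-like sample document above might look like the following (an illustrative sketch, not from the slides; the text_general type and its analyzer chain are assumptions based on Solr 4.x's stock examples):

<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="screen_name" type="string" indexed="true" stored="true"/>
    <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="text" type="text_general" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>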

Page 16: Hortonworks Technical Workshop - HDP Search

Dynamic Fields

Dynamic fields allow Solr to index fields that you did not explicitly define in your schema. A dynamic field is like a regular field, except its name contains a wildcard:

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>

For more field details see: http://wiki.apache.org/solr/SchemaXml
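For example, with the *_i dynamic field above, a document carrying a field named retweet_count_i (a hypothetical name, chosen here for illustration) would be indexed and stored as an int even though retweet_count_i appears nowhere in schema.xml.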

Page 17: Hortonworks Technical Workshop - HDP Search

Query & Index Documents with Solr

Page 18: Hortonworks Technical Workshop - HDP Search

Adding & Deleting from Solr

Solr offers a REST-like interface for indexing and searching. CSV, JSON and XML are handled directly; Solr leverages Apache Tika for complex document types (PDF, Word, etc.).

Add to the index:

curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

Delete from the index (delete-by-query, then commit):

curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
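The books.json file referenced above is not shown in the slides; in Solr's JSON update format it would be an array of documents, e.g. (hypothetical titles):

[
  {"id": "book1", "name": "Monsters of the Deep"},
  {"id": "book2", "name": "A Field Guide to Dragons"}
]

The query on the next slide, q=name:monsters, would then match the first document, assuming name is a tokenized text field.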

Page 19: Hortonworks Technical Workshop - HDP Search

How to Query

Solr offers a REST-like interface for indexing and searching.

Query:

http://localhost:8983/solr/select?q=name:monsters&wt=json&indent=true
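With wt=json, results come back in Solr's standard JSON response envelope; a sketch of its shape (the document and timing values are illustrative):

{
  "responseHeader": {"status": 0, "QTime": 2},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {"id": "book1", "name": "Monsters of the Deep"}
    ]
  }
}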

Page 20: Hortonworks Technical Workshop - HDP Search

Solr Java API (SolrJ)

Index:

// requires the SolrJ 4.x classes SolrServer, HttpSolrServer and SolrInputDocument
SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("name", "doc1", 1.0f);
server.add(doc1); // you can also stream docs in a single HTTP request by passing an Iterator to add()
server.commit();

Query:

// requires SolrQuery, QueryResponse, SolrDocument and java.util.Iterator
SolrQuery solrQuery = new SolrQuery().setQuery("ipod");
QueryResponse rsp = server.query(solrQuery);
Iterator<SolrDocument> iter = rsp.getResults().iterator();
while (iter.hasNext()) {
    SolrDocument resultDoc = iter.next();
    String content = (String) resultDoc.getFieldValue("content");
}

Page 21: Hortonworks Technical Workshop - HDP Search

Solr & Hadoop

Page 22: Hortonworks Technical Workshop - HDP Search

HDP Search: Deployment Options

Configuration 1: Solr deployed in an independent cluster
  Advantages:
  •  Scales independently
  •  Scales easily for increased query volume
  •  No need to carefully orchestrate resource allocations among workloads, indexing, and querying
  Disadvantage:
  •  Multiple clusters to administer and manage

Configuration 2: Solr index deployed on HDFS nodes
  Advantages:
  •  Single cluster to administer and manage
  •  Leverages the advantages of the Hadoop file system
  Disadvantage:
  •  Not supported on a Kerberized cluster

Page 23: Hortonworks Technical Workshop - HDP Search

How to store Solr’s index on HDFS?

Update the core's solrconfig.xml:

1. Set the directory factory:

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>

2. Set the lock type:

<lockType>hdfs</lockType>
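For a stock Solr 4.x install, the Solr reference guide also documents enabling this at startup through system properties, which the example solrconfig.xml reads; a sketch for SolrCloud mode, where solr.hdfs.home puts all index data under one HDFS directory (reusing the HDFS path above):

java -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lock.type=hdfs \
     -Dsolr.hdfs.home=hdfs://sandbox:8020/user/solr \
     -jar start.jar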

Page 24: Hortonworks Technical Workshop - HDP Search

Scalable Indexing in HDFS Using the Lucidworks Hadoop Connector

•  MapReduce job; ingest mappers cover:
   – CSV
   – Microsoft Office files
   – Grok (log data)
   – Zip
   – Solr XML
   – Sequence files
   – WARC

•  Apache Pig & Hive:
   – Write your own Pig/Hive scripts to index content
   – Use Hive/Pig for preprocessing and joining
   – Output the resulting datasets to Solr

[Diagram: raw documents in HDFS feed a MapReduce or Pig job, which writes Lucene indexes to Solr]

Page 25: Hortonworks Technical Workshop - HDP Search

Ingest CSV Files Using MapReduce

Scenario I: Ingest CSV data stored on local disk (run from /root/csv):

java -classpath "/usr/hdp/2.2.0.0-2041/hadoop-yarn/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*:/opt/solr/lw/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/opt/solr/lucidworks-hadoop-lws-job-1.3.0.jar:/usr/hdp/2.2.0.0-2041/hadoop/*" \
  com.lucidworks.hadoop.ingest.IngestJob \
  -DcsvFieldMapping=0=id,1=location,2=event_timestamp,3=deviceid,4=heartrate,5=user \
  -DcsvDelimiter="|" \
  -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
  -c hr \
  -i ./csv \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -s http://172.16.227.204:8983/solr
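Reading the options as they appear in the command: -cls names the ingest mapper class (here the CSV mapper), -c the target Solr collection (hr), -i the input path, -of the output format that writes documents to Solr, and -s the Solr URL; -DcsvFieldMapping maps CSV column positions to Solr field names, -DcsvDelimiter sets the column separator, and -Dlww.commit.on.close commits when the job finishes. (This reading is inferred from the command itself and the collection and mapper names used elsewhere in the workshop.)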

Page 26: Hortonworks Technical Workshop - HDP Search

Query Solr Index in Hive

Scenario II: Query index data via Hive:

CREATE EXTERNAL TABLE solr (id string, location string, event_timestamp string,
                            deviceid string, heartrate bigint, user string)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.server.url' = 'http://172.16.227.204:8983/solr',
              'solr.collection' = 'hr',
              'solr.query' = '*:*');

SELECT user, heartrate FROM solr;

Page 27: Hortonworks Technical Workshop - HDP Search

Index existing Hive-data

Scenario III: Index data stored in Hive (copy legacydata.csv to hdfs:///user/guest/legacy):

CREATE TABLE legacyhr (id string, location string, time string, hr int, user string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

LOAD DATA INPATH '/user/guest/legacy' INTO TABLE legacyhr;

INSERT INTO TABLE solr SELECT id, location, time, 'nodevice', hr, user FROM legacyhr;
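Because the solr table from Scenario II is backed by LWStorageHandler, this INSERT streams each selected row into the hr collection as a Solr document; the literal 'nodevice' fills the deviceid position, which the legacy data does not have.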

Page 28: Hortonworks Technical Workshop - HDP Search

Transform & Index Documents with Pig

Scenario IV: Transform and index data stored on HDFS.

data.pig:

REGISTER '/opt/hadoop-lws-job-2.0.1-0-0-hadoop2.jar';
set solr.collection '$collection';
A = load '/user/guest/pigdata' using PigStorage(';') as (id_s:chararray, location_s:chararray,
    event_timestamp_s:chararray, deviceid_s:chararray, heartrate_l:long, user_s:chararray);
-- ID comes first, then field name/value pairs
B = FOREACH A GENERATE $0, 'location', '29.4238889,-98.4933333', 'event_timestamp', $2,
    'deviceid', $3, 'heartrate', $4, 'user', $5;
ok = store B into '$solrUrl' using com.lucidworks.hadoop.pig.SolrStoreFunc();

Run it:

pig -p solrUrl=http://172.16.227.204:8983/solr -p collection=hr data.pig

Page 29: Hortonworks Technical Workshop - HDP Search

Solr Cloud

Page 30: Hortonworks Technical Workshop - HDP Search

SolrCloud

Apache Solr includes fault tolerance and high availability through SolrCloud:

•  Distributed indexing

•  Distributed search

•  Central configuration for the entire cluster

•  Automatic load balancing and fail-over for queries

•  ZooKeeper integration for cluster coordination and configuration.
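As a sketch of what this looks like in practice (not from the slides; host names and shard/replica counts are illustrative), a Solr 4.x node joins a SolrCloud cluster by pointing at the ZooKeeper ensemble, and collections are created through the Collections API:

# start a Solr node against an existing ZooKeeper ensemble
java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar

# create a collection with 2 shards and 2 replicas
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=hr&numShards=2&replicationFactor=2'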

Page 31: Hortonworks Technical Workshop - HDP Search

Sizing

Page 32: Hortonworks Technical Workshop - HDP Search

Sizing Guidelines (Handle With Care)

•  100-250 million docs per Solr server

•  4 Solr servers per physical machine (physical machine: 20 cores, 128 GB RAM)

•  Up to ~30 queries/s, with response times in the double-digit milliseconds
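Worked example using these numbers (my arithmetic, not from the slides): a 1-billion-document corpus needs roughly 4-10 Solr servers at 100-250 million docs each, which at 4 servers per machine fits on 1-3 physical machines of the class described.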

Page 33: Hortonworks Technical Workshop - HDP Search

Demo