The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics
1 ©MapR Technologies - Confidential
The Search Is Over: Integrating Solr and Hadoop to Simplify Big Data Analytics
Evolution of Search
Documents
•Models
•Feature Selection
User Interaction
•Clicks
•Ratings/Reviews
•Learning to Rank
•Social Graph
Queries
•Phrases
•NLP
Content Relationships
•Page Rank, etc.
•Organization
Search, Discovery, and Analytics
Diagram labels: Search, Discovery, Analytics
Data is Growing Quickly
Data is Growing Faster than Moore’s Law
Business Analytics Requires a New Approach
Data Volume Growing 44x
– 2010: 1.2 Zettabytes
– 2020: 35.2 Zettabytes
Sources: IDC Digital Universe Study, sponsored by EMC, May 2010; IDC Digital Universe Study 2011
MapReduce: A Paradigm Shift
Distributed computing platform
– Large clusters
– Commodity hardware
Pioneered at Google
– Bigtable and Google File System
Commercially available as Hadoop
Hadoop Explosion
How does Map/Reduce work?
1. Map
– Spread data across servers based on key/value pairs
– Each node independently scans local data
2. Servers produce Map results
3. Reduce - combine/merge Map results
4. Process complete or Map a new function
Like shuffling multiple decks of playing cards
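The steps above can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce flow, not Hadoop's API; the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every record; each call returns (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Combine the values collected for each key into one result."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical task: count words across input lines.
def count_words(line):
    return [(word.lower(), 1) for word in line.split()]

def sum_counts(key, values):
    return sum(values)

lines = ["solr and hadoop", "hadoop and search"]
word_counts = reduce_phase(shuffle(map_phase(lines, count_words)), sum_counts)
```

On a real cluster each node runs the map function against its local data and the shuffle moves pairs across the network; here all three phases run in one process.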
The Cost of Enterprise Storage
SAN Storage
– $2 - $10/Gigabyte
– $1M gets: 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec
NAS Filers
– $1 - $5/Gigabyte
– $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec
Local Storage
– $0.05/Gigabyte
– $1M gets: 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec
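The capacity figures can be checked with simple arithmetic. The sketch below assumes decimal units (1 Petabyte = 1,000,000 Gigabytes) and uses the low end of each price range, which appears to be what the "$1M gets" numbers are based on.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB

def petabytes_for_budget(budget_dollars, dollars_per_gb):
    """Capacity in petabytes that a budget buys at a given price per gigabyte."""
    return budget_dollars / dollars_per_gb / GB_PER_PB

budget = 1_000_000
# Low end of each price range from the slide, in dollars per gigabyte.
low_end_price = {
    "SAN Storage": 2.00,
    "NAS Filers": 1.00,
    "Local Storage": 0.05,
}
capacity_pb = {
    tier: petabytes_for_budget(budget, price)
    for tier, price in low_end_price.items()
}
```

At these prices the same budget buys roughly 0.5, 1, and 20 Petabytes respectively, a 40x spread between SAN and local storage.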
Deep Object Store
Billions and Billions of Files
For some use cases it’s not the storage capacity, it’s the number of objects
– Messages
– Attachments
– Images
– Recordings
Provides a deep storage pool that is analytic ready
– Store it until you need it
– Derive secondary value from analytic processing
It makes more sense to run analytics where the data is stored and send only the results over the network
Problems with Integrating Solr with Hadoop
Simple to integrate with Hadoop as a data source
Difficult to integrate distributed search and scale
SolrCloud simplifies Sharding and Replication coordination
Integration limitations based on capabilities of large scale storage
– High availability
– Data protection
– Ease of Access
Sharded text Indexing
Diagram labels: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine
– Assign documents to shards
– Index text to local disk and then copy index to distributed file store
– Copy to local disk typically required before index can be loaded
Problems with Solr and Hadoop
Diagram labels: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine
– Failure of a reducer causes garbage to accumulate on the local disk
– Failure of the search engine requires another download of the index from clustered storage
Limitations of HDFS
Diagram labels: NAS appliance, NameNode, A, B, DataNode (×9)
HDFS is Append Only
Data Access is through the HDFS API
High Availability is a challenge
Single points of failure
Limited to 50-200 million files
Performance bottleneck
Logs: Flume aggregates incoming events to Solr
– Requires multi-step, batch process
Diagram labels: Application Server (×3), Hadoop Cluster
What’s Required for SDA?
Ease of Data Access through Open Standards
Large Scale, Reliable Storage
Ease of Integration
– Management (REST)
– Security (LDAP, NIS, Linux PAM…)
– Analytics (NFS, ODBC, HDFS)
Ease of Data Access
ENTERPRISE NFS Access
HDFS API
Multiple Architectures Possible
Export to the world
– NFS gateway runs on selected gateway hosts
Local server
– NFS gateway runs on local host
– Enables local compression and checksumming
Export to self
– NFS gateway runs on all data nodes, mounted from localhost
Data Access through Standard Protocols
Diagram labels: NFS Server (×4), NFS Client
NFS Access through a Local server
Diagram labels: Client, Application, NFS Server, Cluster Nodes
Universal export to self
Diagram labels: Cluster Node, Task, NFS Server, Cluster Nodes
Nodes are identical
Diagram labels: Cluster Node, Task, NFS Server (repeated on each of three nodes)
Simplifies Solr Hadoop Integration
Diagram labels: Input documents, Map, Reducer, Clustered index storage, Search Engine
– Failure of a reducer is cleaned up by the map-reduce framework
– Search engine reads mirrored index directly
How Does this Integration Happen?
Elegantly simple
Direct integration is a result of leveraging the Solr and Hadoop architectures
Data in the Hadoop cluster is written to a Volume
Solr Crawler discovers content being entered into Hadoop
Accesses the data in the cluster through NFS
Builds Search Index
Users query Solr to find data directly in Hadoop
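The flow above (discover files on an NFS mount, read them with ordinary file calls, build an index, answer lookups) can be sketched as follows. This is a toy stand-in, not Solr's crawler or index format; `crawl_and_index`, `search`, and the term-to-path index are assumptions made for illustration.

```python
import os
import re
from collections import defaultdict

def crawl_and_index(mount_point):
    """Walk a directory tree and build a term -> set of file paths index."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Plain file I/O is enough because the cluster is mounted over NFS.
            with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                for term in re.findall(r"\w+", handle.read().lower()):
                    index[term].add(path)
    return index

def search(index, term):
    """Return the sorted paths of files containing the term."""
    return sorted(index.get(term.lower(), ()))
```

Pointing `crawl_and_index` at the directory where a cluster volume is mounted would index that volume's files without any HDFS-specific client code, which is the point of the NFS access path.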
Distributed Shard Indexing
Stages: Input, Map, Combine, Shuffle and sort, Reduce, Output
Input: doc1 doc2 doc3
Map output: shard#1,doc1 shard#2,doc2 shard#1,doc3 shard#3,doc4 shard#3,doc5 …
Shuffle and sort output: shard#1,[doc3,doc1] shard#2,[doc2] shard#3,[doc5] …
Reduce output: index/s1 index/s2 index/s3 …
How Does this Work at Scale with Distributed Indices?
MapReduce jobs analyze distributed, disparate data in a cluster
In distributed indexing, the input is split arbitrarily into chunks and each chunk is handled separately. There can be many more chunks than there are shards to be created.
Mapper assigns document to shard
– Shard is usually hash of document id
Reducer indexes all documents for a shard
– Indexes created on local disk
– On success, copy index to DFS
ZooKeeper is used to manage Solr instances
A large Solr Search is distributed across multiple shards
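A minimal sketch of the mapper and reducer roles described above, run in a single process. It assumes a CRC32 hash of the document id for shard assignment and a toy inverted index per shard; none of the names come from Solr or Hadoop.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 3  # assumed shard count for the example

def assign_shard(doc_id, num_shards=NUM_SHARDS):
    """Mapper step: pick a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def map_documents(docs):
    """Emit (shard, (doc_id, text)) pairs, one per input document."""
    return [(assign_shard(doc_id), (doc_id, text)) for doc_id, text in docs.items()]

def reduce_shard(shard_docs):
    """Reducer step: build one inverted index for all documents in a shard."""
    index = defaultdict(set)
    for doc_id, text in shard_docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

def build_shard_indexes(docs):
    """Group mapper output by shard, then index each shard independently."""
    grouped = defaultdict(list)
    for shard, doc in map_documents(docs):
        grouped[shard].append(doc)
    return {shard: reduce_shard(shard_docs) for shard, shard_docs in grouped.items()}

def search_all_shards(shard_indexes, term):
    """A distributed search queries every shard and merges the hits."""
    hits = set()
    for index in shard_indexes.values():
        hits |= index.get(term.lower(), set())
    return sorted(hits)
```

Because the shard is a pure function of the document id, re-running a failed mapper sends each document to the same shard, and a query only needs to fan out across the shard indexes and merge the results.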
What about HA and Data Protection?
Reliable Compute
– Automated re-replication
– Self-healing from HW and SW failures
– Load balancing
– Rolling upgrades
– No lost jobs or data
– Five nines (99.999%) of uptime
Dependable Storage
– Business continuity with snapshots and mirrors
– Recover to a point in time
– End-to-end checksumming
– Strong consistency
– Mirror across sites to meet Recovery Time Objectives
Cluster Capabilities can Extend to Integrated Search and Discovery
MapReduce failure to write the Index
Highly available JobTracker and TaskTracker ensure that failed jobs recover their state and run to completion
MapReduce will clean up partially written indexes
No administrator intervention required
Solr Node Fails
Other Solr nodes start serving shards that were being served by failed node
Node Containing the Index Fails
Data is already replicated across the cluster
ZooKeeper assigns the replicated shard to a Solr instance on a node that holds a replica
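The reassignment idea in the two failure cases above can be sketched as a small function. It is an illustration only and does not use the ZooKeeper or SolrCloud APIs; the `serving` and `replicas` maps and the least-loaded tie-break are assumptions.

```python
def reassign_shards(serving, replicas, failed_node):
    """Return a new shard -> node map after failed_node goes away.

    serving:  shard -> node currently serving it
    replicas: shard -> list of nodes holding a copy of that shard's index
    """
    new_serving = {}
    load = {}
    # Shards on healthy nodes stay where they are.
    for shard, node in serving.items():
        if node != failed_node:
            new_serving[shard] = node
            load[node] = load.get(node, 0) + 1
    # Orphaned shards move to the least-loaded surviving node with a replica.
    for shard, node in sorted(serving.items()):
        if node != failed_node:
            continue
        candidates = [n for n in replicas.get(shard, []) if n != failed_node]
        if not candidates:
            raise RuntimeError(f"no surviving replica for {shard}")
        target = min(candidates, key=lambda n: (load.get(n, 0), n))
        new_serving[shard] = target
        load[target] = load.get(target, 0) + 1
    return new_serving
```

Because the index data is already replicated, the takeover requires no copy of the index: the new node starts serving from its local replica.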
Additional High Availability and Replication
Snapshots are available
Administrator sets frequency at the Volume
Snapshots with automatic de-duplication
Saves space by sharing blocks
Redirect on write, fast with no performance or storage penalty
Zero performance loss on writing to original
Scheduled, or on-demand
Easy recovery with drag and drop
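Block sharing and redirect on write can be illustrated with a toy model. This is a sketch of the general technique under simplifying assumptions (one block per file, in-memory storage), not a description of how MapR implements snapshots; the `Volume` class is hypothetical.

```python
class Volume:
    """Toy volume: files point at block ids, and blocks are never modified in place."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.live = {}        # file name -> block id (current view)
        self.snapshots = {}   # snapshot name -> frozen copy of the pointer table
        self._next_block = 0

    def write(self, name, data):
        """Redirect on write: new data always lands in a fresh block."""
        block_id = self._next_block
        self._next_block += 1
        self.blocks[block_id] = data
        self.live[name] = block_id

    def snapshot(self, snap_name):
        """Copy only the pointers; data blocks are shared with the live view."""
        self.snapshots[snap_name] = dict(self.live)

    def read(self, name, snap_name=None):
        """Read from the live view, or from a named snapshot."""
        table = self.live if snap_name is None else self.snapshots[snap_name]
        return self.blocks[table[name]]
```

Taking a snapshot copies no data, and a later write to the original leaves the snapshot's block untouched, so unchanged files cost no extra space.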
Mirroring Support in Hadoop Cluster
Business Continuity and Efficiency
Efficient design
– Differential deltas are updated
– Compressed and check-summed
Easy to manage
– Scheduled or on-demand
– WAN, Remote Seeding
– Consistent point-in-time
Diagram labels: Production, Research, Datacenter 1, Datacenter 2, EC2, WAN
Simplified NFS data flows for Distributed Search
Diagram labels: Input documents, Map, Reducer, Mirrors, Search Engine (×2)
– Mirroring allows exact placement of index data
– Arbitrary levels of replication also possible
Improving Search Relevancy
Requires a continuous Feedback Loop
– The quality of the search is influenced by end-user selections
– Fully automated process that improves with use
– Does not require manual tags or classification
Recommendations
Often referred to as collaborative filtering
Actors interact with items
– observe successful interaction
We want to suggest additional successful interactions
Observations inherently very sparse
Examples
Customers buying books (Linden et al)
Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s
Examples
Query for Friends results in links to Seinfeld
Search for kittens, get results for baby otters
Dyadic Structure
Functional
– Interaction: actor -> item*
Relational
– Interaction ⊆ Actors x Items
Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
Predict missing observations
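The matrix view can be built directly from a log of (actor, item) events. A minimal sketch, with made-up actor and item names:

```python
from collections import Counter

def interaction_matrix(events):
    """Return (actors, items, matrix) where matrix[i][j] counts interactions."""
    counts = Counter(events)
    actors = sorted({actor for actor, _ in events})
    items = sorted({item for _, item in events})
    # Pairs that never occurred count as 0; these are the missing observations.
    matrix = [[counts[(actor, item)] for item in items] for actor in actors]
    return actors, items, matrix

# Hypothetical interaction log: ann watched video1 twice.
events = [("ann", "video1"), ("ann", "video2"), ("bob", "video2"), ("ann", "video1")]
actors, items, matrix = interaction_matrix(events)
```

With real data most cells are zero, so a dense list of lists like this would be replaced by a sparse representation.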
Fundamental Algorithmics
Co-occurrence: K = A’A
A is actors x items, K is items x items
Product has general shape of matrix
K tells us “users who interacted with x also interacted with y”
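As a worked example, taking K as the product A’A (the form the later slides use), a small binary interaction matrix gives the following. The matrix values are made up and the helpers are plain Python rather than a linear algebra library.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Multiply two list-of-lists matrices."""
    y_t = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_t] for row in x]

def cooccurrence(a):
    """K = A'A: entry (x, y) counts actors who interacted with both items."""
    return matmul(transpose(a), a)

# Rows are actors, columns are items; 1 means the actor interacted with the item.
A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]
K = cooccurrence(A)
# K == [[2, 2, 1],
#       [2, 2, 1],
#       [1, 1, 2]]
```

K[0][1] is 2 because two actors interacted with both item 0 and item 1; the diagonal counts how many actors touched each item at all.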
Why not Expand it?
Users enter queries (A)
– (actor = user, item=query)
Users view videos (B)
– (actor = user, item=video)
A’A gives query recommendation
– “did you mean to ask for”
B’B gives video recommendation
– “you might like these videos”
The punch-line
B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
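A small sketch of the cross product B’A used as a query-to-video recommender. The users, queries, videos, and the ranking rule are all made up for illustration; only the B’A product comes from the slide.

```python
def cross_occurrence(b, a):
    """Return B'A: rows are videos, columns are queries.

    b: users x videos matrix, a: users x queries matrix (same user order).
    """
    return [[sum(b_row[v] * a_row[q] for b_row, a_row in zip(b, a))
             for q in range(len(a[0]))]
            for v in range(len(b[0]))]

def recommend_videos(b, a, queries, videos, query, top_n=2):
    """Rank videos by how often users who issued the query also viewed them."""
    scores = cross_occurrence(b, a)
    q = queries.index(query)
    ranked = sorted(range(len(videos)), key=lambda v: (-scores[v][q], videos[v]))
    return [videos[v] for v in ranked[:top_n] if scores[v][q] > 0]

queries = ["flamenco", "kittens"]
videos = ["guitar_solo", "dance", "otters"]
# One row per user; user order is the same in both matrices.
A = [[1, 0], [1, 0], [0, 1]]           # who entered which query
B = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]  # who viewed which video
```

Here `recommend_videos(B, A, queries, videos, "kittens")` returns `["otters"]` even though nothing in the video's name matches the query, because the ranking uses behavior alone and never looks at content or meta-data.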
Real-life example
Query: “Paco de Lucia”
Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
The Search for Relevancy
Updating Search to Reflect Relevancy
– Big MapReduce jobs can use behavioral traces in logs to improve results and identify importance
The power of this virtuous loop depends on frictionless data access, high availability, and performance