Big Data Analysis Patterns - TriHUG 6/27/2013
DESCRIPTION
Big Data Analysis Patterns: Tying real-world use cases to strategies for analysis using big data technologies and tools. Big data is ushering in a new era for analytics, with large-scale data and relatively simple algorithms driving results rather than complex models that rely on sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think. This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.
TRANSCRIPT
1
Big Data Analysis Patterns
TriHUG
6/27/2013
2
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• [email protected]
3
BIG DATA
4
5
Big Data is not new! But the tools are.
6
The Good News in Big Data:
“Simple algorithms and lots of data trump complex models”
Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems
7
The Challenge: So Many Solutions!
What solutions fit your business problem?
For example, do you need…
• Apache Hadoop?
• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala)?
• d3.js or Tableau?
• Node.js?
• Titan?
8
Ask a Different Question
It may be more useful to define the problem better by asking some of these questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (in bursts or continuously?)
• Are queries made by sophisticated users?
• Are you looking for common patterns or outliers?
• How are your data sources structured?
9
Picking the Best Solution
Your responses to these questions can help you better:
• define the problem
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try
But first, here’s a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.
10
Apache Solr/Lucene
Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries, including data such as:
• Full text
• Geographical data
• Statistically weighted data
Solr is a small data tool that has flourished in a big data world
11
Apache Mahout
Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.
Mahout algorithms are mainly used for:
• Recommendation (collaborative filtering)
• Clustering
• Classification
Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr.
12
Apache Drill
• Google Dremel clone
• Pluggable Query Languages
– Starts with ANSI SQL 2003
– Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
– Hadoop, HBase
– MongoDB (BSON)
– RDBMS?
• Bypasses MapReduce
13
Storm
• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher level abstraction over:
– Message Queues
– Worker Logic
“The Hadoop of Realtime”
14
Titan
• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
– HBase or M7
– Cassandra
– Berkeley DB
• Search Integrated
– Solr/Lucene
– Elasticsearch
• Faunus
– Graph traversals on subset
– In-memory
15
Using the Answers to Guide Your Choices
For simplicity, let’s focus on the first three questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
16
Big Data Decision Tree
[Diagram: decision tree with outcomes labeled A, B, C, D, E, and ??]
How big is your data?
• <10 GB
• mid
• >200 GB
What size queries?
• Single element at a time
• One pass over 100%
• Multiple passes over big chunks
Other branch labels: Big storage, Streaming
Response time?
• < 100s (human scale)
• throughput, not response
17
Use Cases
• Company
• Data Shape
• Technique(s)
• Business Value
18
Business Value
19
Business Value
20
Telecommunications Giant
ETL Offload
21
Data Shape
Telecommunications
• Lots of Data
• Lots of Queries across Large Sets
• Throughput important
22
Techniques
Analytics ETL
Telecommunications
23
Techniques
ETL (Hadoop) + Analytics (Teradata)
Telecommunications
24
Business Value
Telecommunications
25
Credit Card Issuer
26
Data Shape
Credit Card Issuer
• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important
• Recommendations
27
History matrix
One row per user
One column per thing
Search Abuse
A Recommendation Engine with Mahout and Solr/Lucene
Techniques
28
Recommendation based on cooccurrence
Cooccurrence gives item-item mapping
One row and column per thing
Techniques
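The history and cooccurrence matrices from the last two slides can be sketched in a few lines of Python. This is a toy illustration, not Mahout's implementation: the user and item names are made up, the helper names `cooccurrence` and `recommend` are hypothetical, and raw counts are used where a real system would likely weight or filter them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical history: one row per user, listing the things that user touched.
history = {
    "alice": {"coffee", "books", "fuel"},
    "bob": {"coffee", "books"},
    "carol": {"books", "fuel"},
}

def cooccurrence(history):
    """Count, for every pair of items, how many users have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(user_items, counts, top_n=3):
    """Score unseen items by summed cooccurrence with the user's items."""
    scores = defaultdict(int)
    for item in user_items:
        for other, n in counts[item].items():
            if other not in user_items:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda k: (-scores[k], k))[:top_n]

cooc = cooccurrence(history)
print(recommend({"coffee"}, cooc))
```

The result is an item-item mapping with one row and one column per thing, which is what the next slide turns into a search index.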
29
Cooccurrence matrix can also be implemented as a search index
Techniques
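One way to read this slide: store each item's cooccurring items in an indexed field, then use the user's recent history as the query. The sketch below mimics that with a tiny in-memory inverted index. It does not use the Solr API, and the field name `indicators` and the sample items are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical item documents: each carries an "indicators" field holding
# the items that cooccur with it (as produced by an offline job).
docs = {
    "books": {"indicators": ["coffee", "fuel"]},
    "coffee": {"indicators": ["books"]},
    "fuel": {"indicators": ["books"]},
}

def build_index(docs):
    """Inverted index: indicator term -> set of item ids containing it."""
    index = defaultdict(set)
    for item_id, doc in docs.items():
        for term in doc["indicators"]:
            index[term].add(item_id)
    return index

def search(index, user_history, top_n=3):
    """Rank items by how many history terms match their indicators."""
    hits = defaultdict(int)
    for term in user_history:
        for item_id in index.get(term, ()):
            if item_id not in user_history:
                hits[item_id] += 1
    # Most matches first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda k: (-hits[k], k))[:top_n]

index = build_index(docs)
print(search(index, ["coffee", "fuel"]))
```

Framing the recommendation as a query means the serving side is an ordinary search request against an index.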
30
Techniques
[Diagram: Solr indexing. Labels: Complete history, Cooccurrence (Mahout), Item meta-data, Solr Indexer (two instances), Solr indexing, Index shards. Timings shown: 20 Hrs, 3 Hrs]
31
Techniques
[Diagram: Solr search. Labels: User history, Web tier, Item meta-data, Solr Indexer (two instances), Solr search, Index shards. Timings shown: 8 Hrs, 3 Min]
32
Techniques
[Diagram labels]
• Purchase History
• Merchant Information
• Merchant Offers
• Hadoop
• Recommendation Engine Results (Mahout)
• Export (4 hrs)
• Import (4 hrs)
• Presentation Data Store (DB2)
• App (five instances)
33
Techniques
[Diagram labels]
• Purchase History
• Merchant Information
• Merchant Offers
• Hadoop
• Recommendation Engine Results (Mahout)
• Index Update (3 min)
• Recommendation Search Index (Solr)
• App (five instances)
34
Business Value
35
Idle Alerts
Waste & Recycling Leader
36
Truck Geolocation Data
– 20,000 trucks
– 5 sec interval (arriving quickly)
Landfill Geographic Boundaries
Data Shape
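To put "arriving quickly" in numbers, here is a back-of-the-envelope calculation from the two figures on this slide. It assumes every truck reports around the clock, which a real fleet would not, so treat the daily figure as an upper bound.

```python
TRUCKS = 20_000
INTERVAL_SECONDS = 5  # one geolocation report per truck every 5 seconds

events_per_second = TRUCKS // INTERVAL_SECONDS
events_per_day = events_per_second * 24 * 60 * 60

print(events_per_second)  # 4000
print(events_per_day)     # 345600000
```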
37
Techniques
[Diagram labels]
• Truck Geolocation Data
• Realtime Stream Computation (Storm)
• Immediate Alerts
• Hadoop Storage
• Batch Computation (MapReduce)
• Tax Reduction Reporting
• Shortest Path Graph Algorithm (Titan)
• Route Optimization
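The realtime branch of this diagram could look something like the following. It is plain Python rather than a Storm topology, and the idle rule (a truck that stays within a small radius for a given time span), the thresholds, and the event layout are all assumptions; the slides do not define them.

```python
import math

IDLE_SECONDS = 300        # assumed: alert after 5 minutes without movement
MOVE_THRESHOLD_M = 25.0   # assumed: smaller moves count as GPS jitter

def distance_m(a, b):
    """Approximate distance in metres between two (lat, lon) points."""
    lat = math.radians((a[0] + b[0]) / 2)
    dy = (b[0] - a[0]) * 111_320.0
    dx = (b[1] - a[1]) * 111_320.0 * math.cos(lat)
    return math.hypot(dx, dy)

class IdleDetector:
    """Keeps, per truck, the time and place where it last moved."""

    def __init__(self):
        self.anchor = {}      # truck_id -> (timestamp, (lat, lon))
        self.alerted = set()  # trucks already alerted for the current stop

    def process(self, truck_id, timestamp, position):
        """Feed one event; return True if it triggers an idle alert."""
        anchor = self.anchor.get(truck_id)
        if anchor is None or distance_m(anchor[1], position) > MOVE_THRESHOLD_M:
            # First sighting or real movement: reset the anchor.
            self.anchor[truck_id] = (timestamp, position)
            self.alerted.discard(truck_id)
            return False
        if timestamp - anchor[0] >= IDLE_SECONDS and truck_id not in self.alerted:
            self.alerted.add(truck_id)
            return True
        return False

detector = IdleDetector()
# A hypothetical truck reporting the same position every 5 seconds.
alerts = [detector.process("truck-1", t, (35.0, -78.0)) for t in range(0, 305, 5)]
print(alerts.count(True))
```

Only per-truck state is held in memory, so the same logic could be partitioned by truck id across workers.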
38
Business Value
39
Social Engagement Application
Beverage Company
40
Data Shape
• Tweets, FB Messages
• Person, Activity links
• Graph Traversal
41
Consumer Activity Graph
Wal*Mart.com
CVS
Dollar General
Ebay
Ebay Motors
Toys R Us
StubHub
Shopping.com
Sam’s
42
Techniques
Property Graph (Titan)
Key/Value Store (MapR M7)
Social Activity Stream
Graph Traversal (Faunus)
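As an illustration of the kind of traversal involved, the sketch below walks a toy person/activity graph breadth-first to find people connected through shared activities. It uses an in-memory adjacency map, not the Titan or Faunus APIs, and the vertex names are invented.

```python
from collections import deque

# Hypothetical graph: person vertices linked to activity vertices.
edges = {
    "person:ann": ["activity:tweet-1", "activity:fb-7"],
    "person:ben": ["activity:tweet-1"],
    "person:cat": ["activity:fb-7", "activity:fb-9"],
    "person:dan": ["activity:fb-9"],
}

def undirected(edges):
    """Build an adjacency map that can be walked in both directions."""
    adj = {}
    for src, targets in edges.items():
        for dst in targets:
            adj.setdefault(src, set()).add(dst)
            adj.setdefault(dst, set()).add(src)
    return adj

def people_within(adj, start, max_hops):
    """Breadth-first traversal: people reachable within max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    found = set()
    while queue:
        vertex, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adj.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
                if neighbor.startswith("person:"):
                    found.add(neighbor)
    return found

adj = undirected(edges)
print(sorted(people_within(adj, "person:ann", 2)))
```

Two hops (person to activity to person) finds people who share an activity; raising `max_hops` widens the circle.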
43
Business Value
44
Questions?