Big Data Analysis Patterns - TriHUG 6/27/2013

Post on 26-Jan-2015

DESCRIPTION

Big Data Analysis Patterns: tying real-world use cases to strategies for analysis using big data technologies and tools. Big data is ushering in a new era for analytics, with large-scale data and relatively simple algorithms driving results rather than complex models built on sample data. When you are ready to extract benefits from your data, how do you decide which approach, which algorithm, and which tool to use? The answer is simpler than you think. This session tackles big data analysis with a practical description of strategies for several classes of applications, illustrated concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, and Titan.

TRANSCRIPT

1

Big Data Analysis Patterns

TriHUG 6/27/2013

2

whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• banderson@maprtech.com

3

BIG DATA

4

5

Big Data is not new! But the tools are.

6

The Good News in Big Data:

“Simple algorithms and lots of data trump complex models”

Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

7

The Challenge: So Many Solutions!

What solutions fit your business problem?

For example, do you need…
• Apache Hadoop?
• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala)?
• d3.js or Tableau?
• Node.js?
• Titan?

8

Ask a Different Question

It may be more useful to define the problem better by asking some of these questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (in bursts or continuously?)
• Are queries written by sophisticated users?
• Are you looking for common patterns or outliers?
• How are your data sources structured?

9

Picking the Best Solution

Your responses to these questions can help you:
• define the problem better
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try

But first, here’s a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.

10

Apache Solr/Lucene

Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries over data such as:
• Full text
• Geographical data
• Statistically weighted data

Solr is a small data tool that has flourished in a big data world

11

Apache Mahout

Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.

Mahout algorithms are mainly used for:
• Recommendation (collaborative filtering)
• Clustering
• Classification

Mahout can be used in conjunction with solutions such as Solr. For example, you might use Mahout to create a co-occurrence database that is then queried using a search tool such as Solr.

12

Apache Drill

• Google Dremel clone
• Pluggable Query Languages
– Starts with ANSI SQL 2003
– Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
– Hadoop, HBase
– MongoDB (BSON)
– RDBMS?
• Bypasses MapReduce

13

Storm

• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher level abstraction over:
– Message Queues
– Worker Logic

“The Hadoop of Realtime”

14

Titan

• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
– HBase or M7
– Cassandra
– Berkeley DB
• Search Integrated
– Solr/Lucene
– Elasticsearch
• Faunus
– Graph traversals on subset
– In-memory

15

Using the Answers to Guide Your Choices

For simplicity, let’s focus on the first three questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?

16

Big Data Decision Tree

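[Figure: decision tree with three questions and leaves labeled A, B, C, D, E, and ??]
• How big is your data? <10 GB, mid, >200 GB
• What size queries? Single element at a time, one pass over 100%, multiple passes over big chunks (other branch labels: big storage, streaming)
• Response time? <100 s (human scale), throughput not response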
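The three questions in the tree can be sketched as a small classifier. The thresholds (10 GB, 200 GB, 100 s) come from the slide; the names `Workload`, `size_bucket`, `response_bucket`, and `classify`, the bucket labels, and the choice to treat any response budget of 100 s or more as throughput-oriented are assumptions for illustration. The sketch does not map buckets to the leaves A through E, because that mapping is not recoverable from this transcript.

```python
from dataclasses import dataclass
from typing import Optional

GB = 10**9  # decimal gigabyte, an assumption; the slide does not say which unit

# Query scopes named on the slide.
QUERY_SCOPES = ("single_element", "one_pass", "multi_pass")


@dataclass(frozen=True)
class Workload:
    stored_bytes: int                      # how large is the data to be stored?
    query_scope: str                       # how much data does one query touch?
    max_response_seconds: Optional[float]  # None: throughput matters, not latency


def size_bucket(stored_bytes: int) -> str:
    """Bucket storage size using the slide's 10 GB and 200 GB thresholds."""
    if stored_bytes < 10 * GB:
        return "small"
    if stored_bytes > 200 * GB:
        return "large"
    return "mid"


def response_bucket(max_response_seconds: Optional[float]) -> str:
    """Under 100 s is 'human scale'; everything else is treated as throughput."""
    if max_response_seconds is not None and max_response_seconds < 100:
        return "human_scale"
    return "throughput"


def classify(w: Workload) -> tuple:
    """Return (size, query scope, response) as a key for choosing candidates."""
    if w.query_scope not in QUERY_SCOPES:
        raise ValueError(f"unknown query scope: {w.query_scope!r}")
    return (size_bucket(w.stored_bytes), w.query_scope,
            response_bucket(w.max_response_seconds))


if __name__ == "__main__":
    print(classify(Workload(5 * GB, "single_element", 1.0)))
    print(classify(Workload(500 * GB, "multi_pass", None)))
```

The resulting tuple is only a lookup key; which tool belongs at each combination is a judgment the talk makes per use case.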

17

Use Cases
• Company
• Data Shape
• Technique(s)
• Business Value

18

Business Value

19

Business Value

20

Telecommunications Giant

ETL Offload

21

• Lots of Data
• Lots of Queries across Large Sets
• Throughput important

Data Shape

Telecommunications

22

Techniques

Analytics

ETL

Telecommunications

23

Techniques

ETL (Hadoop) + Analytics (Teradata)

Telecommunications

24

Business Value

Telecommunications

25

Credit Card Issuer

26

• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important
• Recommendations

Data Shape

Credit Card Issuer

27

History matrix

One row per user

One column per thing

Search Abuse

A Recommendation Engine with Mahout and Solr/Lucene

Techniques

28

Recommendation based on cooccurrence

Cooccurrence gives item-item mapping

One row and column per thing

Techniques
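A minimal sketch of the idea on this slide, in plain Python: starting from a history with one row per user, count how often two things appear in the same user's history to get an item-item mapping, then score candidates for a new history. This is not Mahout's implementation; it uses raw counts only, and the function names and the toy purchase data are made up for illustration.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(histories):
    """Build an item-item count table from {user: [items]}.

    Equivalent to the off-diagonal entries of A^T A, where A is the
    binary history matrix (one row per user, one column per thing).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        # set() makes each user's row binary: repeat purchases count once
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return {a: dict(row) for a, row in counts.items()}


def recommend(history, cooc, top_n=3):
    """Rank unseen items by summed cooccurrence with the user's history."""
    seen = set(history)
    scores = defaultdict(int)
    for item in seen:
        for other, n in cooc.get(item, {}).items():
            if other not in seen:
                scores[other] += n
    # highest score first, ties broken alphabetically for a stable order
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical purchase histories.
histories = {
    "u1": ["coffee", "bagel", "newspaper"],
    "u2": ["coffee", "bagel"],
    "u3": ["coffee", "tea"],
}
cooc = cooccurrence(histories)

if __name__ == "__main__":
    print(cooc["coffee"])
    print(recommend(["coffee"], cooc))
```

Raw counts favor popular items, so a production system would normally weight or filter the table before using it.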

29

Cooccurrence matrix can also be implemented as a search index

Techniques
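To see why a cooccurrence matrix fits in a search index, treat each item as a document whose indexed field lists the items it co-occurs with, and treat a user's history as the query. The sketch below is an in-memory stand-in for that idea, not Solr's API or its relevance scoring; `build_index`, `search`, and the sample data are hypothetical, and the score is simply the number of matching terms.

```python
from collections import defaultdict


def build_index(item_indicators):
    """Invert {item: [co-occurring items]} into {term: set of items}.

    Each item plays the role of a document; its co-occurring items are
    the terms stored in a single indexed field.
    """
    index = defaultdict(set)
    for item, indicators in item_indicators.items():
        for term in indicators:
            index[term].add(item)
    return index


def search(index, user_history, top_n=5):
    """Query with the user's history; rank items by number of term matches."""
    history = set(user_history)
    hits = defaultdict(int)
    for term in history:
        for item in index.get(term, ()):
            hits[item] += 1
    ranked = sorted(hits, key=lambda i: (-hits[i], i))
    # do not recommend what the user already has
    return [i for i in ranked if i not in history][:top_n]


# Hypothetical indicator lists, e.g. one row of a cooccurrence table per item.
item_indicators = {
    "coffee": ["bagel", "newspaper", "tea"],
    "bagel": ["coffee", "newspaper"],
    "newspaper": ["coffee", "bagel"],
    "tea": ["coffee"],
}
index = build_index(item_indicators)

if __name__ == "__main__":
    print(search(index, ["coffee", "bagel"]))
```

The design point is that recommendation becomes an ordinary query, so the serving side needs nothing beyond the search engine.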

30

Techniques

[Diagram, Solr indexing. Components: complete history, cooccurrence (Mahout), item meta-data, Solr indexers, index shards. Timing labels: 20 hrs, 3 hrs.]

31

Techniques

[Diagram, Solr search. Components: user history, web tier, item meta-data, Solr indexers, index shards. Timing labels: 8 hrs, 3 min.]

32

Techniques

[Diagram. Components: Purchase History, Merchant Information, Merchant Offers, Hadoop, Recommendation Engine Results (Mahout), Presentation Data Store (DB2), five App instances. Timing labels: Export (4 hrs), Import (4 hrs).]

33

Techniques

[Diagram. Components: Purchase History, Merchant Information, Merchant Offers, Hadoop, Recommendation Engine Results (Mahout), Recommendation Search Index (Solr), five App instances. Timing label: Index Update (3 min).]

34

Business Value

35

Idle Alerts

Waste & Recycling Leader

36

• Truck Geolocation Data
– 20,000 trucks
– 5 sec interval (arriving quickly)
• Landfill Geographic Boundaries

Data Shape
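The arrival rate implied by these numbers, and one way an idle alert could be derived from the position stream, can be sketched as follows. The fleet size and interval come from the slide; the definition of "idle" (staying within a radius for a minimum time), the default thresholds, and the `IdleDetector` class are assumptions for illustration. The sketch is plain Python rather than a Storm topology, and it does not model the landfill boundaries.

```python
import math

TRUCKS = 20_000
INTERVAL_S = 5
EVENTS_PER_SECOND = TRUCKS // INTERVAL_S       # 4,000 position reports per second
EVENTS_PER_DAY = EVENTS_PER_SECOND * 86_400    # 345,600,000 reports per day


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


class IdleDetector:
    """Per-truck state machine: alert once when a truck stays put too long."""

    def __init__(self, radius_m=25.0, idle_after_s=300):
        self.radius_m = radius_m          # assumed: movement below this is "not moving"
        self.idle_after_s = idle_after_s  # assumed: idle threshold in seconds
        self._anchor = {}                 # truck_id -> (lat, lon, since_ts, alerted)

    def observe(self, truck_id, ts, lat, lon):
        """Feed one position report; return True exactly once per idle episode."""
        anchor = self._anchor.get(truck_id)
        if anchor is None or haversine_m(anchor[0], anchor[1], lat, lon) > self.radius_m:
            # first report, or the truck moved: restart the idle clock here
            self._anchor[truck_id] = (lat, lon, ts, False)
            return False
        a_lat, a_lon, since, alerted = anchor
        if not alerted and ts - since >= self.idle_after_s:
            self._anchor[truck_id] = (a_lat, a_lon, since, True)
            return True
        return False


if __name__ == "__main__":
    det = IdleDetector(radius_m=25.0, idle_after_s=10)
    for ts in (0, 5, 10, 15):
        print(ts, det.observe("truck-1", ts, 35.0, -78.0))
```

Because the state is keyed by truck, the same logic can be partitioned by truck id across workers in a stream processor.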

37

Techniques

[Diagram. Components: Truck Geolocation Data, Realtime Stream Computation (Storm), Batch Computation (MapReduce), Immediate Alerts, Tax Reduction Reporting, Hadoop Storage, Shortest Path Graph Algorithm (Titan), Route Optimization.]
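The slide names a shortest-path graph algorithm for route optimization without saying which one. As a minimal sketch, Dijkstra's algorithm over a weighted road graph is shown below; it is a stand-in chosen for illustration, not Titan's API, and the graph, node names, and costs are hypothetical. It assumes non-negative edge costs.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm.

    graph: {node: [(neighbour, cost), ...]} with non-negative costs.
    Returns (total_cost, [source, ..., target]) or None if unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry for a node already settled
        done.add(node)
        if node == target:
            break
        for nbr, cost in graph.get(node, ()):
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    # walk predecessor links back from the target
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Hypothetical road segments with travel costs (e.g. minutes).
roads = {
    "depot": [("a", 4), ("b", 1)],
    "b": [("a", 2), ("landfill", 7)],
    "a": [("landfill", 3)],
}

if __name__ == "__main__":
    print(shortest_path(roads, "depot", "landfill"))
```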

38

Business Value

39

Social Engagement Application

Beverage Company

40

• Tweets, FB Messages
• Person, Activity links
• Graph Traversal

Data Shape

41

Consumer Activity Graph

[Graph figure. Node labels: Wal*Mart.com, CVS, Dollar General, Ebay, Ebay Motors, Toys R Us, StubHub, Shopping.com, Sam’s]

42

Techniques

Property Graph (Titan)

Key/Value Store (MapR M7)

Social Activity Stream

Graph Traversal (Faunus)
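A property graph with person and activity vertices, plus a bounded breadth-first traversal, can be sketched in a few lines. This is an in-memory illustration only, not the Titan or Faunus API; the `PropertyGraph` class, the `kind` property, and the sample vertices are hypothetical, and edges here carry no labels or properties.

```python
from collections import defaultdict, deque


class PropertyGraph:
    """Tiny undirected graph whose vertices carry key/value properties."""

    def __init__(self):
        self.props = {}              # vertex id -> property dict
        self.adj = defaultdict(set)  # vertex id -> neighbouring vertex ids

    def add_vertex(self, vid, **props):
        self.props[vid] = props

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)

    def within(self, start, max_hops):
        """Breadth-first traversal: {vertex id: hop count} up to max_hops."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if seen[v] == max_hops:
                continue  # do not expand past the hop limit
            for n in self.adj.get(v, ()):
                if n not in seen:
                    seen[n] = seen[v] + 1
                    queue.append(n)
        return seen


def related_people(g, person, hops=2):
    """People reachable from `person` through shared activities."""
    reached = g.within(person, hops)
    return sorted(v for v in reached
                  if v != person and g.props.get(v, {}).get("kind") == "person")


# Hypothetical social activity data: people linked to the activities they took part in.
g = PropertyGraph()
for name in ("alice", "bob", "carol"):
    g.add_vertex(name, kind="person")
for act in ("tweet-1", "post-2"):
    g.add_vertex(act, kind="activity")
g.add_edge("alice", "tweet-1")
g.add_edge("bob", "tweet-1")
g.add_edge("bob", "post-2")
g.add_edge("carol", "post-2")

if __name__ == "__main__":
    print(related_people(g, "alice", hops=2))
    print(related_people(g, "alice", hops=4))
```

Two hops reaches people who share an activity directly; four hops reaches friends of friends.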

43

Business Value

44

Questions?
