Big Data Lessons from the Cloud

Post on 13-Jan-2015

DESCRIPTION

Learn about the Challenge of Big Data and how Hadoop in the Cloud, a flexible infrastructure for Big Data, is changing everything!

TRANSCRIPT

Page 1: Big Data Lessons from the Cloud

1

Big Data Lessons from the Cloud

Jack Norris, MapR Technologies

Page 2: Big Data Lessons from the Cloud

2

The Challenge of Big Data

Business Analytics Requires a New Approach

Data is Growing Faster than Moore’s Law

Data Volume Growing 44x

2010: 1.2 Zettabytes

2020: 35.2 Zettabytes

Source: IDC Digital Universe Study, sponsored by EMC, May 2010

Page 3: Big Data Lessons from the Cloud

3

What are the Requirements for Big Data?

Process it quickly

Combine multiple data sources

Expand analysis

Page 4: Big Data Lessons from the Cloud

4

Big Data in the Cloud

Distributed, scalable computing platform
– Data/Compute framework
– Commodity hardware

Pioneered at Google

Commercially available as Hadoop
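To make the data/compute framework concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word count as the example. It only imitates the programming model; the function names (`map_words`, `reduce_counts`, `run_mapreduce`) are illustrative and are not part of Hadoop's API.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["big data in the cloud", "big data on commodity hardware"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts["big"], word_counts["cloud"])
```

On a real cluster the map and reduce calls run in parallel on the nodes that hold the data; the sketch keeps everything in one process so the flow is easy to follow.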

Page 5: Big Data Lessons from the Cloud

5

Important Drivers for Hadoop

Data on compute

You don’t need to know what questions to ask beforehand

Simple algorithms on Big Data

Analysis of unstructured data

Page 6: Big Data Lessons from the Cloud

6

Hadoop Growth

Page 7: Big Data Lessons from the Cloud

7

Apache Hadoop Distribution

Combination of Various Packages

Integrated, tested and hardened

Pig

Hive

HBase

Mahout

Oozie

Whirr

Avro

Cascading

Nagios

Ganglia

Management Control System

Data Platform

Flume

Sqoop

HCatalog

Zookeeper

Drill

MapReduce

Page 8: Big Data Lessons from the Cloud

8

Hadoop in the Cloud

Page 9: Big Data Lessons from the Cloud

9

Amazon Example: Elastic MapReduce (EMR)

EMR provides Hadoop as a Service in the Cloud

Page 10: Big Data Lessons from the Cloud

10

How does it work?

EMR

EMR Cluster

S3

You can store the data in S3 and/or on the cluster (HDFS)

You decide which Hadoop distribution to run, how many nodes, and what types of nodes

Page 11: Big Data Lessons from the Cloud

11

How does it work?

EMR

EMR Cluster

S3

You can easily add additional nodes

Page 12: Big Data Lessons from the Cloud

12

How does it work?

EMR Cluster

S3

When processing is complete, you can shut down the cluster (and stop paying)

Page 13: Big Data Lessons from the Cloud

13

Launching a Cluster

Page 14: Big Data Lessons from the Cloud

14

Thousands of customers, 2 million+ clusters

Page 15: Big Data Lessons from the Cloud

16

Hadoop in the Cloud is a Flexible Infrastructure for Big Data

Page 16: Big Data Lessons from the Cloud

17

Cloud Example of Scalability

MinuteSort: the amount of data that can be sorted in 60.00 seconds
– Benchmark is technology-agnostic

Previous record was 1.4 TB, set by Microsoft Research using specially designed software across physical hardware

Previous Hadoop MinuteSort record was 578 GB

Page 17: Big Data Lessons from the Cloud

18

A New MinuteSort World Record

New World Record: 1.5 TB in 60 seconds

Nearly 3X the data processed by the previous Hadoop record

Page 18: Big Data Lessons from the Cloud

19

Cloud Deployment Comparison

Previous Record
– 3452 physical servers
– Prepare datacenter
– Rack and stack servers
– Maintain hardware
– Months

New Record
– 2103 instances
– Invoke gcutil command
– Minutes

Page 19: Big Data Lessons from the Cloud

20

Cost Comparison

Previous Record: 3452 1U servers x $4K/server = $13,808,000

New Record: 2103 n1-standard-4-d instances x $.58/instance hour x 60 seconds = $20.33
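The arithmetic behind this cost comparison can be reproduced in a few lines. The sketch follows the slide's own assumptions: the hourly instance rate is pro-rated to the 60-second run, and the hardware figure covers server purchase only. The variable names are illustrative.

```python
SERVERS = 3452                     # 1U servers in the previous record
SERVER_PRICE_USD = 4_000           # $4K per server
INSTANCES = 2103                   # n1-standard-4-d instances
INSTANCE_RATE_USD_PER_HOUR = 0.58  # $.58 per instance hour
RUN_SECONDS = 60                   # MinuteSort run length

# Up-front hardware purchase for the physical cluster.
hardware_cost = SERVERS * SERVER_PRICE_USD

# Cloud cost, with the hourly rate pro-rated to the 60-second run.
cloud_cost = INSTANCES * INSTANCE_RATE_USD_PER_HOUR * RUN_SECONDS / 3600

print(f"${hardware_cost:,}")
print(f"${cloud_cost:,.2f}")
```

Actual cloud charges depend on billing granularity and on how long the instances stay up around the run, so the cloud figure is a lower bound rather than an invoice.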

Page 20: Big Data Lessons from the Cloud

21

Use Case 1: Expand Data for Analysis

Page 21: Big Data Lessons from the Cloud

22

Comparing an EDW to Hadoop

Major telecom vendor

Key step in billing pipeline handled by data warehouse (EDW)

EDW at maximum capacity

Multiple rounds of software optimization already done

Revenue limiting (= career limiting) bottleneck

Page 22: Big Data Lessons from the Cloud

23

Transformation

Extract and Load

CDR billing records

Billing reports

Data Warehouse

Customer bills

Original Flow

Page 23: Big Data Lessons from the Cloud

24

Problem Analysis

70% of EDW load is related to call detail record (CDR) normalization
– < 10% of total lines of code
– CDR normalization difficult within the EDW
– Binary extraction and conversion

Data rates are too high for upstream transform
– Requires high volume joins
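As an illustration of what binary extraction and conversion involves, the sketch below unpacks fixed-width binary records into normalized fields. The record layout (`CDR_LAYOUT`), the field names, and the sample values are hypothetical; real CDR formats vary by switch vendor and are not described in the slides.

```python
import struct
from datetime import datetime, timezone

# Hypothetical fixed-width layout (an assumption, not from the source):
# big-endian caller number, callee number, start time in epoch seconds,
# and call duration in seconds.
CDR_LAYOUT = struct.Struct(">QQIH")


def normalize_cdr(raw):
    """Convert one binary record into a dictionary of readable fields."""
    caller, callee, start_epoch, duration_s = CDR_LAYOUT.unpack(raw)
    start = datetime.fromtimestamp(start_epoch, tz=timezone.utc)
    return {
        "caller": str(caller),
        "callee": str(callee),
        "start_utc": start.isoformat(),
        "duration_s": duration_s,
    }


def normalize_stream(blob):
    """Walk a byte string made of whole, back-to-back records."""
    for offset in range(0, len(blob), CDR_LAYOUT.size):
        yield normalize_cdr(blob[offset:offset + CDR_LAYOUT.size])


# Two copies of one made-up record, packed and then normalized.
sample = CDR_LAYOUT.pack(14155550100, 14155550199, 1_300_000_000, 75)
records = list(normalize_stream(sample * 2))
print(records[0])
```

Work of this kind is byte-level and record-at-a-time, which is why it fits a map step far better than SQL running inside a warehouse.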

Page 24: Big Data Lessons from the Cloud

25

ETL

CDR billing records

Billing reports

Data Warehouse

Customer billing

With ETL Offload

Hadoop Cluster

Page 25: Big Data Lessons from the Cloud

26

ETL Offload

Hadoop Distribution

Page 26: Big Data Lessons from the Cloud

27

Simplified Analysis

70% of EDW consumed by ETL processing – Offload frees capacity

EDW direct hardware cost is approximately $30 million vs. Hadoop cluster at 1/50 the cost

Additional EDW only increases capacity by 50% due to poor division of labor
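A rough way to see why the offload matters is to compare headroom under the two options. The sketch assumes the EDW is the only bottleneck and that load scales linearly, which is a simplification; the two input percentages come from the slide.

```python
ETL_SHARE = 0.70        # fraction of EDW load consumed by ETL (from the slide)
SECOND_EDW_GAIN = 0.50  # capacity added by an additional EDW (from the slide)

# Assumption: the EDW is the only bottleneck and load scales linearly.
# Moving ETL off the EDW leaves 30% of the current load in place,
# so the same warehouse can absorb about 1 / 0.30 times that work.
offload_headroom = 1 / (1 - ETL_SHARE)

# Buying an additional EDW instead raises capacity by half.
second_edw_headroom = 1 + SECOND_EDW_GAIN

print(f"{offload_headroom:.1f}x vs {second_edw_headroom:.1f}x")
```

Under these assumptions the offload yields a bit over 3x headroom against 1.5x for an additional EDW.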

Page 27: Big Data Lessons from the Cloud

28

The Results

EDW strategy
– 1.5x performance
– $30 million

Hadoop strategy
– 3x faster
– 20x cost/performance advantage for Hadoop strategy
– With High Availability and data protection

Page 28: Big Data Lessons from the Cloud

29

Use Case 2:Combine Many Different Data Sources

Page 29: Big Data Lessons from the Cloud

30

Combining different feeds on one platform

Hadoop and HBase Storage and Processing

Real-time data feed from social network

Stored in Hadoop

Historical Purchase Information

Predictive Analytics from Historical data combined with NoSQL querying on real-time social networking data

Billing Data
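As a toy illustration of combining these feeds on one platform, the sketch below merges purchase history, billing data, and social events into one profile per customer. All records and field names are hypothetical sample data; in the deployment described, the data would live in Hadoop and HBase rather than in Python lists.

```python
from collections import defaultdict

# Hypothetical sample records; field names are illustrative only.
purchases = [
    {"customer": "c1", "item": "phone"},
    {"customer": "c2", "item": "tablet"},
]
billing = [
    {"customer": "c1", "balance": 42.0},
    {"customer": "c2", "balance": 0.0},
]
social_events = [
    {"customer": "c1", "text": "looking for a new phone case"},
]


def combine_feeds(purchases, billing, social_events):
    """Build one profile per customer from the three feeds."""
    profiles = defaultdict(
        lambda: {"items": [], "balance": None, "mentions": []}
    )
    for purchase in purchases:
        profiles[purchase["customer"]]["items"].append(purchase["item"])
    for bill in billing:
        profiles[bill["customer"]]["balance"] = bill["balance"]
    for event in social_events:
        profiles[event["customer"]]["mentions"].append(event["text"])
    return dict(profiles)


profiles = combine_feeds(purchases, billing, social_events)
print(profiles["c1"])
```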

Page 30: Big Data Lessons from the Cloud

31

Results

New Service Rolled out in 1 quarter

Processing time cut from 20 hours per day to 3

Recommendation engine load time decreased from 8 hours to 3 minutes

Includes data versioning support for easier development and updating of models

Page 31: Big Data Lessons from the Cloud

32

Collect Data from Dispersed Data Sources

Page 32: Big Data Lessons from the Cloud

33

Leading Veterinary Equipment Mfgr

Aggregates data across 6000 veterinary clinics

Nightly extracts from each clinic

One job runs once a week for a few hours

Expanding applications to include vaccination analysis for 300M vaccinations

Predictive analytics for disease prevalence and prevention

Page 33: Big Data Lessons from the Cloud

34

Use Case 3:New Application from New Data Source

Page 34: Big Data Lessons from the Cloud

35

Ancestry.com – Family Tree

Page 35: Big Data Lessons from the Cloud

36

Overview and Requirements

Collect and Collate information from disparate sources (Text files, Images, etc.)

Leverage new data source: Spit

Machine learning techniques and DNA Matching Algorithms

Page 36: Big Data Lessons from the Cloud

37

The Results

Storage Infrastructure for billions of small and large files

Blob Store for large images through NoSQL solutions

Multi-tenant capability for data-mining and machine-learning algorithm development

Page 37: Big Data Lessons from the Cloud

38

Use Case 4:New Analytics on Existing Data

Page 38: Big Data Lessons from the Cloud

39

Analytic Flexibility

MapReduce enabled Machine learning algorithms

Enhanced Search

Real-time event processing

No need to sample the data

Fraud Detection

Target Marketing

Consumer Behavior Analysis

…

Page 39: Big Data Lessons from the Cloud

40

Hadoop Expands Analytics

“Simple algorithms and lots of data trump complex models”

Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems

Page 40: Big Data Lessons from the Cloud

41

Advanced Simple Analytics

Fraud detection:
– Detect small frauds using transaction patterns across the entire portfolio
– Identify compromise signature to prevent further exploits and provide solid case explanations

Google Flu Trends vs. traditional flu surveillance systems and modeling

Netflix recommendation engine
– Complex models vs. adding IMDB data
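The fraud detection point can be illustrated with a deliberately simple rule applied to every transaction instead of a sample: flag merchants where many distinct cards show small charges. The thresholds, field order, and sample data are hypothetical and serve only to show the shape of such an analysis.

```python
from collections import defaultdict

SMALL_CHARGE_USD = 5.00  # hypothetical cut-off for a "small" charge
MIN_DISTINCT_CARDS = 3   # hypothetical number of cards that looks suspicious


def flag_merchants(transactions):
    """Return merchants with small charges on many distinct cards."""
    cards_by_merchant = defaultdict(set)
    for card, merchant, amount in transactions:
        if amount <= SMALL_CHARGE_USD:
            cards_by_merchant[merchant].add(card)
    return sorted(
        merchant
        for merchant, cards in cards_by_merchant.items()
        if len(cards) >= MIN_DISTINCT_CARDS
    )


# Made-up portfolio: (card, merchant, amount in USD).
transactions = [
    ("card1", "shopA", 1.99),
    ("card2", "shopA", 1.99),
    ("card3", "shopA", 2.49),
    ("card1", "shopB", 120.00),
    ("card4", "shopB", 3.00),
]
print(flag_merchants(transactions))
```

No single small charge stands out on its own; the pattern only shows up when the whole portfolio is scanned, which is the case for running simple algorithms on all of the data.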

Page 41: Big Data Lessons from the Cloud

42

Combine Them All

Page 42: Big Data Lessons from the Cloud

43

Clickstream Analysis

Big Box Retailer came to Razorfish
– 3.5 billion records
– 71 million unique cookies
– 1.7 million targeted ads required per day

Problem: Improve Return on Ad Spend (ROAS)

Page 43: Big Data Lessons from the Cloud

44

Clickstream Analysis

Targeted Ad

User recently purchased a sports movie and is searching for video games (1.7 Million per day)

Page 44: Big Data Lessons from the Cloud

45

Clickstream Analysis

Processing time dropped from 2+ days to 8 hours (with lots more data)

Page 45: Big Data Lessons from the Cloud

46

Clickstream Analysis

Increased Return On Ad Spend by 500%

Page 46: Big Data Lessons from the Cloud

47

Hadoop in the Cloud/EMR applications

Targeted advertising / Clickstream analysis

Security: anti-virus, fraud detection, image recognition

Pattern matching / Recommendations

Data warehousing / BI

Bio-informatics (Genome analysis)

Financial simulation (Monte Carlo simulation)

File processing (resize jpegs, video encoding)

Web indexing

Page 47: Big Data Lessons from the Cloud

48

Big Data Processing

99.999% HA

Data Protection

Disaster Recovery

Scalability & Performance

Enterprise Integration

Multi-tenancy

MapReduce

File-Based Applications

SQL

Database

Search

Stream Processing

Batch Orientation:
– Enterprise Logfile Analysis
– ETL Offload
– Object Archive
– Fraud Detection
– Clickstream Analytics

Real-Time Orientation:
– Sensor Analysis
– “Twitterscraping”
– Telematics
– Process Optimization

Interactive Orientation:
– Forensic Analysis
– Analytic Modeling
– BI User Focus
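To put the 99.999% HA figure on this slide in perspective, the short calculation below converts it into allowed downtime per year. It assumes a 365-day year and treats availability as a simple fraction of total time.

```python
AVAILABILITY = 0.99999            # "five nines" from the slide
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year

# Downtime budget implied by the availability target.
downtime_minutes = (1 - AVAILABILITY) * MINUTES_PER_YEAR

print(f"{downtime_minutes:.2f} minutes per year")
```

That works out to a little over five minutes of downtime per year.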

Page 48: Big Data Lessons from the Cloud

49

Big Data Lessons from the Cloud

1. Big Data requires a new approach

2. Hadoop is a paradigm shift

3. Easy to get started with Hadoop in the Cloud

4. Scale clusters up and down in the Cloud

5. Only pay for what you use

6. Expand data for analysis

7. Combine data sources

8. New application from new data source

9. New analytics

10. Wide variety of applications appropriate for Hadoop