AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Upload: amazon-web-services

Post on 26-Jan-2015

DESCRIPTION

Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions, and Amazon EMR gives you the option of using the Amazon Distribution or the MapR Distribution for Hadoop. This webinar shows examples of how to use Amazon EMR with the MapR Distribution for Hadoop. You will learn how to free yourself from the heavy lifting required to run Hadoop on-premises and gain the advantages of the cloud: more flexibility and faster projects at lower cost.

What you'll learn:
• A live demonstration of how to launch your first Hadoop cluster in a few steps
• Examples of real-world applications and customer successes in production
• Best practices for maximizing the benefits of using MapR with AWS

TRANSCRIPT

Page 1: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Page 2: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Introducing

Maya Cabassi, Partner Marketing Manager, Amazon Web Services

Page 3: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Webinar Overview

Submit your questions using the Q&A tool.

A copy of today’s presentation will be made available on:

AWS SlideShare Channel @ http://www.slideshare.net/AmazonWebServices/

AWS Webinar Channel on YouTube @ http://www.youtube.com/channel/UCT-nPlVzJI-ccQXlxjSvJmw

Page 4: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Introducing

Jonathan Fritz, Sr. Product Manager, Amazon Web Services

Steve Wooledge, VP, Product Marketing, MapR Technologies

Bruce Penn, Principal Sales Engineer, MapR Technologies

Page 5: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

What We’ll Cover

• Elastic MapReduce (EMR): Hadoop in the cloud

– Elastic clusters tailored for your workflows

– Best container to run Hadoop in the AWS Ecosystem

• Introduction to MapR’s Hadoop Platform

– Defining features

– Increased performance

• Case Studies: MapR + Elastic MapReduce

• Q&A

Page 6: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Hadoop in the Cloud: Using MapR and Amazon Elastic MapReduce to unlock Big Data

Jonathan Fritz, Sr. Product Manager, Amazon Web Services

Steve Wooledge, VP, Product Marketing, MapR Technologies

Page 7: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Agenda

• Elastic MapReduce (EMR): Hadoop in the cloud

– Elastic clusters tailored for your workflows

– Best container to run Hadoop in the AWS Ecosystem

• Introduction to MapR’s Hadoop Platform

– Defining features

– Increased performance

• Case Studies: MapR + Elastic MapReduce

• Q+A

Page 8: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

The Three V’s: the drivers behind Big Data

Variety, Velocity, Volume

• YouTube users upload 48 hours of new video every minute

• Twitter sees roughly 175 million tweets every day

• Facebook analyzes 30+ petabytes of user-generated data

• More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide

• 2.7 zettabytes of data exist in the digital universe today

• Data production will be 44 times greater in 2020 than in 2009
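To put the 44x growth figure above in yearly terms, here is a minimal sketch. It assumes smooth compound growth over the 11 years from 2009 to 2020, which the slide does not state, and the function name is illustrative only.

```python
def implied_annual_growth(total_factor: float, years: int) -> float:
    """Return the constant yearly growth rate that compounds to total_factor."""
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years) - 1.0


# 44x between 2009 and 2020 is 11 yearly steps (assumes smooth growth).
rate = implied_annual_growth(44, 2020 - 2009)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the figure works out to roughly 41% more data each year.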

Page 9: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Hadoop is the right system for Big Data

• Scalable and fault tolerant

• Flexibility for multiple languages and data formats

• Open source

• Ecosystem of tools

• Batch and real-time analytics

Page 10: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Challenges with managing Hadoop

On-Premises

• Manage HDFS, upgrades, and system administration

• Pay for expensive support contracts

• Select hardware in advance and stick with predictions

Cloud

• Hard to tightly integrate with AWS storage services

• Independently manage and monitor clusters

Page 11: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Amazon Elastic MapReduce (EMR) is the easiest way to run Hadoop in the cloud.

Page 12: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Why Amazon Elastic MapReduce?

• Managed services

• Easy to tune clusters and trim costs

• Support for multiple AWS datastores

• Unique features and ecosystem support

Page 13: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

[Diagram labels: Input data (S3, DynamoDB, Redshift)]

Page 14: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

[Diagram labels: Elastic MapReduce; Code; Input data (S3, DynamoDB, Redshift)]

Page 15: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

[Diagram labels: Elastic MapReduce; Code; Name node; Input data (S3, DynamoDB, Redshift)]

Page 16: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

[Diagram labels: Elastic MapReduce; Code; Name node; Input data (S3, DynamoDB, Redshift); Elastic cluster; S3/HDFS]

Page 17: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

[Diagram labels: Elastic MapReduce; Code; Name node; Input data (S3, DynamoDB, Redshift); Elastic cluster; S3/HDFS; Queries + BI via JDBC, Pig, Hive]

Page 18: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

[Diagram labels: Elastic MapReduce; Code; Name node; Input data (S3, DynamoDB, Redshift); Elastic cluster; S3/HDFS; Queries + BI via JDBC, Pig, Hive; Output]

Page 19: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

[Diagram labels: Output; Input data (S3, DynamoDB, Redshift)]

Page 20: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Elastic clusters. Customize size and type to reduce costs.

Page 21: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Choose your instance types

Try out different configurations to find your optimal architecture.

CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge

Memory: m1.large, m2.2xlarge, m2.4xlarge

Disk: hs1.8xlarge

Page 22: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Long running or transient clusters

Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need.

Page 23: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Resizable clusters

Easy to add and remove compute capacity on your cluster.

[Chart label: 10 hours]

Page 24: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Resizable clusters

Easy to add and remove compute capacity on your cluster.

[Chart label: 6 hours]

Page 25: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Resizable clusters

Easy to add and remove compute capacity on your cluster.

[Chart label: Peak capacity]

Page 26: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Resizable clusters

Easy to add and remove compute capacity on your cluster.

Matched compute demands with cluster sizing.

[Chart label: 10 hours]
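The resizing slides contrast a 10-hour and a 6-hour run. As a rough sketch of that trade-off, the snippet below estimates how many nodes finish the same work by a shorter deadline. It assumes the job scales linearly with node count, which real Hadoop jobs only approximate, and the 10-node baseline is a hypothetical value, not a figure from the slides.

```python
import math


def nodes_for_deadline(baseline_nodes: int, baseline_hours: float,
                       target_hours: float) -> int:
    """Nodes needed to finish the same work in target_hours (linear scaling assumed)."""
    if baseline_nodes <= 0 or baseline_hours <= 0 or target_hours <= 0:
        raise ValueError("all inputs must be positive")
    node_hours = baseline_nodes * baseline_hours  # total work to be done
    return math.ceil(node_hours / target_hours)


# Hypothetical baseline: 10 nodes take 10 hours. How many nodes for 6 hours?
needed = nodes_for_deadline(10, 10, 6)
print(needed, "nodes,", needed * 6, "node-hours")
```

Under the linear assumption, total node-hours (and so cost) stay close to the baseline while the result arrives sooner.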

Page 27: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Use Spot and Reserved Instances. Minimize costs by supplementing on-demand pricing.

Page 28: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Easy to use Spot Instances

Name-your-price supercomputing to minimize costs.

Spot for task nodes: up to 90% off EC2 on-demand pricing.

On-demand for core nodes: standard EC2 pricing for on-demand capacity.
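A small worked example of the split described above: core nodes at the on-demand price, task nodes at a Spot discount. The node counts, the $0.50 hourly price and the full 90% discount are hypothetical inputs chosen for illustration. Actual Spot prices vary, and any EMR service charge is left out.

```python
def hourly_cluster_cost(core_nodes: int, task_nodes: int,
                        on_demand_price: float, spot_discount: float) -> float:
    """Hourly EC2 cost with core nodes on-demand and task nodes on Spot."""
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_price = on_demand_price * (1.0 - spot_discount)
    return core_nodes * on_demand_price + task_nodes * spot_price


# Hypothetical cluster: 4 core nodes, 16 task nodes, $0.50 per node-hour on-demand.
mixed = hourly_cluster_cost(4, 16, 0.50, 0.90)
all_on_demand = hourly_cluster_cost(4, 16, 0.50, 0.0)
print(f"mixed: ${mixed:.2f}/h, all on-demand: ${all_on_demand:.2f}/h")
```

With these inputs the mixed cluster costs about $2.80 per hour against $10.00 for the same cluster on on-demand capacity only.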

Page 29: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

24/7 clusters on Reserved Instances

Minimize cost for consistent capacity.

Reserved Instances for long running clusters: up to 65% off on-demand pricing.

Page 30: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Your data, your choice. Easy to integrate Elastic MapReduce with your datastores.

Page 31: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS
Page 32: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Using Amazon S3 and On-Cluster Storage

[Diagram labels]

Data Sources

Data aggregated and stored in Amazon S3

Transient EMR cluster for batch map/reduce jobs for daily reports

Long running EMR cluster holding data on the cluster in a NoSQL database

Weekly Report

Ad-hoc Query

Page 33: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Use Amazon EMR with Redshift and S3

Data Sources

Daily data aggregated in Amazon S3

Amazon EMR cluster used to process data

Processed data loaded into Amazon Redshift data warehouse

Page 34: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

© 2014 MapR Technologies

Introduction to MapR

Page 35: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

MAPR: WORLDWIDE HADOOP TECHNOLOGY LEADER

UNIQUELY ADDRESSES BOTH ANALYTIC AND OPERATIONAL USE CASES

500+ PAYING CUSTOMERS

Page 36: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Hadoop Distributions

[Diagram labels: Distribution A: Open Source; Distribution B: Open Source; Management; Open Source; Management; Architectural Innovations]

Page 37: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

MapR Distribution for Hadoop

Management

MapR Data Platform

APACHE HADOOP & OSS ECOSYSTEM

Impala, Shark, Hive, Pig, Hue, Oozie, ZooKeeper, Mahout, MLlib, Juju, Solr, Cascading, HttpFS, Flume, Storm, Spark Streaming, YARN, MapReduce, HBase, Whirr, Sqoop, Drill, Tez, Knox, Sentry, Spark, Falcon

Enterprise-grade
• High availability
• Data protection
• Disaster recovery

Security
• Enterprise security authorization
• Wire-level authentication
• Data governance

Operational
• Ability to support predictive analytics, real-time database operations, and high arrival rate data

Performance
• 2X to 7X higher performance
• Consistent, low latency

Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators

Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support

Page 38: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

A winning combination: MapR with Amazon Elastic MapReduce

Page 39: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Launching a Cluster

MapR Option Integrated within EMR
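The slide itself shows the console option. As a hedged sketch of a programmatic equivalent, the function below only builds a request dictionary shaped like the EMR RunJobFlow API; it does not call AWS. Several things here are assumptions rather than facts from the webcast: the function name, the AMI version string, the instance type and the bid price are hypothetical; `SupportedProducts` with a value such as `mapr-m3` reflects the RunJobFlow parameter for older 2.x/3.x AMI releases; and a real request would typically also need IAM roles and possibly steps.

```python
from typing import Any, Dict, List


def build_mapr_cluster_request(
    name: str,
    ami_version: str,
    core_count: int,
    task_count: int = 0,
    task_bid_price: str = "0.05",   # hypothetical Spot bid, in USD per hour
    instance_type: str = "m1.large",
    mapr_edition: str = "mapr-m3",  # assumed product identifier
) -> Dict[str, Any]:
    """Build keyword arguments shaped like an EMR RunJobFlow request."""
    if core_count < 1:
        raise ValueError("a cluster needs at least one core node")

    groups: List[Dict[str, Any]] = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes on Spot, following the Spot Instances slide.
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": task_bid_price,
             "InstanceType": instance_type, "InstanceCount": task_count})

    return {
        "Name": name,
        "AmiVersion": ami_version,
        "SupportedProducts": [mapr_edition],
        "Instances": {
            "InstanceGroups": groups,
            # Keep the cluster running after steps finish (long-running style).
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


request = build_mapr_cluster_request(
    "mapr-demo", ami_version="3.3.1", core_count=4, task_count=8)
# With boto3 installed and credentials configured, this could be submitted as:
#   boto3.client("emr").run_job_flow(**request)
print(request["SupportedProducts"], len(request["Instances"]["InstanceGroups"]))
```

Building the request separately from submitting it keeps the sketch testable without AWS credentials. Setting `KeepJobFlowAliveWhenNoSteps` to `False` instead would match the transient-cluster pattern from the earlier slides.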

Page 40: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

MapR: Designed for Both Transient and Long-Term Clusters

• High Availability

• Easy Development

• Multi-Tenancy

• World-Record Performance

• Breadth of Applications

Fastest On-Ramp to Develop Hadoop Applications

Best Platform for Long-Term Hadoop Production Success

Page 41: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

High Availability (HA) for Hadoop

No-NameNode architecture
• Distributed metadata can self-heal
• No practical limit on # of files

Resource Manager HA, Application Master HA
• YARN jobs are not impacted by failures
• Continue to meet SLAs with MapReduce v2

JobTracker HA for MRv1
• MapReduce v1 jobs are not impacted by failures
• Meet your data processing SLAs

NFS HA
• High throughput and resilience for NFS-based data ingestion, import/export and multi-client access

Instant recovery
• Files and tables are accessible within seconds of a node failure or cluster restart

Page 42: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Direct Integration with Existing Applications

• 100% POSIX compliant

• Industry standard APIs - NFS, ODBC, LDAP, REST

• More 3rd-party solutions

• No proprietary connectors required

• Language neutral
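To illustrate the "no proprietary connectors" point, the sketch below reads cluster data with ordinary file I/O, on the assumption that the cluster has been NFS-mounted on the local machine. The mount path in the comment is hypothetical, and the function works on any directory, not just a MapR mount.

```python
import os


def count_lines(root: str) -> int:
    """Count lines in every regular file under root using plain file I/O."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            with open(os.path.join(dirpath, filename), "rb") as handle:
                total += sum(1 for _ in handle)
    return total


# Hypothetical NFS mount point for the cluster; adjust to the real mount.
# print(count_lines("/mapr/my.cluster.com/data/logs"))
```

Because the code uses only the standard library, the same script runs unchanged against a local directory during development.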

Page 43: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Multi-Tenancy Support for Parallelized App Development

Isolation

• Tasks sandboxed so they don’t impact other tasks or system daemons

• System resources protected from runaway jobs

• Volume-based data segregation based on users and groups

• Volume-based data placement to control

• Label-based job scheduling to control

Quotas

• Storage quotas by volume/user/group

• CPU and memory quotas by queue/user/group

Security and delegation

• Fine-grained administration permissions including volume-level delegation

• Authenticate users to AD, LDAP and Kerberos via Linux PAM

Reporting

• Detailed reporting on resource usage (75+ different metrics)

• All reports are available via UI, CLI and REST API

Page 44: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

MapR M7: The Best In-Hadoop NoSQL Database

Benefit: Features

High Performance: Over 1 million ops/sec with 10 node cluster
Continuous Low Latency: No I/O storms, no compactions
24x7 Applications: Instant recovery, online schema modification, snapshots, mirroring
Zero Administration: No processes to manage, automated splits, self-tuning
High Scalability: 1 trillion tables, billions of rows, millions of columns
Low TCO: Files and tables on one platform, more work with fewer nodes

Categories: Performance, Reliability, Easy Administration

Page 45: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Flux7: Comparative Study of Hadoop Distributions

Web Search and Data Analytics Benchmarks

[Bar charts: Page Rank and Hive JOIN Query; y-axis: time in seconds; lower is better. Extracted bar values: 425, 925, 333, 563, 367, 532, 163, 331. Extracted legend entries: IDH 2.4.1, CDH 4.3.]

Source: Flux7 Labs Study, October 2013

Hardware Specs: EC2 on AWS
1 Master: m1.xlarge; 64-bit; 4 vCPU, 8 ECU; 15 GiB RAM; 4x420 GB Storage; 4x Intel® Xeon® CPU E5-2650 0 @ 2.00 GHz
4 Slaves: m1.large; 64-bit; 2 vCPU, 4 ECU; 7.5 GiB RAM; 2x420 GB Storage; 2x Intel® Xeon® CPU E5430 @ 2.66 GHz

Page 46: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Comparative Study of Hadoop Distributions

Read and Write Throughput Benchmarks

[Bar charts: DFSIO Read Throughput and DFSIO Write Throughput; y-axis: MB per second; higher is better. Extracted bar values: 212, 59, 262, 69, 276, 64, 475, 465. Legend: IDH, CDH, HDP, MapR.]

Source: Flux7 Labs Study, October 2013

http://flux7.com/blogs/case-studies/hadoop-distributions-a-detailed-comparative-study-whitepaper/

Hardware Specs: EC2 on AWS
1 Master: m1.xlarge; 64-bit; 4 vCPU, 8 ECU; 15 GiB RAM; 4x420 GB Storage; 4x Intel® Xeon® CPU E5-2650 0 @ 2.00 GHz
4 Slaves: m1.large; 64-bit; 2 vCPU, 4 ECU; 7.5 GiB RAM; 2x420 GB Storage; 2x Intel® Xeon® CPU E5430 @ 2.66 GHz

Page 47: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

MapR M7: The Best In-Hadoop Database

NoSQL Columnar Store

Apache HBase API

Integrated with Hadoop

[Diagram: Other Distros stack: HBase, JVM, HDFS, JVM, ext3/ext4, Disks. MapR M7 stack: Tables/Files, Disks.]

The most scalable, enterprise-grade NoSQL database that supports online applications and analytics

Page 48: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Customer Case Studies: MapR with Amazon Elastic MapReduce in Action

Page 49: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Use cases for MapR with Amazon EMR

• Targeted advertising / clickstream analysis

• Security: anti-virus, fraud detection, image recognition

• Pattern matching / recommendations

• Reporting / BI

• Bio-informatics (genome analysis)

• Financial simulation (Monte Carlo simulation)

• File processing (resize jpegs, video encoding)

• Web indexing

Page 50: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Case Study

Outcomes from MapR Deployment w/ EMR

• Increased flexibility to scale at lower costs

• Faster turnaround for customer requests

• Ease of experimentation

Challenges

• RDBMS on AWS too slow

• Solution must be compatible with AWS & Java 7

• High performance

Page 51: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Case Study

Outcomes from MapR Deployment w/ EMR

• Faster machine learning performance enables more/faster simulations

• MapR M7 provides geospatial database backed by Amazon S3

Challenges

• Large volumes of sensor data

• Project weather for 2.5 years at every 20x20 plot across the US

• Climatology simulations need to quickly experiment at small scale and then scale reliably

Page 52: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Demo

Page 53: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

MapR/EMR Demonstration

• Create MapR cluster using EMR

• Review MapR Control System (MCS)

• Show S3 and MapR integration

• Demonstrate MapR’s real-time capability

• Connect Mac to MapR via NFS

• Run queries with HiveServer2 and Impala

• Visualize data with Tableau

Page 54: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Questions and Contact

MapR: http://aws.amazon.com/elasticmapreduce/mapr/

AWS Contact: aws.amazon.com/contact-us

@mapr @awscloud

Maprtech Amazon Web Services

Page 55: AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

We’d like your feedback. Please complete a short survey:

https://aws.asia.qualtrics.com/SE/?SID=SV_brzWlylHrqM29tr