AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS
DESCRIPTION
Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions, and Amazon EMR gives you the option of using the Amazon Distribution or the MapR Distribution for Hadoop. This webinar shows examples of how to use Amazon EMR with the MapR Distribution for Hadoop. You will learn how to free yourself from the heavy lifting required to run Hadoop on-premises and gain the advantages of the cloud: more flexibility and faster projects at lower cost. What we'll learn:
• See a live demonstration of how to launch your first Hadoop cluster in a few steps.
• Examples of real-world applications and customer successes in production.
• Best practices for maximizing the benefits of using MapR with AWS.
TRANSCRIPT
Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS
Introducing
Maya Cabassi, Partner Marketing Manager, Amazon Web Services
Webinar Overview
• Submit your questions using the Q&A tool.
• A copy of today’s presentation will be made available on:
  – AWS SlideShare Channel: http://www.slideshare.net/AmazonWebServices/
  – AWS Webinar Channel on YouTube: http://www.youtube.com/channel/UCT-nPlVzJI-ccQXlxjSvJmw
Introducing
Jonathan Fritz, Sr. Product Manager, Amazon Web Services
Steve Wooledge, VP, Product Marketing, MapR Technologies
Bruce Penn, Principal Sales Engineer, MapR Technologies
What We’ll Cover
• Elastic MapReduce (EMR): Hadoop in the cloud
  – Elastic clusters tailored for your workflows
  – Best container to run Hadoop in the AWS Ecosystem
• Introduction to MapR’s Hadoop Platform
  – Defining features
  – Increased performance
• Case Studies: MapR + Elastic MapReduce
• Q&A
Hadoop in the Cloud
Using MapR and Amazon Elastic MapReduce to unlock Big Data
Jonathan Fritz, Sr. Product Manager, Amazon Web Services
Steve Wooledge, VP, Product Marketing, MapR Technologies
The Three V’s: the drivers behind Big Data
Variety
Velocity
Volume
• YouTube users upload 48 hours of new video every minute of the day
• Twitter sees roughly 175 million tweets every day
• Facebook analyzes 30+ petabytes of user-generated data
• More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide
• 2.7 zettabytes of data exist in the digital universe today
• Data production will be 44 times greater in 2020 than in 2009
Hadoop is the right system for Big Data
• Scalable and fault tolerant
• Flexibility for multiple languages and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
Challenges with managing Hadoop
On-Premises
• Manage HDFS, upgrades, and system administration
• Pay for expensive support contracts
• Select hardware in advance and stick with predictions
Cloud
• Hard to tightly integrate with AWS storage services
• Independently manage and monitor clusters
Amazon Elastic MapReduce (EMR) is the easiest way to run Hadoop in the cloud.
• Managed services
• Easy to tune clusters and trim costs
• Support for multiple AWS datastores
• Unique features and ecosystem support
Why Amazon Elastic MapReduce?
[Diagram, built up over several slides. Labels: Input data (S3, DynamoDB, Redshift); Code; Elastic MapReduce; Name node; Elastic cluster; S3/HDFS; Queries + BI via JDBC, Pig, Hive; Output]
Elastic clusters. Customize size and type to reduce costs.
Choose your instance types
Try out different configurations to find your optimal architecture.
• CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge
• Memory: m1.large, m2.2xlarge, m2.4xlarge
• Disk: hs1.8xlarge
Long running or transient clusters
Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need.
[Figure label: = 10 hours]
Resizable clusters
Easy to add and remove compute capacity on your cluster.
• Peak capacity
• Matched compute demands with cluster sizing.
[Figure labels: 6 hours; 10 hours]
Use Spot and Reserved Instances. Minimize costs by supplementing on-demand pricing.
Easy to use Spot Instances
Name-your-price supercomputing to minimize costs.
• Spot for task nodes: up to 90% off EC2 on-demand pricing
• On-demand for core nodes: standard EC2 pricing for on-demand capacity
24/7 clusters on Reserved Instances
Minimize cost for consistent capacity.
• Reserved Instances for long running clusters: up to 65% off on-demand pricing
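The pricing split on the Spot and Reserved slides can be turned into a quick back-of-the-envelope estimate. The sketch below is only an illustration, not an AWS pricing tool: the function names and the $0.50 hourly rate are hypothetical, the 90% and 65% figures are the "up to" upper bounds quoted on the slides, and the separate Amazon EMR charge on top of EC2 is left out.

```python
def discounted_rate(on_demand_rate, discount):
    """Return the hourly rate after a fractional discount (0.0 to 1.0)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0.0 and 1.0")
    return on_demand_rate * (1.0 - discount)


def hourly_cluster_cost(core_nodes, task_nodes, on_demand_rate,
                        core_discount=0.0, task_discount=0.0):
    """Estimate hourly EC2 cost for a cluster of core and task nodes.

    core_discount models Reserved Instances for core nodes,
    task_discount models Spot pricing for task nodes.
    """
    core = core_nodes * discounted_rate(on_demand_rate, core_discount)
    task = task_nodes * discounted_rate(on_demand_rate, task_discount)
    return core + task


if __name__ == "__main__":
    RATE = 0.50  # hypothetical on-demand price per instance-hour, in USD

    # 4 core nodes and 10 task nodes, everything on-demand.
    baseline = hourly_cluster_cost(4, 10, RATE)
    # Same cluster, task nodes on Spot at the best-case 90% discount.
    with_spot = hourly_cluster_cost(4, 10, RATE, task_discount=0.90)
    # Same cluster, core nodes also on Reserved at the best-case 65% discount.
    with_both = hourly_cluster_cost(4, 10, RATE,
                                    core_discount=0.65, task_discount=0.90)

    for label, cost in [("all on-demand", baseline),
                        ("Spot task nodes", with_spot),
                        ("Spot + Reserved", with_both)]:
        print(f"{label}: ${cost:.2f}/hour")
```

Keeping core nodes on on-demand or Reserved capacity and putting only task nodes on Spot follows the split shown on the slide; actual Spot discounts vary over time, so the 90% figure should be read as a ceiling.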
Your data, your choice. Easy to integrate Elastic MapReduce with your datastores.
Using Amazon S3 and On-Cluster Storage
• Data Sources
• Data aggregated and stored in Amazon S3
• Transient EMR cluster for batch map/reduce jobs for daily reports
• Long running EMR cluster holding data on the cluster in a NoSQL database
• Weekly Report
• Ad-hoc Query
Use Amazon EMR with Redshift and S3
• Data Sources
• Daily data aggregated in Amazon S3
• Amazon EMR cluster used to process data
• Processed data loaded into Amazon Redshift data warehouse
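The last step of the EMR, S3 and Redshift pipeline, loading processed data into the warehouse, is typically done with Redshift's COPY command reading from S3. The sketch below only builds the SQL text and does not connect to anything. The table name, bucket prefix, IAM role ARN and the comma delimiter are all hypothetical placeholders, and the slides do not say which load options the presenters used.

```python
import re

# Conservative pattern for an unquoted SQL identifier.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_copy_statement(table, s3_prefix, iam_role_arn, delimiter=","):
    """Build a Redshift COPY statement that loads delimited files from S3."""
    if not _IDENTIFIER.match(table):
        raise ValueError("table must be a simple SQL identifier")
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with s3://")
    # Reject single quotes so the values cannot break out of the SQL literals.
    for value in (s3_prefix, iam_role_arn, delimiter):
        if "'" in value:
            raise ValueError("values must not contain single quotes")
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"DELIMITER '{delimiter}';"
    )


if __name__ == "__main__":
    # All three arguments are made-up example values.
    print(build_copy_statement(
        "daily_events",
        "s3://example-bucket/processed/",
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    ))
```

The resulting statement would be run through whatever SQL client is already connected to the Redshift cluster.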
© 2014 MapR Technologies 34 © 2014 MapR Technologies
Introduction to MapR
MAPR: WORLDWIDE HADOOP TECHNOLOGY LEADER
UNIQUELY ADDRESSES BOTH ANALYTIC AND OPERATIONAL USE CASES
500+ PAYING CUSTOMERS
Hadoop Distributions
[Diagram labels: Open Source; Distribution A; Distribution B; Management; Architectural Innovations]
MapR Distribution for Hadoop
Management
APACHE HADOOP & OSS ECOSYSTEM
Impala, Shark, Hive, Pig, Hue, Oozie, ZooKeeper, Mahout, MLlib, Juju, Solr, Cascading, HttpFS, Flume, Storm, Spark Streaming, YARN, MapReduce, HBase, Whirr, Sqoop, Drill, Tez, Knox, Sentry, Spark, Falcon
MapR Data Platform
Enterprise-grade
• High availability
• Data protection
• Disaster recovery
Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support
Security
• Enterprise security authorization
• Wire-level authentication
• Data governance
Operational
• Ability to support predictive analytics, real-time database operations, and support high arrival rate data
Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators
Performance
• 2X to 7X higher performance
• Consistent, low latency
A winning combination: MapR with Amazon Elastic MapReduce
Launching a Cluster
MapR Option Integrated within EMR
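Besides the console option shown on this slide, a cluster launch can also be described programmatically. The sketch below builds a request dictionary in the shape accepted by boto3's EMR `run_job_flow` call and does not contact AWS. It is an assumption-laden illustration: the helper name, the cluster name and the defaults are hypothetical, and the "mapr" product name with an "--edition" argument should be checked against the EMR documentation for the release in use, since MapR is only available on certain EMR versions.

```python
def build_mapr_cluster_request(name, edition="m3", instance_type="m1.large",
                               instance_count=5, keep_alive=False):
    """Build keyword arguments for an EMR cluster that uses a MapR edition.

    keep_alive=False describes a transient cluster that shuts down when its
    steps finish; keep_alive=True describes a long running cluster.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        # Assumed product name and argument format for selecting MapR.
        "NewSupportedProducts": [
            {"Name": "mapr", "Args": ["--edition", edition]},
        ],
    }


if __name__ == "__main__":
    request = build_mapr_cluster_request("example-mapr-cluster",
                                         keep_alive=True)
    print(request)
    # With boto3 installed and AWS credentials configured, the request could
    # be submitted as: boto3.client("emr").run_job_flow(**request)
    # A real launch also needs settings omitted here, such as an AMI version
    # or release label and IAM roles.
```

Building the request separately from submitting it keeps the sketch testable offline and makes the transient versus long running choice an explicit parameter.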
MapR: Designed for Both Transient and Long-Term Clusters
• High Availability
• Easy Development
• Multi-Tenancy
• World-Record Performance
• Breadth of Applications
Fastest On-Ramp to Develop Hadoop Applications
Best Platform for Long-Term Hadoop Production Success
High Availability (HA) For Hadoop
No-NameNode architecture
• Distributed metadata can self-heal
• No practical limit on # of files
Resource Manager HA, Application Master HA
• YARN jobs are not impacted by failures
• Continue to meet SLAs with MapReduce v2
JobTracker HA for MRv1
• MapReduce v1 jobs are not impacted by failures
• Meet your data processing SLAs
NFS HA
• High throughput and resilience for NFS-based data ingestion, import/export and multi-client access
Instant recovery
• Files and tables are accessible within seconds of a node failure or cluster restart
Direct Integration with Existing Applications
• 100% POSIX compliant
• Industry standard APIs - NFS, ODBC, LDAP, REST
• More 3rd-party solutions
• No proprietary connectors required
• Language neutral
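Because the cluster is exposed through standard interfaces such as NFS, ordinary file APIs are enough to read data stored on it. The sketch below assumes the cluster has already been NFS-mounted on the local machine; the mount path `/mapr/my.cluster.com/data/logs` and the `.log` suffix are hypothetical, and the helper functions are plain Python file operations with nothing MapR-specific in them.

```python
from pathlib import Path


def list_files(directory, suffix=".log"):
    """Return the sorted names of regular files in directory with the suffix."""
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.name.endswith(suffix)
    )


def count_lines(path):
    """Count lines in a text file using ordinary buffered reads."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__":
    # Hypothetical location of cluster data under a local NFS mount.
    mount = Path("/mapr/my.cluster.com/data/logs")
    if mount.is_dir():
        for name in list_files(mount):
            print(name, count_lines(mount / name))
    else:
        print(f"{mount} is not mounted on this machine")
```

The same code runs unchanged against a local directory, which is the practical meaning of "no proprietary connectors required".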
Multi-Tenancy Support for Parallelized App Development
Isolation
• Tasks sandboxed so they don’t impact other tasks or system daemons
• System resources protected from runaway jobs
• Volume-based data segregation based on users and groups
• Volume-based data placement to control
• Label-based job scheduling to control
Quotas
• Storage quotas by volume/user/group
• CPU and memory quotas by queue/user/group
Security and delegation
• Fine-grained administration permissions including volume-level delegation
• Authenticate users to AD, LDAP and Kerberos via Linux PAM
Reporting
• Detailed reporting on resource usage (75+ different metrics)
• All reports are available via UI, CLI and REST API
MapR M7: The Best In-Hadoop NoSQL Database
Benefit | Features
High Performance | Over 1 million ops/sec with 10 node cluster
Continuous Low Latency | No I/O storms, no compactions
24x7 Applications | Instant recovery, online schema modification, snapshots, mirroring
Zero Administration | No processes to manage, automated splits, self-tuning
High Scalability | 1 trillion tables, billions of rows, millions of columns
Low TCO | Files and tables on one platform, more work with fewer nodes
Benefit groups: Performance; Reliability; Easy Administration
Flux7: Comparative Study of Hadoop Distributions
Web Search and Data Analytics Benchmarks
[Two bar charts: Page Rank and Hive JOIN Query. Y-axis: time in seconds; lower is better. Legend entries recovered: IDH 2.4.1, CDH 4.3. Bar values as extracted: 425, 925, 333, 563, 367, 532, 163, 331]
Source: Flux7 Labs Study, October 2013
Hardware Specs: EC2 on AWS
1 Master: m1.xlarge; 64-bit; 4 vCPU, 8 ECU; 15 GiB RAM; 4x420 GB Storage; 4x Intel® Xeon® CPU E5-2650 0 @ 2.00 GHz
4 Slaves: m1.large; 64-bit; 2 vCPU, 4 ECU; 7.5 GiB RAM; 2x420 GB Storage; 2x Intel® Xeon® CPU E5430 @ 2.66 GHz
Comparative Study of Hadoop Distributions
Read and Write Throughput Benchmarks
[Two bar charts: DFSIO Read Throughput and DFSIO Write Throughput. Y-axis: MB per second; higher is better. Legend: IDH, CDH, HDP, MapR. Bar values as extracted: 212, 59, 262, 69, 276, 64, 475, 465]
Source: Flux7 Labs Study, October 2013
http://flux7.com/blogs/case-studies/hadoop-distributions-a-detailed-comparative-study-whitepaper/
Hardware Specs: EC2 on AWS
1 Master: m1.xlarge; 64-bit; 4 vCPU, 8 ECU; 15 GiB RAM; 4x420 GB Storage; 4x Intel® Xeon® CPU E5-2650 0 @ 2.00 GHz
4 Slaves: m1.large; 64-bit; 2 vCPU, 4 ECU; 7.5 GiB RAM; 2x420 GB Storage; 2x Intel® Xeon® CPU E5430 @ 2.66 GHz
MapR M7: The Best In-Hadoop Database
• NoSQL Columnar Store
• Apache HBase API
• Integrated with Hadoop
[Diagram comparing storage stacks. Other Distros: HBase, JVM, HDFS, JVM, ext3/ext4, Disks. MapR M7: Tables/Files, Disks]
The most scalable, enterprise-grade, NoSQL database that supports online applications and analytics
Customer Case Studies
MapR with Amazon Elastic MapReduce in Action
Use cases for MapR with Amazon EMR
• Targeted advertising / clickstream analysis
• Security: anti-virus, fraud detection, image recognition
• Pattern matching / recommendations
• Reporting / BI
• Bio-informatics (genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize jpegs, video encoding)
• Web indexing
Case Study
Outcomes from MapR Deployment w/ EMR
• Increased flexibility to scale at lower costs
• Faster turnaround for customer requests
• Ease of experimentation
Challenges
• RDBMS on AWS too slow
• Solution must be compatible with AWS & Java 7
• High performance
Case Study
Outcomes from MapR Deployment w/ EMR
• Faster machine learning performance enables more/faster simulations
• MapR M7 provides geospatial database backed by Amazon S3
Challenges
• Large volumes of sensor data
• Project weather for 2.5 years at every 20x20 plot across the US
• Climatology simulations need to quickly experiment at small scale and then scale reliably
Demo
MapR/EMR Demonstration
• Create MapR cluster using EMR
• Review MapR Control System (MCS)
• Show S3 and MapR integration
• Demonstrate MapR’s real-time capability
• Connect Mac to MapR via NFS
• Run queries with HiveServer2 and Impala
• Visualize data with Tableau
Questions and Contact
MapR: http://aws.amazon.com/elasticmapreduce/mapr/
AWS Contact: aws.amazon.com/contact-us
@mapr @awscloud
Maprtech Amazon Web Services
We’d like your feedback. Please complete a short survey
https://aws.asia.qualtrics.com/SE/?SID=SV_brzWlylHrqM29tr