How to run your Hadoop cluster in 10 minutes
TRANSCRIPT
![Page 1: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/1.jpg)
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Vladimir Simek, Solutions Architect @ AWS
22/03/2016
Amazon Elastic MapReduce
How to run your Hadoop Cluster in 10 minutes
![Page 2: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/2.jpg)
Agenda
• Two different companies – 2 stories
• Challenges with Big Data on premises
• Technical introduction to Amazon EMR
• Amazon EMR features and benefits
• Use case of AOL – moving a 2 PB on-prem Hadoop cluster to the AWS cloud
• Short demos
![Page 3: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/3.jpg)
In the beginning – 2 different stories
![Page 4: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/4.jpg)
• In 2007, The New York Times decided to create a digital archive on the web – all articles from 1851 to 1922
• 11 million articles (4 TB of data) composed of:
  • 405,000 large TIFF images
  • 405,000 XML files
  • 3.3 million SGML files
• Used Amazon EC2 and Hadoop to process the data
![Page 5: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/5.jpg)
Time to process? Less than 24 hours
Costs? About $240
![Page 6: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/6.jpg)
![Page 7: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/7.jpg)
(Undisclosed international company) – subsidiary in France
• In 2014, decided to run a proof of concept (POC) on Big Data analytics
• What was their first step? They invested €7M in buying servers
![Page 8: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/8.jpg)
“Want to increase innovation? Lower the cost of failure.”
Joi Ito, Director of MIT Media Lab
![Page 9: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/9.jpg)
How many big ticket technology ideas can your budget tolerate?
€7M
![Page 10: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/10.jpg)
(Big) Data for Competitive Advantage
Customer segmentation
Marketing spend optimization
Financial modeling & forecasting
Ad targeting & real-time bidding
Clickstream analysis
Fraud detection
Security threat detection
![Page 11: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/11.jpg)
Challenges with In-House Infrastructure
Fixed Cost
Slow Deployment Cycle
Always On Self Serve
Static : Not Scalable Outages Impact Production Upgrade
Storage Compute
![Page 12: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/12.jpg)
What is Amazon EMR, and how does it address these issues?
![Page 13: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/13.jpg)
Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize
![Page 14: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/14.jpg)
Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud
![Page 15: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/15.jpg)
What Do I Need to Build a Cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
![Page 16: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/16.jpg)
Choice of Multiple Instances
CPU: c3 family, cc1.4xlarge, cc2.8xlarge
Memory: m2 family, r3 family
Disk/IO: d2 family, i2 family
General: m1 family, m3 family
Machine Learning
Batch Processing
In-memory (Spark & Presto)
Large HDFS
![Page 17: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/17.jpg)
Select an Instance
![Page 18: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/18.jpg)
Choose Your Software (Quick Bundles)
![Page 19: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/19.jpg)
Choose Your Software – Custom
![Page 20: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/20.jpg)
Hadoop Applications Available in Amazon EMR
![Page 21: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/21.jpg)
Choose Security and Access Control
![Page 22: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/22.jpg)
You Are Up and Running!
![Page 23: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/23.jpg)
You Are Up and Running!
Master Node DNS
![Page 24: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/24.jpg)
You Are Up and Running!
Information about the software you are running, logs and features
![Page 25: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/25.jpg)
You Are Up and Running!
Infrastructure for this cluster
![Page 26: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/26.jpg)
You Are Up and Running!
Security Groups and Roles
![Page 27: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/27.jpg)
Use the CLI
aws emr create-cluster --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK
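As one possible SDK equivalent of the CLI call on this slide, here is a minimal Python sketch using boto3's `run_job_flow`. The cluster name, the default role names (`EMR_EC2_DefaultRole`, `EMR_DefaultRole`), and the keep-alive flag are assumptions added for illustration; they are not on the slide. The request is built as plain data so it can be inspected without AWS credentials.

```python
def build_cluster_request(name="demo-cluster"):
    """Build run_job_flow keyword arguments mirroring the CLI example.

    The name and both role names are assumptions; adjust them to your account.
    """
    return {
        "Name": name,  # hypothetical cluster name
        "ReleaseLabel": "emr-4.0.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 2},
            ],
            # Assumption: keep the cluster running even though no steps are submitted.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default EMR service role
    }


def launch(request):
    """Submit the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the request can be built without boto3

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]


request = build_cluster_request()
total_nodes = sum(g["InstanceCount"] for g in request["Instances"]["InstanceGroups"])
print(f"Would request {total_nodes} x m3.xlarge on {request['ReleaseLabel']}")
```

Calling `launch(request)` starts a real, billable cluster, so the sketch only prints a summary by default.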
![Page 28: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/28.jpg)
Demo – Build EMR cluster
![Page 29: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/29.jpg)
Now that I have a cluster, I need to process some data
![Page 30: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/30.jpg)
Amazon EMR can process data from multiple sources
Hadoop Distributed File System (HDFS)
Amazon S3 (EMRFS)
Amazon DynamoDB
Amazon Kinesis
![Page 31: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/31.jpg)
![Page 32: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/32.jpg)
On an On-premises Environment
Tightly coupled
![Page 33: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/33.jpg)
Compute and Storage Grow Together
Tightly coupled
Storage grows along with compute
Compute requirements vary
![Page 34: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/34.jpg)
Underutilized or Scarce Resources
[Chart: x-axis 1 to 26, y-axis 0 to 120; labels: Re-processing, Weekly peaks, Steady state]
![Page 35: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/35.jpg)
Underutilized or Scarce Resources
[Chart: x-axis 1 to 26, y-axis 0 to 120; labels: Underutilized capacity, Provisioned capacity]
![Page 36: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/36.jpg)
Contention for Same Resources
Compute bound
Memory bound
![Page 37: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/37.jpg)
Separation of Resources Creates Data Silos
Team A
![Page 38: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/38.jpg)
Replication Adds to Cost
3x
Single datacenter
![Page 39: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/39.jpg)
So how does Amazon EMR solve these problems?
![Page 40: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/40.jpg)
Decouple Storage and Compute
![Page 41: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/41.jpg)
Amazon S3 is Your Persistent Data Store
Designed for 11 9’s durability
$0.03 / GB / month in Ireland
Lifecycle policies
Versioning
Distributed by default
EMRFS
Amazon S3
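A rough sizing sketch for this slide, under loudly stated assumptions: a hypothetical 100 TB dataset, the slide's $0.03 / GB / month treated as a flat rate (ignoring request, transfer, and any tiered pricing), 1 TB counted as 1,000 GB, and the 3x HDFS replication factor from the earlier replication slide.

```python
S3_PRICE_PER_GB_MONTH = 0.03  # from the slide (Ireland); treated as a flat rate
HDFS_REPLICATION = 3          # from the "Replication Adds to Cost" slide


def s3_monthly_cost(dataset_tb, price_per_gb=S3_PRICE_PER_GB_MONTH):
    """Monthly S3 storage cost in USD, assuming 1 TB = 1,000 GB."""
    return dataset_tb * 1000 * price_per_gb


def hdfs_raw_capacity_tb(dataset_tb, replication=HDFS_REPLICATION):
    """Raw disk capacity an HDFS cluster needs to hold the same data."""
    return dataset_tb * replication


dataset_tb = 100  # hypothetical dataset size
print(f"S3 storage: about ${s3_monthly_cost(dataset_tb):,.0f} per month")
print(f"HDFS raw capacity: {hdfs_raw_capacity_tb(dataset_tb)} TB of provisioned disk")
```

The point of the comparison is only that S3 is billed for the logical data size, while replicated HDFS needs three times that in provisioned disk.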
![Page 42: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/42.jpg)
The Amazon EMR File System (EMRFS)
• Allows you to leverage Amazon S3 as a file system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than open source components
• Consistent view – consistency for read after write
• Support for encryption
• Fast listing of objects
![Page 43: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/43.jpg)
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'
![Page 44: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/44.jpg)
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
![Page 45: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/45.jpg)
Benefit 1: Switch Off Clusters
Amazon S3 Amazon S3 Amazon S3
![Page 46: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/46.jpg)
Auto-Terminate Clusters
![Page 47: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/47.jpg)
You Can Build a Pipeline
![Page 48: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/48.jpg)
Run Transient or Long-Running Clusters
![Page 49: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/49.jpg)
Benefit 2: Resize Your Cluster
![Page 50: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/50.jpg)
Resize the Cluster
Scale Up, Scale Down, Stop a resize, issue a resize on another
![Page 51: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/51.jpg)
How do you scale up and save on cost?
![Page 52: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/52.jpg)
Spot Instance
Bid Price
OD Price
![Page 53: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/53.jpg)
Spot Integration
aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
![Page 54: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/54.jpg)
Spot Integration with Amazon EMR
• Can provision instances from the Spot market
• Impact of interruption
  • Master node – Can lose the cluster
  • Core node – Can lose intermediate data
  • Task nodes – Jobs will restart on other nodes (application dependent)
![Page 55: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/55.jpg)
Scale up with Spot Instances
10-node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140
![Page 56: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/56.jpg)
Resize Nodes with Spot Instances
Add 10 more nodes on Spot
![Page 57: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/57.jpg)
Resize Nodes with Spot Instances
20-node cluster running for 7 hours
On-Demand cost = 1.0 * 10 * 7 = $70
Spot cost = 0.5 * 10 * 7 = $35
Total $105
![Page 58: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/58.jpg)
Resize Nodes with Spot Instances
50% less run-time (14 → 7 hours)
25% less cost ($140 → $105)
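The arithmetic on these slides can be checked with a short Python sketch. It assumes, as the slides do, $1.0 per node-hour for the original 10 nodes, $0.5 per node-hour for the 10 added Spot nodes, and a job that scales linearly, so doubling the nodes halves the run-time. The function name is made up for illustration.

```python
def cluster_cost(groups, hours):
    """Total cost for a list of (price_per_node_hour, node_count) groups."""
    return sum(price * count for price, count in groups) * hours


# Baseline: 10 nodes at $1.0 per node-hour for 14 hours.
baseline = cluster_cost([(1.0, 10)], hours=14)

# Resized: the same 10 nodes plus 10 Spot nodes at $0.5, finishing in 7 hours.
resized = cluster_cost([(1.0, 10), (0.5, 10)], hours=7)

runtime_saving = 1 - 7 / 14
cost_saving = 1 - resized / baseline

print(f"Baseline ${baseline:.0f}, resized ${resized:.0f}")
print(f"Run-time saving {runtime_saving:.0%}, cost saving {cost_saving:.0%}")
```

Real Spot prices fluctuate and real jobs rarely scale perfectly linearly, so treat the 25% figure as an illustration of the mechanism, not a forecast.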
![Page 59: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/59.jpg)
Intelligent Scale Down
![Page 60: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/60.jpg)
Effectively Utilize Clusters
![Page 61: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/61.jpg)
Benefit 3: Logical Separation of Jobs
Hive, Pig, Cascading
Prod
Presto Ad-Hoc
Amazon S3
![Page 62: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/62.jpg)
Benefit 4: Disaster Recovery Built In
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Amazon S3
Availability Zone Availability Zone
![Page 63: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/63.jpg)
Demo 2 – Word Count Example
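The demo's actual code is not part of this transcript. As a generic illustration of what a word count computes, here is a minimal pure-Python sketch of the map and reduce phases. The tokenization rule and the sample lines are assumptions, and on a real cluster the shuffle between the two phases is handled by Hadoop.

```python
import re
from collections import Counter


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in re.findall(r"[a-z0-9']+", line.lower()):
        yield word, 1


def reducer(pairs):
    """Reduce phase: sum the counts emitted for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical input lines standing in for files in HDFS or Amazon S3.
lines = ["Hadoop on EMR", "hadoop in ten minutes"]
pairs = (pair for line in lines for pair in mapper(line))
word_counts = reducer(pairs)
print(word_counts)
```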
![Page 64: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/64.jpg)
Case study: How AOL moved a 2 PB cluster to the AWS cloud
![Page 65: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/65.jpg)
AOL Data Platforms Architecture 2014
AOL
Source Systems In-house Hadoop Cluster
Database
Reporting Tools
Users
![Page 66: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/66.jpg)
Data Stats & Insights
Cluster Size: 2 PB
In-House Cluster: 100 Nodes
Raw Data/Day: 2-3 TB
Data Retention: 13-24 Months
![Page 67: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/67.jpg)
Challenges with In-House Infrastructure
Fixed Cost
Slow Deployment Cycle
Always On Self Serve
Static : Not Scalable Outages Impact Production Upgrade
Storage Compute
![Page 68: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/68.jpg)
AOL Data Platforms Architecture 2015
Source Systems
Amazon S3
Amazon EMR
Cluster Watchdog
Amazon SNS
Amazon IAM
AOL
AWS Direct Connect
Reporting Tools
Database
Users
![Page 69: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/69.jpg)
EMR Design Options
Transient vs. Persistent Cluster
Amazon S3 vs. local HDFS
Elastic Cluster vs. Static Cluster
On-Demand vs. Reserved vs. Spot
Core Nodes vs. Task Nodes
Amazon EMR
![Page 70: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/70.jpg)
AWS vs. In-House Cost
[Chart: Cost Comparison by service, AWS vs. In-House]
Source: AOL & AWS Billing Tool
4x In-House / Month
1x AWS / Month
** In-House cluster includes Storage, Power and Network cost.
![Page 71: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/71.jpg)
[Chart: Core Nodes Demand - 06/01/2015; x-axis 1 to 24]
Restatement Use Case
• Restate historical data going back 6 months
10 Availability Zones
550 EMR Clusters
24,000 Spot EC2 Instances
[Chart: Timing Comparison, In-House vs. AWS; axis 0 to 70]
![Page 72: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/72.jpg)
Any questions?
![Page 73: How to run your Hadoop Cluster in 10 minutes](https://reader035.vdocuments.net/reader035/viewer/2022062522/58d0f4151a28abc00b8b4bc3/html5/thumbnails/73.jpg)
Thank you!