![Page 1: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/1.jpg)
AWS Gov Cloud Summit II
Analyzing Big Data with AWS
Peter Sirota, General Manager, Amazon Elastic MapReduce @petersirota
![Page 2: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/2.jpg)
What is Big Data?
![Page 3: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/3.jpg)
• Computer generated data
– Application server logs (web sites, games)
– Sensor data (weather, water, smart grids)
– Images/videos (traffic, security cameras)
![Page 4: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/4.jpg)
• Human generated data
– Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
– Blogs/Reviews/Emails/Pictures
• Social graphs – Facebook, LinkedIn, contacts
![Page 5: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/5.jpg)
Big Data is full of valuable, unanswered questions!
![Page 6: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/6.jpg)
Why is Big Data Hard (and Getting Harder)?
![Page 7: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/7.jpg)
Why is Big Data Hard (and Getting Harder)?
• Data Volume
– Unconstrained growth
– Current systems don’t scale
![Page 8: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/8.jpg)
Why is Big Data Hard (and Getting Harder)?
• Data Structure
– Need to consolidate data from multiple data sources in multiple formats across multiple businesses
![Page 9: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/9.jpg)
Why is Big Data Hard (and Getting Harder)?
• Changing Data Requirements
– Faster response time on fresher data
– Sampling is not good enough
– Increasing complexity of analytics
– Users demand inexpensive experimentation
![Page 10: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/10.jpg)
We need tools built specifically for Big Data!
![Page 11: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/11.jpg)
Innovation #1: Apache Hadoop
The MapReduce computational paradigm
Open source, scalable, fault‐tolerant, distributed system
Hadoop lowers the cost of developing a distributed system for data processing
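To make the MapReduce paradigm concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample log lines are assumptions chosen for illustration, and a real Hadoop job would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every token in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical application server log lines.
logs = ["GET /home", "GET /cart", "POST /cart"]
counts = word_count(logs)
print(counts)
```

Because each call to `map_phase` sees only one line and each call to `reduce_phase` sees only one key, the work can be split across nodes without the functions changing.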
![Page 12: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/12.jpg)
Innovation #2:
Amazon Elastic Compute Cloud (EC2)
“provides resizable compute capacity in the cloud.”
Amazon EC2 lowers the cost of operating a distributed system for data processing
![Page 13: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/13.jpg)
Amazon Elastic MapReduce =
Amazon EC2 + Hadoop
![Page 14: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/14.jpg)
Elastic MapReduce applications
• Targeted advertising / Clickstream analysis
• Security: anti-virus, fraud detection, image recognition
• Pattern matching / Recommendations
• Data warehousing / BI
• Bio-informatics (Genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize jpegs, video encoding)
• Web indexing
![Page 15: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/15.jpg)
Clickstream Analysis – Razorfish
• Big Box Retailer came to Razorfish
– 3.5 billion records
– 71 million unique cookies
– 1.7 million targeted ads required per day
• Problem: Improve Return on Ad Spend (ROAS)
![Page 16: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/16.jpg)
Clickstream Analysis – Razorfish
Targeted Ad (1.7 million per day)
User recently purchased a sports movie and is searching for video games
![Page 17: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/17.jpg)
Clickstream Analysis – Razorfish
• Lots of experimentation, but the final design: a 100-node on-demand Elastic MapReduce cluster running Hadoop
![Page 18: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/18.jpg)
Clickstream Analysis – Razorfish
Processing time dropped from 2+ days to 8 hours (with lots more data)
![Page 19: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/19.jpg)
Clickstream Analysis – Razorfish
Increased Return on Ad Spend by 500%
![Page 20: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/20.jpg)
• World’s largest handmade marketplace
– 8.9 million items
– 1 billion page views per month
– $320 million in gross merchandise sales (GMS) in 2010
![Page 21: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/21.jpg)
• Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1,000 nodes
[Diagram labels: Production DB snapshots; Web event logs; ETL – Step 1; ETL – Step 2; Job (×3)]
![Page 22: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/22.jpg)
Recommendations
The Taste Test
http://www.etsy.com/tastetest
![Page 23: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/23.jpg)
Recommendations
etsy.com/gifts
Gift Ideas for Facebook Friends
![Page 24: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/24.jpg)
![Page 25: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/25.jpg)
Yelp’s Business Generates a Lot of Data
400 GB of logs per day
~12 terabytes per month
![Page 26: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/26.jpg)
They Frequently Analyze this Data to Power Key Features of their Site
![Page 27: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/27.jpg)
Autocomplete Search
![Page 28: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/28.jpg)
Recommendations
![Page 29: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/29.jpg)
Automatic spelling corrections
![Page 30: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/30.jpg)
Automatic spelling corrections
Let’s take a look at how this works
![Page 31: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/31.jpg)
1) Load log file data for six months of user search history into Amazon S3

| Search ID | Search Text | Final Selection |
|---|---|---|
| 12423451 | westen | Westin |
| 14235235 | wisten | Westin |
| 54332232 | westenn | Westin |

[Diagram: Amazon S3]
![Page 32: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/32.jpg)
2) Spin up a 200-node cluster of virtual servers in the cloud
[Diagram: Log Files in Amazon S3; Hadoop Cluster in Amazon EMR]
![Page 33: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/33.jpg)
3) 200 nodes simultaneously analyze this data, looking for common misspellings … this takes a few hours
[Diagram: Amazon S3; Hadoop Cluster in Amazon EMR]
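The analysis in step 3 can be sketched as a small map and reduce over the search log. This is only an illustration under stated assumptions: the three rows are the sample rows from the slide, while the function names, the `min_count` threshold, and the "most frequent selection wins" rule are hypothetical and not taken from Yelp's actual job.

```python
from collections import Counter

# Sample rows from the slide: (search_id, search_text, final_selection).
SEARCH_LOG = [
    (12423451, "westen", "Westin"),
    (14235235, "wisten", "Westin"),
    (54332232, "westenn", "Westin"),
]


def map_search(record):
    """Map: emit ((typed_text, selection), 1) when the typed text differs."""
    _, text, selection = record
    if text.lower() != selection.lower():
        yield (text.lower(), selection), 1


def reduce_counts(pairs):
    """Reduce: total the counts for each (typed_text, selection) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def suggestions(log, min_count=1):
    """Keep, for each typed text, the selection users chose most often."""
    totals = reduce_counts(pair for rec in log for pair in map_search(rec))
    best = {}
    for (text, selection), count in totals.items():
        if count >= min_count and count > best.get(text, (None, 0))[1]:
            best[text] = (selection, count)
    return {text: selection for text, (selection, _) in best.items()}


corrections = suggestions(SEARCH_LOG)
print(corrections)
```

On six months of logs, raising `min_count` would filter out one-off typos so that only common misspellings produce suggestions.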
![Page 34: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/34.jpg)
4) New common misspellings and suggestions are loaded back into S3
[Diagram: Log Files in Amazon S3; Hadoop Cluster in Amazon EMR]
![Page 35: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/35.jpg)
5) When the job is done, the cluster is shut down. Yelp only pays for the time they used.
[Diagram: Log Files in Amazon S3; Amazon EMR]
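The spin-up, run, and shut-down cycle in steps 2 through 5 could be requested today through the EMR `run_job_flow` API. The sketch below only builds the request in Python and does not contact AWS. The bucket paths, script names, instance type, release label, and the split into one master plus 199 core nodes are all assumptions for illustration; the talk does not say how Yelp configured its clusters.

```python
def transient_cluster_request(name, node_count, input_uri, output_uri):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Every path, script name, instance type, and release label here is an
    illustrative assumption, not a value from the talk.
    """
    if node_count < 2:
        raise ValueError("need one master node plus at least one core node")
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": node_count - 1},
            ],
            # False makes the cluster terminate once its steps finish,
            # so billing stops when the job is done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "find-misspellings",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    # Hypothetical script locations.
                    "-files",
                    "s3://example-bucket/code/mapper.py,"
                    "s3://example-bucket/code/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", input_uri,
                    "-output", output_uri,
                ],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_cluster_request(
    "spelling-corrections", 200,
    "s3://example-bucket/search-logs/",
    "s3://example-bucket/suggestions/",
)

# With AWS credentials configured, the request could be submitted with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Reading input from S3 and writing results back to S3 is what allows the cluster itself to be thrown away after each run.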
![Page 36: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/36.jpg)
Each of their 80 developers can do this whenever they have a big data problem to analyze
[Diagram label: Log file data]
250 clusters spun up and down every week
![Page 37: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/37.jpg)
![Page 38: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/38.jpg)
Data size
• Global reach
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
![Page 39: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/39.jpg)
Stack
Application Stack
• Scala/Liftweb: API Machines, WWW Machines, Batch Jobs
• Scala application code
• Mongo/Postgres/Flat Files: Databases, Logs
Data Stack
• Amazon S3: Database Dumps, Log Files
• Hadoop: Elastic MapReduce
• Hive/Ruby/Mahout: Analytics Dashboard, MapReduce Jobs
[Diagram labels between the two stacks: mongoexport, postgres dump, Flume]
![Page 40: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/40.jpg)
Computing venue-to-venue similarity
• Spin up a 40-node cluster
• Submit Ruby streaming job
– Invert User x Venue matrix
– Grab Co-occurrences
– Compute similarity
• Spin down cluster
• Load data to app server
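The three sub-steps of the streaming job (invert, co-occurrences, similarity) can be sketched in a few lines of Python. The slide mentions a Ruby streaming job and does not name the similarity measure, so the use of Python, the cosine-style formula, and the check-in data below are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical check-ins: user -> set of venues visited.
CHECKINS = {
    "alice": {"Gorilla Coffee", "Amorino"},
    "bob": {"Gorilla Coffee", "Amorino", "Gray's Papaya"},
    "carol": {"Gray's Papaya"},
}


def invert(user_venues):
    """Invert the user x venue matrix: venue -> set of users."""
    venue_users = defaultdict(set)
    for user, venues in user_venues.items():
        for venue in venues:
            venue_users[venue].add(user)
    return venue_users


def co_occurrences(user_venues):
    """Count, for each venue pair, how many users checked in at both."""
    counts = defaultdict(int)
    for venues in user_venues.values():
        for a, b in combinations(sorted(venues), 2):
            counts[(a, b)] += 1
    return counts


def similarities(user_venues):
    """Cosine similarity: shared users / sqrt(users of a * users of b)."""
    venue_users = invert(user_venues)
    return {
        (a, b): n / sqrt(len(venue_users[a]) * len(venue_users[b]))
        for (a, b), n in co_occurrences(user_venues).items()
    }


sims = similarities(CHECKINS)
for pair, score in sorted(sims.items()):
    print(pair, round(score, 2))
```

In a cluster setting, the co-occurrence counting is the part that benefits most from many nodes, since every user's venue list can be processed independently.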
![Page 41: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/41.jpg)
Who is checking in?
[Figure: two charts, one by Gender (Female, Male) and one by Age (0 to 80); vertical axis from 0 to 0.6]
![Page 42: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/42.jpg)
What are people doing?
![Page 43: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/43.jpg)
Where are our users?
![Page 44: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/44.jpg)
When do people go to a place?
[Figure: charts for three venues (Gorilla Coffee, Gray's Papaya, Amorino) over Thursday, Friday, Saturday, and Sunday]
![Page 45: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/45.jpg)
Why are people checking in?
• Explore their city, discover new places
• Find friends, meet up
• Save with local deals
• Get insider tips on venues
• Personal analytics, diary
• Follow brands and celebrities
• Earn points, badges, gamification of life
• The list grows…
![Page 46: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/46.jpg)
Thousands of customers using EMR
![Page 47: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/47.jpg)
RDBMS vs. MapReduce/Hadoop
• RDBMS
– Predefined schema
– Strategic data placement for query tuning
– Exploits indexes for fast retrieval
– SQL only
– Doesn’t scale linearly
• MapReduce/Hadoop
– No schema is required
– Random data placement
– Fast scan of the entire dataset
– Uniform query performance
– Scales linearly for reads and writes
– Supports many languages, including SQL
Complementary technologies
![Page 48: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/48.jpg)
AWS Data Warehousing Architecture
![Page 49: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/49.jpg)
Elastic Data Warehouse
• Customize cluster size to support varying resource needs (e.g. query support during the day versus batch processing overnight)
• Reduce costs by increasing server utilization
• Improve performance during high usage periods
[Diagram: Data Warehouse (Steady State) → expand to 25 instances → Data Warehouse (Batch Processing) → shrink to 9 instances → Data Warehouse (Steady State)]
![Page 50: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/50.jpg)
Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
Scenario #1: Cost without Spot
Job flow duration: 14 hours
4 instances * 14 hrs * $0.50 = $28
Scenario #2: Cost with Spot
Job flow duration: 7 hours
4 instances * 7 hrs * $0.50 = $14
+ 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75
Time Savings: 50%
Cost Savings: ~19%
Other EMR + Spot Use Cases
• Run entire cluster on Spot for biggest cost savings
• Reduce the cost of application testing
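The two scenarios can be checked with a few lines of Python. The helper name `job_cost` is hypothetical; the instance counts, hours, and hourly prices are the ones on the slide. Evaluated directly, 4 instances * 7 hrs * $0.50 comes to $14, so the mixed scenario totals $22.75 and saves about 19% relative to $28.

```python
def job_cost(groups, hours):
    """Total cost of (instance_count, hourly_price) groups run for `hours`."""
    return sum(count * price * hours for count, price in groups)


# Scenario #1: 4 On-Demand instances at $0.50/hr for 14 hours.
on_demand_only = job_cost([(4, 0.50)], hours=14)

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances
# at $0.25/hr, finishing in 7 hours.
mixed = job_cost([(4, 0.50), (5, 0.25)], hours=7)

cost_savings = 1 - mixed / on_demand_only
time_savings = 1 - 7 / 14

print(f"without Spot: ${on_demand_only:.2f}")
print(f"with Spot:    ${mixed:.2f}")
print(f"cost savings: {cost_savings:.1%}, time savings: {time_savings:.0%}")
```

The On-Demand instances keep the job running if the Spot instances are interrupted, which is the protection the slide refers to.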
![Page 51: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/51.jpg)
Big Data Ecosystem And Tools
We have a rapidly growing ecosystem
• Business Intelligence – MicroStrategy, Pentaho
• Analytics – Datameer, Karmasphere, Quest
• Open source – Ganglia, Squirrel SQL
![Page 52: Analyzing Big Data with AWS](https://reader030.vdocuments.net/reader030/viewer/2022012011/613d5254736caf36b75bf21e/html5/thumbnails/52.jpg)
Thank You!! http://aws.amazon.com/elasticmapreduce/