Orchestrate Your Big Data Workflows with AWS Data Pipeline
DESCRIPTION
Amazon offers many data services, each optimized for a specific set of structure, size, latency, and concurrency requirements. Making the best use of all these specialized services has historically required custom, error-prone data transformation and transport. Now, users can use the AWS Data Pipeline service to orchestrate data flows between Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and on-premises data stores, seamlessly and efficiently applying EC2 instances and EMR clusters to process and transform data. In this session, we teach you how to use AWS Data Pipeline to coordinate your Big Data workflows, applying the optimal data storage technology to each part of your architecture. Swipely's Head of Engineering shows how Swipely uses AWS Data Pipeline to build batch analytics, backfilling all their data, while using resources efficiently. Consequently, Swipely launches novel product features with less development time and less operational complexity. With AWS Data Pipeline, it's easier to reap the benefits of Big Data technology.
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline
Jon Einkauf (Sr. Product Manager, AWS)
Anthony Accardi (Head of Engineering, Swipely)
November 14, 2013
What are some of the challenges in dealing with data?
1. Data is stored in different formats and locations, making it hard to integrate
Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS, On-Premises
2. Data workflows require complex dependencies
Diagram: "Input Data Ready?" No → wait; Yes → Run…
• For example, a data processing step may depend on:
  • Input data being ready
  • Prior step completing
  • Time of day
  • Etc.
3. Things go wrong - you must handle exceptions
• For example, do you want to:
  • Retry in the case of failure?
  • Wait if a dependent step is taking longer than expected?
  • Be notified if something goes wrong?
4. Existing tools are not a good fit
• Expensive upfront licenses
• Scaling issues
• Don’t support scheduling
• Not designed for the cloud
• Don’t support newer data stores (e.g., Amazon DynamoDB)
Introducing AWS Data Pipeline
A simple pipeline
Diagram: Input DataNode (with PreCondition check) → Activity (with failure & delay notifications) → Output DataNode
Manages scheduled data movement and processing across AWS services
Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS
Activities:
• Copy
• MapReduce
• Hive
• Pig (New)
• SQL (New)
• Shell command
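Each activity is an object in the pipeline definition. As a minimal sketch of the shape (the ids and refs here are hypothetical, not from the talk), a Copy activity ties an input data node to an output data node on a schedule:

{
  "id": "DailyCopy",
  "type": "CopyActivity",
  "schedule": { "ref": "DailySchedule" },
  "input": { "ref": "SourceDataNode" },
  "output": { "ref": "DestinationDataNode" }
}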
Facilitates periodic data movement to/from AWS
Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS, On-Premises
Supports dependencies (Preconditions)
• Amazon DynamoDB table exists/has data
• Amazon S3 key exists
• Amazon S3 prefix is not empty
• Success of custom Unix/Linux shell command
• Success of other pipeline tasks
Diagram: "S3 key exists?" No → keep waiting; Yes → Copy…
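In a pipeline definition this might look roughly like the following sketch: an S3KeyExists precondition plus the activity that references it (bucket, key, and ids are illustrative, and in a full definition these objects sit in the pipeline's object list):

{
  "id": "LogsReady",
  "type": "S3KeyExists",
  "s3Key": "s3://example-bucket/logs/ready.flag"
},
{
  "id": "CopyLogs",
  "type": "CopyActivity",
  "precondition": { "ref": "LogsReady" },
  "schedule": { "ref": "Hourly" },
  "input": { "ref": "LogsDataNode" },
  "output": { "ref": "WarehouseDataNode" }
}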
Alerting and exception handling
• Notification
  • On failure
  • On delay
• Automatic retry logic
Diagram: Task 1 and Task 2 each branch on Success/Failure; a Failure raises an Alert
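Notifications map to an SnsAlarm object that activities reference via onFail; a rough sketch (topic ARN, ids, and message are placeholders, and other required activity fields are trimmed for brevity):

{
  "id": "FailureNotify",
  "type": "SnsAlarm",
  "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-alerts",
  "subject": "Pipeline failure",
  "message": "Task #{node.name} failed."
},
{
  "id": "Task1",
  "type": "ShellCommandActivity",
  "command": "echo processing",
  "onFail": { "ref": "FailureNotify" },
  "maximumRetries": "2"
}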
Flexible scheduling
• Choose a schedule
  • Run every: 15 minutes, hour, day, week, etc.
  • User defined
• Backfill support
  • Start pipeline on past date
  • Rapidly backfills to present day
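A schedule is itself a pipeline object. A minimal sketch (dates illustrative): a nightly period whose startDateTime lies in the past, which triggers backfill runs up to the present:

{
  "id": "Nightly",
  "type": "Schedule",
  "period": "24 hours",
  "startDateTime": "2013-01-01T00:00:00"
}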
Massively scalable
• Creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data
• Manages resources in multiple regions
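Compute resources are declared like any other pipeline object and are provisioned and terminated around the work. A minimal sketch (instance type, region, and timeout are illustrative):

{
  "id": "SalesByDayEC2Resource",
  "type": "Ec2Resource",
  "instanceType": "m1.large",
  "region": "us-east-1",
  "terminateAfter": "2 hours"
}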
Easy to get started
• Templates for common use cases
• Graphical interface
• Natively understands CSV and TSV
• Automatically configures Amazon EMR clusters
Inexpensive
• Free tier
• Pay per activity/precondition
• No commitment
• Simple pricing
An ETL example (1 of 2)
• Combine logs in Amazon S3 with customer data in Amazon RDS
• Process using Hive on Amazon EMR
• Put output in Amazon S3
• Load into Amazon Redshift
• Run SQL query and load table for BI tools
An ETL example (2 of 2)
• Run on a schedule (e.g., hourly)
• Use a precondition to make Hive activity depend on Amazon S3 logs being available
• Set up Amazon SNS notification on failure
• Change default retry logic
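Pulling these pieces together, the Hive step of this ETL might look roughly like this sketch (ids, refs, and the one-line script are placeholders, not from the talk):

{
  "id": "ProcessLogs",
  "type": "HiveActivity",
  "schedule": { "ref": "Hourly" },
  "runsOn": { "ref": "EtlEmrCluster" },
  "precondition": { "ref": "LogsReady" },
  "onFail": { "ref": "FailureNotify" },
  "maximumRetries": "1",
  "input": { "ref": "RawLogsS3" },
  "output": { "ref": "ProcessedS3" },
  "hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT * FROM ${input1};"
}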
Swipely
1 TB
How big is your data?
Do you have a big data problem?
Don’t use Hadoop: your data isn’t that big.
Keep your data small and manageable.
Get ahead of your Big Data: don’t wait for data to become a problem
Build novel product features with a batch architecture
Decrease development time by easily backfilling data
Vastly simplify operations with scalable on-demand services
Swipely must innovate by making payments data actionable
and rapidly iterate, deploying multiple times a day
with a lean team: we have 2 ops engineers
Swipely uses AWS Data Pipeline to
build batch analytics,
backfilling all our data,
using resources efficiently.
Fast, dynamic reports by mashing up data from facts.
Generate fast, dynamic reports
AWS Data Pipeline orchestrates building of documents from facts
Diagram: facts are inserted into Transaction Facts; an EMR Data Transformer writes to an Intermediate S3 Bucket; a Data Post-Processor produces Sales by Day Documents; AWS Data Pipeline coordinates each step, with the transform running on EMR
Mash up data for efficient processing

Transactions (merchant, date, card, amount):
Cafe 3/30 4980 $72
Spa 5/11 8278 $140
Cafe 5/11 2472 $57

Sales by Day (via EMR):
Cafe 5/10: $4030, 60 new
Cafe 5/11: $5432, 80 new
Cafe 5/12: $6292, 135 new

Visits (via EMR):
Cafe 2472 5/11: $57, 0 new
Cafe 4980 3/30: $72, 1 new
Cafe 4980 5/11: $49, 0 new

Card Opt-In (card, customer):
2472 Bob
8278 Mary

Customer Spend (via Hive on EMR):
Mary 5/11: $309
4980 5/11: $218
Bob 5/11: $198
Swipely uses AWS Data Pipeline to
build batch analytics,
backfilling all our data,
using resources efficiently.
Regularly rebuild to rapidly iterate, using agile process.
Regularly rebuild to avoid backfilling
Diagram: the web service emits transactions and card opt-in facts into a Fact Store; Analytics Documents are rebuilt daily from the Fact Store, with Recent Activity layered on top between rebuilds
Minor changes require little work
change accounting rules without a migration
Rapidly iterate your product
redefine “best”
Leverage agile development process
Wrap pipeline definition
Quickly diagnose failures
Automate common tasks
Reduce variability
Wrap pipeline definition

{
  "id": "GenerateSalesByDay",
  "type": "EmrActivity",
  "onFail": { "ref": "FailureNotify" },
  "schedule": { "ref": "Nightly" },
  "runsOn": { "ref": "SalesByDayEMRCluster" },
  "dependsOn": { "ref": "GenerateIndexedSwipes" },
  "step": "/.../hadoop-streaming.jar,
    -input, s3n://<%= s3_data_path %>/indexed_swipes.csv,
    -output, s3://<%= s3_data_path %>/sales_by_day,
    -mapper, s3n://<%= s3_code_path %>/sales_by_day_mapper.rb,
    -reducer, s3n://<%= s3_code_path %>/sales_by_day_reducer.rb"
}
With a helper, the same wrapped pipeline definition becomes:

{
  "id": "GenerateSalesByDay",
  "type": "EmrActivity",
  "onFail": { "ref": "FailureNotify" },
  "schedule": { "ref": "Nightly" },
  "runsOn": { "ref": "SalesByDayEMRCluster" },
  "dependsOn": { "ref": "GenerateIndexedSwipes" },
  "step": "<%= streaming_hadoop_step(
    input: '/indexed_swipes.csv',
    output: '/sales_by_day',
    mapper: '/sales_by_day_mapper.rb',
    reducer: '/sales_by_day_reducer.rb'
  ) %>"
}
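The talk does not show the helper itself; here is a minimal Ruby sketch of what streaming_hadoop_step might do, assuming s3_data_path and s3_code_path are defined in the ERB binding (the jar path stays elided as on the slide):

# Hypothetical reconstruction: expand relative paths into full S3 URIs
# and join them into the comma-separated string that EmrActivity's
# "step" field expects.
STREAMING_JAR = "/.../hadoop-streaming.jar" # path elided on the slide

def streaming_hadoop_step(input:, output:, mapper:, reducer:)
  [STREAMING_JAR,
   "-input",   "s3n://#{s3_data_path}#{input}",
   "-output",  "s3://#{s3_data_path}#{output}",
   "-mapper",  "s3n://#{s3_code_path}#{mapper}",
   "-reducer", "s3n://#{s3_code_path}#{reducer}"].join(",")
end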
Reduce variability
No small instances: "coreInstanceType": "m1.large"
Lock versions: "installHive": "0.8.1.8"
Security groups by database: "securityGroups": [ "customerdb" ]
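These settings live on the cluster resource; a rough sketch of how they might sit together (the surrounding object is our assumption, and securityGroups would sit on the Ec2Resource that talks to the database):

{
  "id": "SalesByDayEMRCluster",
  "type": "EmrCluster",
  "masterInstanceType": "m1.large",
  "coreInstanceType": "m1.large",
  "coreInstanceCount": "40",
  "installHive": "0.8.1.8"
}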
Quickly diagnose failures
Turn on logging: "enableDebugging", "logUri", "emrLogUri"
Namespace your logs: "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs"
Log into dev instances: "keyPair"
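As a sketch of where these fields land on the same cluster resource (the keyPair name is a placeholder, and the log URI is built by Ruby string interpolation in the wrapped definition):

{
  "id": "SalesByDayEMRCluster",
  "type": "EmrCluster",
  "enableDebugging": "true",
  "emrLogUri": "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs",
  "keyPair": "dev-keypair"
}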
Automate common tasks
Clean up: "terminateAfter": "6 hours"
Bootstrap your environment:
{
  "id": "BootstrapEnvironment",
  "type": "ShellCommandActivity",
  "scriptUri": ".../bootstrap_ec2.sh",
  "runsOn": { "ref": "SalesByDayEC2Resource" }
}
Swipely uses AWS Data Pipeline to
build batch analytics,
backfilling all our data,
using resources efficiently.
Scale horizontally, backfilling in 50 min, storing all your data.
Scale Amazon EMR pipelines horizontally
Cost vs latency sweet spot at 50 min
Use smallest capable on-demand instance type: fixed hourly cost, no idle time
Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour)
Target < 1 hour: ~10 min runtime variability
Crunch 50 GB facts in 50 min using 40 instances for < $10
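The identity holds because EMR bills by started instance-hour. A tiny Ruby sketch (hypothetical helper, no prices assumed) of why 40 instances for just under an hour bills the same as 1 instance for 40 hours:

# Billing is per started instance-hour, so partial hours round up.
def instance_hours(instances, hours)
  instances * hours.ceil
end

instance_hours(1, 40)       # => 40 instance-hours, 40 hours of latency
instance_hours(40, 50.0/60) # => 40 instance-hours, ~50 minutes of latency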
Store all your data - it’s cheap
Store all your facts in Amazon S3, your source of truth: 50 GB, $5 / month
Store your analytics documents in Amazon RDS for indexed queries: 20 GB, $250 / month
Retain intermediate data in Amazon S3 for diagnosis: 1.1 TB (60 days), $100 / month
Swipely uses AWS Data Pipeline to
build batch analytics,
backfilling all our data,
using resources efficiently.
Please give us your feedback on this presentation. As a thank you, we will select prize winners daily for completed surveys!
BDT207
Thank You