Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. SVC201 - Automate Your Big Data Workflows Jinesh Varia, Technology Evangelist @jinman November 14, 2013

Upload: amazon-web-services

Post on 08-Sep-2014


DESCRIPTION

As troves of data grow exponentially, the number of analytical jobs that process the data also grows rapidly. When you have large teams running hundreds of analytical jobs, coordinating and scheduling those jobs becomes crucial. Using Amazon Simple Workflow Service (Amazon SWF) and AWS Data Pipeline, you can create automated, repeatable, schedulable processes that reduce or even eliminate the custom scripting and help you efficiently run your Amazon Elastic MapReduce (Amazon EMR) or Amazon Redshift clusters. In this session, we show how you can automate your big data workflows. Learn best practices from customers like Change.org, KickStarter and UnSilo on how they use AWS to gain business insights from their data in a repeatable and reliable fashion.

TRANSCRIPT

Page 1: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

SVC201 - Automate Your Big Data Workflows

Jinesh Varia, Technology Evangelist

@jinman

November 14, 2013

Page 2: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider

Worker

AWS Data Pipeline

Activity Data Node

Worker

Amazon SWF

Automating Big Data Workflows

Automating Compute Automating Data

Page 3: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider Worker Starters

Activity Worker

Amazon SWF

Activity Worker

AWS Management Console History

Amazon SWF – Your Distributed State Machine in the Cloud

SWF helps you scale your business logic

Page 4: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Tim James - Data/science Architect, Manager

Vijay Ramesh - Data/science Engineer

Page 5: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

the world's largest petition platform

Page 6: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 7: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 8: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

At Change.org in the last year

• 120M+ signatures, 15% on victories

• 4000 declared victories

Page 9: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 10: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 11: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 12: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

This works.

Page 13: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

How?

Page 14: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

60-90% of signatures at Change.org are driven by email

Page 15: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 16: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

This works.

Page 17: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

This works.*

* up to a point!

Page 18: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t

scale.

Page 19: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

cognitively.

Page 20: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

in personnel.

Page 21: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

into mass customization.

Page 22: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

culturally or internationally.

Page 23: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

with data size and load.

Page 24: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

So what did we do?

Page 25: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

We used big-compute machine learning to automatically target our mass emails across each week’s set of campaigns.

Page 26: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

We started from here...

Page 27: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

And finished here...

Page 28: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

First: Incrementally extract (and verify) MySQL data to Amazon S3

Page 29: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Incrementally extract with high watermarking.

(not wall-clock intervals)
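
A minimal sketch of high-watermark extraction, assuming the source table has a monotonically increasing `id` column. The talk extracts from MySQL; `sqlite3` stands in here so the example is self-contained, and the `signatures` table and its columns are hypothetical.

```python
import sqlite3


def extract_increment(conn, last_id):
    """Return rows with id above the stored high watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload FROM signatures WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = rows[-1][0] if rows else last_id
    return rows, new_watermark


# Hypothetical table standing in for the production MySQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signatures (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO signatures VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

first_batch, watermark = extract_increment(conn, 0)
conn.execute("INSERT INTO signatures VALUES (4, 'd')")
second_batch, watermark = extract_increment(conn, watermark)
```

Because each run resumes from the last id actually extracted rather than from a clock time, a late or failed run neither skips rows nor extracts them twice.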

Page 30: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Verify data continuity after extract.

We used Cascading/Amazon EMR + Amazon SNS.

Page 31: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Transform extracted data on S3 into “Feature Matrix” using Cascading/Hadoop on Amazon Elastic MapReduce

100-instance EMR cluster

Page 32: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A Feature Matrix is just a text file.

Sparse vector file line format, one line per user.

<user_id>[ <feature_id>:<feature_value>]...

Example:

123 12:0.237 18:1 101:0.578
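
The line format above could be parsed along these lines. This is a sketch that assumes user and feature ids are integers and values are numeric, which the example line suggests but the slide does not state.

```python
def parse_feature_line(line):
    """Parse '<user_id>[ <feature_id>:<feature_value>]...' into (user_id, features)."""
    fields = line.split()
    user_id = int(fields[0])
    features = {}
    for field in fields[1:]:
        feature_id, value = field.split(":", 1)
        features[int(feature_id)] = float(value)
    return user_id, features


user_id, features = parse_feature_line("123 12:0.237 18:1 101:0.578")
```

Only non-zero features appear on a line, so a dict per user keeps the sparse representation intact.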

Page 33: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

So how do we do

big-compute Machine Learning?

Page 34: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Enter Amazon

• Simple Workflow Service (SWF)

• Elastic Compute Cloud (EC2)

Page 35: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

SWF and EC2 allowed us to decouple:

• Control (and error) flow

• Task business logic

• Compute resource provisioning

Page 36: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

SWF provides a distributed application model

Page 37: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider processes make discrete workflow decisions

Independent task lists (queues) are processed by task list-affined worker processes (thus coupling task types to provisioned resource types)

SWF provides a distributed application model

Page 38: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Allows deciders and workers to be implemented in any language.  

We used Ruby, with ML calculations done by Python, R, or C.

SWF provides a distributed application model

Page 39: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Rich web interface via the AWS Management Console.

Flexible API for control and monitoring.

SWF provides a distributed application model

Page 40: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Resource Provisioning with EC2

Our EC2 instances each provide service via Simple Workflow Service for a single Feature Matrix file.

Page 41: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Simplifying Assumption:

The full feature matrix file fits on the disk of an m1.medium EC2 instance (although we compute it with a 100-instance EMR cluster)

Page 42: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Treat compute resources as

hotel rooms, not mansions.

Page 43: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Worker EC2 Instance bootstrap from base Amazon Machine Image (AMI)

EC2 instance tags provide highly visible, searchable configuration.

Update local git repo to configured software version.

Page 44: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

EC2 instance tags

Page 45: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Log bootstrap steps to S3, mapping essential config tags to EC2 instance names and log files

Page 46: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Amazon SWF and EC2 allowed us to build a common reliable scaffold for R&D and production Machine Learning systems.

Page 47: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Provisioning in R&D for Training

• Used 100 small EC2 instances to explore the Support Vector Machine (SVM) algorithm to repeatedly brute-force search a 1000-combination parameter space

• Used a 32-core on-premises box to explore a Random Forest implementation in multithreaded Python

Page 48: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Provisioning in Production

• Train with single SWF worker using multiple cores (python multithreaded Random Forest)

• Predict with 8 SWF workers — 1 per core, 4 cores per instance

Start n m3.2xlarge EC2 instances on-demand for each campaign in the sample group

Page 49: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Provisioning in Production

Page 50: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 51: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Use Amazon SWF to decouple and defer crucial provisioning and application design decisions until you’re getting results.

Page 52: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Forward scale

So from here,

how can we expect this system to scale?

Page 53: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Forward scale

• Run more EMR instances to build Feature Matrix

• Run more SWF predict workers per campaign

for 10x users

Page 54: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Forward scale

• already automatically start a SWF worker group per campaign

• “user generated campaigns” require no campaigner time and are targeted automatically

for 10x campaigns

Page 55: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Forward scale

• system eliminates mass email targeting contention, so team can scale

for 2x+ campaigners

Page 56: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Win for our Campaigners... and Users.

Our user base can now be automatically segmented across a wide pool of campaigns, even internationally.

30%+ conversion boost over manual targeting.

Page 57: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 58: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Do you build systems like these? Do you want to?

We’d love to talk. (And yes, we’re hiring.)

Page 59: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

UNSILO

Dr. Francisco Roque, Co-Founder and CTO

Page 60: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A collaborative search platform that helps you see patterns across Science & Innovation

Page 61: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Mission

UNSILO breaks down silos and makes it easy and fast for you to find relevant knowledge written in domain-specific

terminologies

Page 62: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Describe Discover Analyze & Share

Unsilo

Page 63: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

New way of searching

Page 64: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Big Data Challenges

4.5 million USPTO granted patents

12 million scientific articles

Heterogeneous processing pipeline

(multiple steps, variable times)

Page 65: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A small test

1000 documents

20 minutes/doc average

Page 66: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A bigger test

100k documents

3.8 years?

Page 67: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A bigger test

100k documents

8x8 cores

~21 days
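
The estimates on these slides follow from simple arithmetic. The sketch below assumes that "8x8 cores" means 64 cores in total and that processing scales linearly with core count, which is an idealization.

```python
DOCS = 100_000
MINUTES_PER_DOC = 20  # average from the 1000-document test
CORES = 8 * 8         # assumed reading of "8x8 cores"

total_minutes = DOCS * MINUTES_PER_DOC
serial_days = total_minutes / 60 / 24  # one document at a time
serial_years = serial_days / 365
parallel_days = serial_days / CORES    # perfect linear scaling assumed

print(f"serial: {serial_years:.1f} years, on {CORES} cores: {parallel_days:.1f} days")
```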

Page 68: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

4.5 million patents?

12 million articles?

Page 69: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Focus on the goal

Page 70: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Amazon SWF to the rescue

• Scaling

• Concurrency

• Reliability

• Flexibility to experiment

• Easily adaptable

Page 71: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

SWF makes it very easy to separate algorithmic logic and workflow logic

Page 72: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Easy to get started: First document batch running in just 2 weeks

Page 73: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

AWS services

Page 74: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Adding content

Page 75: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Job Loading

• Content loaded by traversing S3 buckets

• Reprocessing by traversing tables on DynamoDB

DynamoDB

Page 76: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decision Workers

• Crawl the Workflow History for Decision Tasks

• Schedule new Activity Tasks

DynamoDB
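
A toy sketch of what a decision worker does with the workflow history, written as pure Python with no AWS calls. The event and decision names `ActivityTaskScheduled`, `ActivityTaskCompleted`, `ScheduleActivityTask` and `CompleteWorkflowExecution` mirror SWF's vocabulary. Everything else is a simplifying assumption: the event dictionaries, the `"Wait"` placeholder (a real decider would return an empty decision list), and the step names `parse`, `annotate` and `index`.

```python
def decide(history, steps):
    """Return the next decision for a linear workflow, given its event history."""
    completed = [e["activity"] for e in history if e["type"] == "ActivityTaskCompleted"]
    scheduled = [e["activity"] for e in history if e["type"] == "ActivityTaskScheduled"]
    for step in steps:
        if step in completed:
            continue
        if step in scheduled:
            # The step is in flight, so there is nothing new to schedule yet.
            return {"decision": "Wait"}
        return {"decision": "ScheduleActivityTask", "activity": step}
    return {"decision": "CompleteWorkflowExecution"}


STEPS = ["parse", "annotate", "index"]  # hypothetical pipeline steps
history = [
    {"type": "WorkflowExecutionStarted"},
    {"type": "ActivityTaskScheduled", "activity": "parse"},
    {"type": "ActivityTaskCompleted", "activity": "parse"},
]
next_decision = decide(history, STEPS)
```

The decider holds no state of its own. Each decision is recomputed from the history, which is what lets the workflow logic stay separate from the activity code.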

Page 77: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Activity Workers

• Read/write to S3

• Status in DynamoDB

• SWF task inputs passed between workflow steps

• Specialized workers

DynamoDB

Page 78: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best practice

Use DynamoDB for content status

Index on different columns (local indexes)

More efficient content status queries

Give me all the items that completed step X

Elastic service!

Page 79: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Key to scalability

File organization on S3 for scalability

– 50 req/s with the naïve approach

– >1500 req/s

Naïve keys:

logs/2013-11-14T23:01:34/...
logs/2013-11-14T23:01:23/...
logs/2013-11-14T23:01:15/...

Reversed keys:

43:10:32T41-11-3102/logs/...
32:10:32T41-11-3102/logs/...
51:10:32T41-11-3102/logs/...

http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html

http://goo.gl/JnaQZV
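
The key transformation shown on this slide amounts to reversing the timestamp and moving it to the front of the key. A minimal sketch follows; the trailing `part-0000` object name is hypothetical, since the slide elides everything after the timestamp.

```python
def reversed_prefix_key(key):
    """Turn 'logs/<timestamp>/<rest>' into '<reversed timestamp>/logs/<rest>'."""
    folder, timestamp, rest = key.split("/", 2)
    return f"{timestamp[::-1]}/{folder}/{rest}"


key = reversed_prefix_key("logs/2013-11-14T23:01:34/part-0000")
print(key)
```

Reversing puts the fastest-changing characters (the seconds) at the start of the key, so consecutive writes no longer share a long common prefix.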

Page 80: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Gearing

Ratio?

Page 81: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Monitoring

Give me all the workers/instances that have not responded in the past hour
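
A query like this can be sketched as a filter over last-heartbeat timestamps. The talk does not say where heartbeats are recorded, so the in-memory dictionary and the worker names below are assumptions for illustration.

```python
from datetime import datetime, timedelta


def stale_workers(last_seen, now, max_silence=timedelta(hours=1)):
    """Return the ids of workers whose last heartbeat is older than max_silence."""
    return sorted(w for w, seen in last_seen.items() if now - seen > max_silence)


now = datetime(2013, 11, 14, 23, 0, 0)
last_seen = {  # hypothetical heartbeat records
    "worker-a": now - timedelta(minutes=5),
    "worker-b": now - timedelta(hours=2),
    "worker-c": now - timedelta(minutes=61),
}
stale = stale_workers(last_seen, now)
```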

Page 82: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Amazon SWF components

DynamoDB

Page 83: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Throttling and eventual consistency

Failed? Try again
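
One common way to "try again" against a throttled service is to retry with exponential backoff and jitter. The slide only says to retry, so the backoff policy, the parameter values and the `flaky` operation below are assumptions rather than the speaker's implementation.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure, wait with exponential backoff plus jitter and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # Real code should catch only throttling or transient errors.
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


calls = {"n": 0}


def flaky():
    """Hypothetical operation that is throttled twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"


waits = []
result = call_with_retries(flaky, sleep=waits.append)  # record delays instead of sleeping
```

The same loop also covers eventual consistency: a read that does not yet see a recent write can raise and be retried after a short delay.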

Page 84: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Development environment

Page 85: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Huge benefits

100k documents: from 21 days to < 1 hour

4.5 million USPTO patents: ~30 hours

Page 86: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Huge benefits

Focus on our goal, faster time to market

Using Spot instances, 1/10 cost

Page 87: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Key SWF Takeaways

Flexibility – Room for experimentation

Transparency – Easy to adapt

Growing with the system – Not constrained by the framework

Decider

Worker

Worker

Amazon SWF

Page 88: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

UNSILO

www.unsilo.com

Sign up to be invited for the Public Beta

Page 89: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider

Worker

AWS Data Pipeline

Activity Data Node

Worker

Amazon SWF

Automating Compute Automating Data

Automating Big Data Workflows

Page 90: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Compute Resources

Data Data

Data Stores Data Stores

AWS Data Pipeline Your ETL in the Cloud

Page 91: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Inter-region ETL

S3 EMR S3 DynamoDB EMR S3 S3 RDS EC2

S3 Redshift EMR DynamoDB EMR DynamoDB S3 Hive/Pig Redshift

Intra-region ETL Cloud-On-Prem ETL

AWS Data Pipeline Patterns (ActivityWorkers)

Page 92: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Fred Benenson, Data Engineer

Page 93: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A new way to fund creative projects:

All-or-nothing fundraising.

Page 94: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

5.1 million people have backed a project

Page 95: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

51,000+ successful projects

Page 96: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

44% of projects hit their goal

Page 97: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

$872 million pledged

Page 98: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

78% of projects raise under $10,000

51 projects raised more than $1 million

Page 99: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Project case study: Oculus Rift

Page 100: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Data @ Kickstarter

• We have many different data sources

• Some relational data, like MySQL on Amazon RDS

• Other unstructured data like JSON stored in a third-party service like Mixpanel

• What if we want to JOIN between them in Amazon Redshift?

Page 101: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Case study: Find the users that have Page View A but not User Action B

• Page View A is instrumented in Mixpanel, a third-party service whose API we have access to:

{ “Page View A”, { user_uid : 1231567, ... } }

• But User Action B is just the existence of a timestamp in a MySQL row:

6975, User Action B, 1231567, 2012-08-31 21:55:46
6976, User Action B, 9123811, NULL
6977, User Action B, 2913811, NULL

Page 102: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Redshift to the Rescue!

SELECT
  users.id,
  COUNT(
    DISTINCT CASE
      WHEN user_actions.timestamp IS NOT NULL THEN user_actions.id
      ELSE NULL
    END
  ) AS event_b_count
FROM users
INNER JOIN mixpanel_events
  ON mixpanel_events.user_uid = users.uid
  AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions
  ON user_actions.user_id = users.id
GROUP BY users.id
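
To see what the query above returns, it can be run against a few sample rows. This sketch uses `sqlite3` in place of Redshift. The table layouts are assumptions inferred from the query, and for simplicity each user's `id` is set equal to their `uid`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas and rows, loosely based on the case-study slide.
conn.executescript("""
CREATE TABLE users (id INTEGER, uid INTEGER);
CREATE TABLE mixpanel_events (user_uid INTEGER, event TEXT);
CREATE TABLE user_actions (id INTEGER, name TEXT, user_id INTEGER, timestamp TEXT);
INSERT INTO users VALUES (1231567, 1231567), (9123811, 9123811), (2913811, 2913811);
INSERT INTO mixpanel_events VALUES (1231567, 'Page View A'), (9123811, 'Page View A');
INSERT INTO user_actions VALUES
  (6975, 'User Action B', 1231567, '2012-08-31 21:55:46'),
  (6976, 'User Action B', 9123811, NULL),
  (6977, 'User Action B', 2913811, NULL);
""")

QUERY = """
SELECT users.id,
       COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL
                           THEN user_actions.id ELSE NULL END) AS event_b_count
FROM users
INNER JOIN mixpanel_events
  ON mixpanel_events.user_uid = users.uid
 AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions ON user_actions.user_id = users.id
GROUP BY users.id
"""

counts = dict(conn.execute(QUERY).fetchall())
# Users who viewed Page A but never completed User Action B.
viewed_a_not_b = sorted(uid for uid, n in counts.items() if n == 0)
```

The inner join keeps only users with a Page View A event. The left join plus the CASE expression then counts only the User Action B rows that carry a timestamp, so a count of zero marks the users the case study is looking for.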

Page 103: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

How do we automate the data flow to keep it fresh daily?

Page 104: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 105: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 106: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 107: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 108: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

But how do we get the data to Redshift?

Page 109: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

This is where AWS Data Pipeline comes in.

Page 110: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Pipeline 1: RDS to Redshift - Step 1

First, we run sqoop on Elastic MapReduce to extract MySQL tables into CSVs.

AWS

Page 111: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Pipeline 1: RDS to Redshift - Step 2

Then we run another Elastic MapReduce streaming job to convert NULLs into empty strings for Redshift.
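
A streaming step like this could be a small per-line filter. The sketch below assumes that NULL values arrive as the literal field `NULL` (or `null`) in comma-separated lines. The actual token depends on how the extract was configured, which the slide does not show.

```python
import csv
import io


def null_to_empty(line, null_tokens=("NULL", "null")):
    """Rewrite one CSV line so that fields equal to a NULL token become empty."""
    row = next(csv.reader([line]))
    cleaned = ["" if field in null_tokens else field for field in row]
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(cleaned)
    return out.getvalue()


converted = null_to_empty("6976,User Action B,9123811,NULL")
print(converted)
```

In a Hadoop streaming job, the mapper would apply `null_to_empty` to each line read from standard input and print the result.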

Page 112: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

• 150 - 200 gigabytes

• New DB every day, drop old tables

• Using AWS Data Pipeline’s 1-day ‘now’ schedule

Pipeline 1: RDS to Redshift - Transfer to S3

Page 113: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Pipeline 1: RDS to Redshift Again

Run a similar pipeline job in parallel for our other database.

Page 114: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Pipeline 2: Mixpanel to Redshift - Step 1

Spin up an EC2 instance to download the day’s data from Mixpanel.

Page 115: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Use Elastic MapReduce to transform Mixpanel’s unstructured JSON into CSVs.

Pipeline 2: Mixpanel to Redshift - Step 2
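
The JSON-to-CSV transformation can be sketched as a function applied to one event per line. The record shape used here (an `event` name plus a `properties` object) and the column list are assumptions, since the slides show only an abbreviated event.

```python
import csv
import io
import json

COLUMNS = ["event", "user_uid", "time"]  # hypothetical column set


def event_to_csv(json_line):
    """Flatten one JSON event into a CSV line with a fixed column order."""
    record = json.loads(json_line)
    props = record.get("properties", {})
    # Missing properties become empty fields so every row has the same width.
    row = [record.get("event", "")] + [props.get(col, "") for col in COLUMNS[1:]]
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(row)
    return out.getvalue()


line = '{"event": "Page View A", "properties": {"user_uid": 1231567}}'
csv_line = event_to_csv(line)
```

Fixing the column order up front is what lets schemaless events load into a table with a fixed schema.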

Page 116: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

• 9-10 GB per day

• Incremental data

• 2.2+ billion events

• Backfilled a year in 7 days

Pipeline 2: Mixpanel to Redshift - Transfer to S3

Page 117: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

AWS Data Pipeline Best Practices

• JSON / CLI tools are crucial

• Build scripts to generate JSON

• ShellCommandActivity is powerful

• Really invest time to understand scheduling

• Use S3 as the “transport” layer
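
As an illustration of "build scripts to generate JSON", the sketch below emits a pipeline definition with one ShellCommandActivity per table. The object types and field names follow the general shape of a Data Pipeline definition file and should be checked against the current documentation. The ids, table names, start date and `./extract.sh` command are hypothetical.

```python
import json


def build_pipeline(tables, command_template):
    """Build a pipeline-definition dict with one ShellCommandActivity per table."""
    objects = [
        {"id": "DailySchedule", "type": "Schedule", "period": "1 day",
         "startDateTime": "2013-11-14T00:00:00"},
        {"id": "Worker", "type": "Ec2Resource", "schedule": {"ref": "DailySchedule"}},
    ]
    for table in tables:
        objects.append({
            "id": f"Extract_{table}",
            "type": "ShellCommandActivity",
            "command": command_template.format(table=table),
            "runsOn": {"ref": "Worker"},
            "schedule": {"ref": "DailySchedule"},
        })
    return {"objects": objects}


definition = build_pipeline(["users", "user_actions"], "./extract.sh {table}")
pipeline_json = json.dumps(definition, indent=2)
```

Generating the definition from a table list keeps repeated activities consistent and makes adding a table a one-line change.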

Page 118: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

AWS Data Pipeline Takeaways for Kickstarter

15 years ago: $1 million or more

5 years ago: Open source + staff & infrastructure

Now: ~$80 a month on AWS

Page 119: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

“It just works”

Page 120: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider

Worker

AWS Data Pipeline

Activity Data Node

Worker

Amazon SWF

Automating Compute Automating Data

Automating Big Data Workflows

Page 121: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider

Worker

AWS Data Pipeline

Activity Data Node

Worker

Amazon SWF

Automating Compute Automating Data

Automating Big Data Workflows

Page 122: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Big Thank You to Customer Speakers!

Jinesh Varia

@jinman

Page 123: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

More Sessions on SWF and AWS Data Pipeline

SVC101 - 7 Use Cases in 7 Minutes Each : The Power of Workflows and Automation (Next Up in this room)

BDT207 - Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline (Next Up in Sao Paulo 3406)

Page 124: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

SVC201