Transcript
Page 1: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

SVC201 - Automate Your Big Data Workflows

Jinesh Varia, Technology Evangelist

@jinman

November 14, 2013

Page 2: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider

Worker

AWS Data Pipeline

Activity Data Node

Worker

Amazon SWF

Automating Big Data Workflows

Automating Compute Automating Data

Page 3: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

DeciderWorker Starters

Activity Worker

Amazon SWF

Activity Worker

AWS Management Console

History

Amazon SWF – Your Distributed State Machine in the Cloud

SWF helps you scale your business logic

Page 4: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Tim James - Data/science Architect, Manager

Vijay Ramesh - Data/science Engineer

Page 5: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

the world's largest petition platform

Page 6: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 7: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 8: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

At Change.org in the last year

• 120M+ signatures, 15% on victories
• 4000 declared victories

Page 9: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 10: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 11: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 12: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

This works.

Page 13: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

How?

Page 14: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

60-90% of signatures at Change.org are driven by email

Page 15: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 16: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

This works.

Page 17: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

This works.*

* up to a point!

Page 18: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale.

Page 19: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

cognitively.

Page 20: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

in personnel.

Page 21: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

into mass customization.

Page 22: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

culturally or internationally.

Page 23: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Manual Targeting doesn’t scale

with data size and load.

Page 24: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

So what did we do?

Page 25: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

We used big-compute machine learning to automatically target our mass emails across each week’s set of campaigns.

Page 26: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

We started from here...

Page 27: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

And finished here...

Page 28: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

First: Incrementally extract (and verify) MySQL data to Amazon S3

Page 29: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Incrementally extract with high watermarking.

(not wall-clock intervals)
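A minimal sketch of high-watermark extraction, assuming an append-only table with a monotonically increasing key. The talk extracts from MySQL; here `sqlite3` stands in so the example is self-contained, and the `signatures` table, its columns, and the helper names are hypothetical.

```python
import sqlite3


def extract_increment(conn, table, watermark_col, last_watermark):
    """Return rows above the stored high watermark, plus the new watermark.

    The table and column names are assumed to come from trusted config.
    """
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} > ? ORDER BY {watermark_col}"
    )
    cur = conn.execute(query, (last_watermark,))
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    if not rows:
        # Nothing new: keep the old watermark so the next run starts there.
        return rows, last_watermark
    new_watermark = rows[-1][cols.index(watermark_col)]
    return rows, new_watermark


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE signatures (id INTEGER PRIMARY KEY, user_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO signatures (id, user_id) VALUES (?, ?)",
        [(1, 10), (2, 11), (3, 12)],
    )
    first, wm = extract_increment(conn, "signatures", "id", 0)
    # A row arrives between runs; only it is picked up next time.
    conn.execute("INSERT INTO signatures (id, user_id) VALUES (4, 13)")
    second, wm2 = extract_increment(conn, "signatures", "id", wm)
    return first, wm, second, wm2
```

Because each run resumes from the last key it actually saw, a delayed or failed run does not skip or duplicate rows the way a fixed wall-clock window can.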

Page 30: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Verify data continuity after extract.

We used Cascading/Amazon EMR + Amazon SNS.

Page 31: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Transform extracted data on S3 into a “Feature Matrix” using Cascading/Hadoop on Amazon Elastic MapReduce

100-instance EMR cluster

Page 32: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A Feature Matrix is just a text file.

Sparse vector file line format, one line per user.

<user_id>[ <feature_id>:<feature_value>]...

Example:

123 12:0.237 18:1 101:0.578
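The line format above can be read and written with a few lines of code. This is a sketch only: it assumes integer user and feature IDs and whitespace-separated tokens, and the function names are illustrative.

```python
def parse_feature_line(line):
    """Parse '<user_id> <feature_id>:<value> ...' into (user_id, {id: value})."""
    parts = line.split()
    user_id = int(parts[0])
    features = {}
    for token in parts[1:]:
        feature_id, value = token.split(":", 1)
        features[int(feature_id)] = float(value)
    return user_id, features


def format_feature_line(user_id, features):
    """Write one sparse vector line, features ordered by feature id."""
    pairs = " ".join(f"{fid}:{features[fid]:g}" for fid in sorted(features))
    return f"{user_id} {pairs}".rstrip()
```

Features that are absent from a line are implicitly zero, which is what keeps the file small when most users have only a few non-zero features.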

Page 33: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

So how do we do

big-compute Machine Learning?

Page 34: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Enter Amazon

• Simple Workflow Service (SWF)
• Elastic Compute Cloud (EC2)

Page 35: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

SWF and EC2 allowed us to decouple:

• Control (and error) flow

• Task business logic

• Compute resource provisioning

Page 36: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

SWF provides a distributed application model

Page 37: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider processes make discrete workflow decisions

Independent task lists (queues) are processed by task list-affined worker processes (thus coupling task types to provisioned resource types)
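The split between deciders and task-list-bound workers can be illustrated with a toy, in-process model. This is not the SWF API: the step names, task list names, and functions below are hypothetical, and a real decider would poll SWF for decision tasks instead of being called in a loop.

```python
from collections import deque

# Hypothetical two-step workflow: each step is bound to its own task list,
# which in turn is served by a worker on a suitable instance type.
STEPS = [
    ("train", "multicore-task-list"),
    ("predict", "single-core-task-list"),
]


def decide(history):
    """Decider: inspect completed steps, return the next (activity, task_list)."""
    done = len(history)
    return STEPS[done] if done < len(STEPS) else None


def run_workflow(workers):
    """Drive one execution. `workers` maps task list name -> {activity: callable}."""
    task_lists = {name: deque() for name in workers}
    history = []
    while True:
        decision = decide(history)
        if decision is None:
            return history
        activity, task_list = decision
        task_lists[task_list].append(activity)
        # Only the worker bound to this task list picks the task up.
        task = task_lists[task_list].popleft()
        result = workers[task_list][task]()
        history.append((task, task_list, result))
```

The decider never runs business logic and the workers never choose what happens next, so either side can be changed or re-provisioned without touching the other.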

SWF provides a distributed application model

Page 38: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Allows deciders and workers to be implemented in any language.  

We used Ruby, with ML calculations done by Python, R, or C.

SWF provides a distributed application model

Page 39: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Rich web interface via the AWS Management Console.

Flexible API for control and monitoring.

SWF provides a distributed application model

Page 40: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Resource Provisioning with EC2

Our EC2 instances each provide service via Simple Workflow Service for a single Feature Matrix file.

Page 41: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Simplifying Assumption:

Full feature matrix file fits on the disk of an m1.medium EC2 instance (although we compute it with a 100-instance EMR cluster)

Page 42: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Treat compute resources as

hotel rooms, not mansions.

Page 43: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Worker EC2 Instance bootstrap from base Amazon Machine Image (AMI)

EC2 instance tags provide highly visible, searchable configuration.

Update local git repo to configured software version.

Page 44: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

EC2 instance tags

Page 45: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Log bootstrap steps to S3, mapping essential config tags to EC2 instance names and log files

Page 46: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Amazon SWF and EC2 allowed us to build a common reliable scaffold for R&D and production Machine Learning systems.

Page 47: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Provisioning in R&D for Training

• Used 100 small EC2 instances to explore the Support Vector Machine (SVM) algorithm to repeatedly brute-force search a 1000-combination parameter space

• Used a 32-core on-premises box to explore a Random Forest implementation in multithreaded Python
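A brute-force search over a 1000-combination space spreads naturally over 100 workers as 10 combinations each. The sketch below shows one way to build and split such a grid; the parameter names and values are placeholders, not the ones used in the talk.

```python
from itertools import product


def build_grid(param_values):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(param_values)
    value_lists = (param_values[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def partition(grid, n_workers):
    """Round-robin split so every worker gets an equal share (within one)."""
    return [grid[i::n_workers] for i in range(n_workers)]


# Hypothetical SVM parameters: 10 x 10 x 10 = 1000 combinations.
EXAMPLE_PARAMS = {
    "C": [2 ** k for k in range(10)],
    "gamma": [10 ** -k for k in range(10)],
    "class_weight": [1 + k for k in range(10)],
}
```

Each chunk could then be submitted as the input of one activity task, so a worker only ever sees its own ten combinations.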

Page 48: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Provisioning in Production

• Train with a single SWF worker using multiple cores (multithreaded Python Random Forest)

• Predict with 8 SWF workers — 1 per core, 4 cores per instance

Start n m3.2xlarge EC2 instances on-demand for each campaign in the sample group

Page 49: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Provisioning in Production

Page 50: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 51: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best Practice:

Use Amazon SWF to decouple and defer crucial provisioning and application design decisions until you’re getting results.

Page 52: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Forward scale

So from here,

how can we expect this system to scale?

Page 53: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Forward scale

• Run more EMR instances to build Feature Matrix

• Run more SWF predict workers per campaign

for 10x users

Page 54: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Forward scale

• already automatically start a SWF worker group per campaign

• “user generated campaigns” require no campaigner time and are targeted automatically

for 10x campaigns

Page 55: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Forward scale

• system eliminates mass email targeting contention, so team can scale

for 2x+ campaigners

Page 56: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Win for our Campaigners... and Users.

Our user base can now be automatically segmented across a wide pool of campaigns, even internationally.

30%+ conversion boost over manual targeting.

Page 57: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 58: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Do you build systems like these? Do you want to?

We’d love to talk. (And yes, we’re hiring.)

Page 59: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

UNSILO

Dr. Francisco Roque, Co-Founder and CTO

Page 60: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A collaborative search platform that helps you see patterns across Science & Innovation

Page 61: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Mission

UNSILO breaks down silos and makes it easy and fast for you to find relevant knowledge written in domain-specific terminologies

Page 62: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Describe Discover Analyze & Share

Unsilo

Page 63: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

New way of searching

Page 64: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Big Data Challenges

4.5 million USPTO granted patents

12 million scientific articles

Heterogeneous processing pipeline

(multiple steps, variable times)

Page 65: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A small test

1000 documents

20 minutes/doc average

Page 66: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A bigger test

100k documents

3.8 years?

Page 67: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A bigger test

100k documents

8x8 cores

~21 days
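The two estimates follow directly from the 20 minutes/doc average. A short worked calculation, assuming "8x8 cores" means 64 cores in total and perfectly parallel processing:

```python
MINUTES_PER_DOC = 20
DOCS = 100_000
CORES = 8 * 8  # assumption: 8 machines with 8 cores each


def serial_years(docs, minutes_per_doc):
    """Wall-clock years to process all documents one at a time."""
    return docs * minutes_per_doc / 60 / 24 / 365


def parallel_days(docs, minutes_per_doc, cores):
    """Wall-clock days when the work is split evenly over `cores`."""
    return docs * minutes_per_doc / 60 / 24 / cores


# 100,000 docs x 20 min = 2,000,000 min, about 1,389 days, about 3.8 years.
# Spread over 64 cores that is about 21.7 days.
```

At the same per-document rate, the full 4.5 million patents would take roughly 45 times longer than the 100k test, which is why the talk turns to elastic scaling.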

Page 68: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

4.5 million patents?

12 million articles?

Page 69: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Focus on the goal

Page 70: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Amazon SWF to the rescue

• Scaling
• Concurrency
• Reliability
• Flexibility to experiment
• Easily adaptable

Page 71: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

SWF makes it very easy to separate algorithmic logic and workflow logic

Page 72: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Easy to get started: First document batch running in just 2 weeks

Page 73: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

AWS services

Page 74: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Adding content

Page 75: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Job Loading

• Content loaded by traversing S3 buckets

• Reprocessing by traversing tables on DynamoDB

DynamoDB

Page 76: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decision Workers

• Crawls Workflow History for Decision Tasks

• Schedules new Activity Tasks

DynamoDB

Page 77: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Activity Workers

• Read/write to S3
• Status in DynamoDB
• SWF task inputs passed between workflow steps
• Specialized workers

DynamoDB

Page 78: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Best practice

Use DynamoDB for content status

Index on different columns (local indexes)

More efficient content status queries

“Give me all the items that completed step X”

Elastic service!

Page 79: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Key to scalability

File organization on S3 for scalability
– 50 req/s naïve approach
– >1500 req/s

logs/2013-11-14T23:01:34/...
logs/2013-11-14T23:01:23/...
logs/2013-11-14T23:01:15/...

43:10:32T41-11-3102/logs/...
32:10:32T41-11-3102/logs/...
51:10:32T41-11-3102/logs/...

http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
http://goo.gl/JnaQZV
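The second set of keys on the slide is the timestamp reversed character by character and moved in front of the `logs/` segment. A sketch of that transformation (the function name and the `prefix` parameter are illustrative):

```python
def reversed_key(timestamp, prefix="logs"):
    """Build an S3 key prefix that starts with the reversed timestamp.

    Reversing puts the fastest-changing characters (seconds) first, so
    keys written close together in time no longer share a long prefix.
    """
    return f"{timestamp[::-1]}/{prefix}/"
```

For example, `reversed_key("2013-11-14T23:01:34")` gives `43:10:32T41-11-3102/logs/`, matching the first reversed key shown above.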

Page 80: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Gearing

Ratio?

Page 81: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Monitoring

Give me all the workers/instances that have not responded in the past hour
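One way to answer that question, assuming each worker records a last-seen timestamp somewhere queryable (the talk keeps status in DynamoDB; a plain dictionary stands in for it here, and the names are hypothetical):

```python
from datetime import datetime, timedelta


def silent_workers(last_seen, now, max_silence=timedelta(hours=1)):
    """Return the ids of workers whose last heartbeat is older than max_silence."""
    return sorted(
        worker_id
        for worker_id, seen in last_seen.items()
        if now - seen > max_silence
    )
```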

Page 82: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Amazon SWF components

DynamoDB

Page 83: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Throttling and eventual consistency

Failed? Try again.
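A common way to "try again" under throttling is exponential backoff with jitter. The helper below is a generic sketch, not code from the talk; real code would catch only the specific throttling error raised by its client library instead of every exception.

```python
import random
import time


def call_with_retry(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** (attempt - 1))
            # Jitter spreads out retries from many workers hitting the same limit.
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter exists only so the delay can be replaced in tests.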

Page 84: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Development environment

Page 85: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Huge benefits

100k Documents

21 days to < 1 hour

4.5 Million USPTO

~30 hours

Page 86: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Huge benefits

Focus on our goal, faster time to market

Using Spot instances, 1/10 cost

Page 87: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Key SWF Takeaways

Flexibility – Room for experimentation

Transparency – Easy to adapt

Growing with the system – Not constrained by the framework

Decider

Worker

Worker

Amazon SWF

Page 88: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

UNSILO

www.unsilo.com

Sign up to be invited for the Public Beta

Page 89: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider

Worker

AWS Data Pipeline

Activity Data Node

Worker

Amazon SWF

Automating Compute Automating Data

Automating Big Data Workflows

Page 90: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Compute Resources

Data Data

Data Stores Data Stores

AWS Data Pipeline – Your ETL in the Cloud

Page 91: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

AWS Data Pipeline Patterns (Activity Workers)

Inter-region ETL, Intra-region ETL, Cloud-On-Prem ETL

[Diagram: example flows between S3, EMR, DynamoDB, RDS, EC2, Redshift, and Hive/Pig]

Page 92: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Fred Benenson, Data Engineer

Page 93: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

A new way to fund creative projects:

All-or-nothing fundraising.

Page 94: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

5.1 million people have backed a project

Page 95: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

51,000+ successful projects

Page 96: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

44% of projects hit their goal

Page 97: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

$872 million pledged

Page 98: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

78% of projects raise under $10,000

51 projects raised more than $1 million

Page 99: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Project case study: Oculus Rift

Page 100: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Data @ Kickstarter

• We have many different data sources

• Some relational data, like MySQL on Amazon RDS

• Other unstructured data like JSON stored in a third-party service like Mixpanel

• What if we want to JOIN between them in Amazon Redshift?

Page 101: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Case study: Find the users that have Page View A but not User Action B

• Page View A is instrumented in Mixpanel, a third-party service whose API we have access to:

{ “Page View A”, { user_uid : 1231567, ... } }

• But User Action B is just the existence of a timestamp in a MySQL row:

6975, User Action B, 1231567, 2012-08-31 21:55:46
6976, User Action B, 9123811, NULL
6977, User Action B, 2913811, NULL

Page 102: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Redshift to the Rescue!

SELECT
  users.id,
  COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL
                      THEN user_actions.id ELSE NULL END) as event_b_count
FROM users
INNER JOIN mixpanel_events
  ON mixpanel_events.user_uid = users.uid
  AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions
  ON user_actions.user_id = users.id
GROUP BY users.id
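To see how this query answers the case study, it can be run against a tiny in-memory dataset. The sketch below uses `sqlite3` as a stand-in for Redshift; the table layouts and sample rows are assumptions made up for the example, and the users of interest are those whose `event_b_count` is 0.

```python
import sqlite3

QUERY = """
SELECT
  users.id,
  COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL
                      THEN user_actions.id ELSE NULL END) AS event_b_count
FROM users
INNER JOIN mixpanel_events
  ON mixpanel_events.user_uid = users.uid
 AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions
  ON user_actions.user_id = users.id
GROUP BY users.id
"""


def viewed_a_without_b(conn):
    """Ids of users who have Page View A but no timestamped User Action B."""
    return [user_id for user_id, count in conn.execute(QUERY) if count == 0]


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical schemas and rows, just enough to exercise the joins.
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, uid INTEGER);
        CREATE TABLE mixpanel_events (user_uid INTEGER, event TEXT);
        CREATE TABLE user_actions (
            id INTEGER PRIMARY KEY, user_id INTEGER, timestamp TEXT);
        INSERT INTO users VALUES (1, 1231567), (2, 9123811), (3, 2913811);
        INSERT INTO mixpanel_events VALUES
            (1231567, 'Page View A'), (9123811, 'Page View A');
        INSERT INTO user_actions VALUES
            (6975, 1, '2012-08-31 21:55:46'),
            (6976, 2, NULL),
            (6977, 3, NULL);
    """)
    return viewed_a_without_b(conn)
```

In this sample, user 1 has both events, user 2 has only the page view, and user 3 is dropped by the inner join because there is no page view at all.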

Page 103: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

How do we automate the data flow to keep it fresh daily?

Page 104: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 105: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 106: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 107: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Page 108: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

But how do we get the data to Redshift?

Page 109: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

This is where AWS Data Pipeline comes in.

Page 110: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Pipeline 1: RDS to Redshift - Step 1

First, we run Sqoop on Elastic MapReduce to extract MySQL tables into CSVs.


Page 111: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Pipeline 1: RDS to Redshift - Step 2

Then we run another Elastic MapReduce streaming job to convert NULLs into empty strings for Redshift.
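The NULL-cleaning step can be sketched as a small line-oriented filter of the kind a streaming job would run. The exact NULL markers in the extracted CSVs are not stated in the talk, so the `null_tokens` default below is an assumption.

```python
import csv
import io


def nulls_to_empty(lines, null_tokens=("NULL", "null", "\\N")):
    """Rewrite CSV lines, replacing NULL marker fields with empty strings."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(lines):
        writer.writerow(["" if field in null_tokens else field for field in row])
    return out.getvalue()
```

As a streaming mapper this would typically be wired up as `sys.stdout.write(nulls_to_empty(sys.stdin))`.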

Page 112: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

• 150 - 200 gigabytes

• New DB every day, drop old tables

• Using AWS Data Pipeline’s 1-day ‘now’ schedule

Pipeline 1: RDS to Redshift - Transfer to S3

Page 113: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Pipeline 1: RDS to Redshift Again

Run a similar pipeline job in parallel for our other database.

Page 114: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Pipeline 2: Mixpanel to Redshift - Step 1

Spin up an EC2 instance to download the day’s data from Mixpanel.

Page 115: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Use Elastic MapReduce to transform Mixpanel’s unstructured JSON into CSVs.

Pipeline 2: Mixpanel to Redshift - Step 2
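A sketch of the JSON-to-CSV step, assuming the exported data is one JSON object per line with an `event` name and a `properties` object. That layout and the chosen column names are assumptions; the slide only shows an abbreviated record.

```python
import csv
import io
import json


def events_to_csv(json_lines, property_names):
    """Flatten one-JSON-object-per-line events into CSV rows.

    Each row is the event name followed by the requested properties,
    with an empty field wherever a property is missing.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        props = record.get("properties", {})
        row = [record.get("event", "")]
        row.extend(props.get(name, "") for name in property_names)
        writer.writerow(row)
    return out.getvalue()
```

Fixing the column list up front gives every row the same shape, which a bulk load into a fixed-schema table requires.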

Page 116: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

• 9-10 GB per day
• Incremental data
• 2.2+ billion events
• Backfilled a year in 7 days

Pipeline 2: Mixpanel to Redshift - Transfer to S3

Page 117: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

• JSON / CLI tools are crucial
• Build scripts to generate JSON
• ShellCommandActivity is powerful
• Really invest time to understand scheduling
• Use S3 as the “transport” layer

AWS Data Pipeline Best Practices
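"Build scripts to generate JSON" might look like the sketch below, which emits one ShellCommandActivity per table. The field names follow the pipeline definition file layout as best recalled and should be checked against the current AWS Data Pipeline documentation; `extract_table.sh`, the object ids, and the start date are hypothetical, and a real definition would need further fields such as roles and instance settings.

```python
import json


def build_pipeline_definition(tables, period="1 day"):
    """Generate a pipeline definition with one shell activity per table."""
    objects = [
        {
            "id": "DailySchedule",
            "name": "DailySchedule",
            "type": "Schedule",
            "period": period,
            "startDateTime": "2013-11-14T00:00:00",
        },
        {
            "id": "Worker",
            "name": "Worker",
            "type": "Ec2Resource",
            "schedule": {"ref": "DailySchedule"},
        },
    ]
    for table in tables:
        objects.append({
            "id": f"Extract_{table}",
            "name": f"Extract_{table}",
            "type": "ShellCommandActivity",
            # Hypothetical script; the real command is not given in the talk.
            "command": f"./extract_table.sh {table}",
            "runsOn": {"ref": "Worker"},
            "schedule": {"ref": "DailySchedule"},
        })
    return json.dumps({"objects": objects}, indent=2)
```

Generating the file from a table list keeps dozens of near-identical activities consistent and makes adding a table a one-line change.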

Page 118: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

AWS Data Pipeline Takeaways for Kickstarter

15 years ago: $1 million or more

5 years ago: Open source + staff & infrastructure

Now: ~$80 a month on AWS

Page 119: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

“It just works”

Page 120: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Decider

Worker

AWS Data Pipeline

Activity Data Node

Worker

Amazon SWF

Automating Compute Automating Data

Automating Big Data Workflows


Page 122: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Big Thank You to Customer Speakers!

Jinesh Varia

@jinman

Page 123: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

More Sessions on SWF and AWS Data Pipeline

SVC101 - 7 Use Cases in 7 Minutes Each : The Power of Workflows and Automation (Next Up in this room)

BDT207 - Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline (Next Up in Sao Paulo 3406)

Page 124: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

SVC201

