big data architectural patterns

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Arie Leeuwesteijn, Solutions Architect, AWS May 2016 Big Data Architectural Patterns and Best Practices on AWS

Upload: amazon-web-services

Post on 21-Apr-2017


TRANSCRIPT

Page 1: Big Data Architectural Patterns

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Arie Leeuwesteijn, Solutions Architect, AWS

May 2016

Big Data Architectural Patterns and Best Practices on AWS

Page 2: Big Data Architectural Patterns

Agenda

Big data challenges
How to simplify big data processing
What technologies should you use?
• Why?
• How?
Reference architecture
Sanoma use case
Design patterns

Page 3: Big Data Architectural Patterns

Ever Increasing Big Data

Volume

Velocity

Variety

Page 4: Big Data Architectural Patterns

Big Data Evolution

Batch processing

Stream processing

Machine learning

Page 5: Big Data Architectural Patterns

Plethora of Tools

Amazon Glacier
S3
DynamoDB
RDS
EMR
Amazon Redshift
Data Pipeline
Amazon Kinesis
Kinesis-enabled app
Lambda
ML
SQS
ElastiCache
DynamoDB Streams
Amazon Elasticsearch Service

Page 6: Big Data Architectural Patterns

Big Data Challenges

Is there a reference architecture?

What tools should I use?

How?

Why?

Page 7: Big Data Architectural Patterns

Architectural Principles

Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers

Use the right tool for the job
• Data structure, latency, throughput, access patterns

Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer

Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin

Big data ≠ big cost

Page 8: Big Data Architectural Patterns

Simplify Big Data Processing

COLLECT STORE PROCESS/ANALYZE CONSUME

Time to answer (Latency)

Throughput

Cost

Page 9: Big Data Architectural Patterns

COLLECT

Page 10: Big Data Architectural Patterns

Types of Data

• Database records
• Search documents
• Log files
• Messaging events
• Devices / sensors / IoT stream

[Diagram: COLLECT and STORE stages]
• Sources: mobile apps, web apps, data centers (AWS Direct Connect), AWS Import/Export Snowball, logging (Amazon CloudWatch, AWS CloudTrail), messaging, devices, sensors & IoT platforms (AWS IoT)
• Source groups: Applications, Logging, Transport, Messaging, IoT
• Data type → store: RECORDS → Database; DOCUMENTS → Search; FILES → File store; MESSAGES → Queue; STREAMS → Stream storage

Page 11: Big Data Architectural Patterns

Store

Page 12: Big Data Architectural Patterns

Message & Stream Storage

Amazon SQS
• Managed message queue service

Apache Kafka
• High throughput distributed messaging system

Amazon Kinesis Streams
• Managed stream storage + processing

Amazon Kinesis Firehose
• Managed data delivery

Amazon DynamoDB
• Managed NoSQL database
• Tables can be stream-enabled

[Diagram: COLLECT and STORE stages, with Amazon SQS as the Queue store and Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams as Stream stores]

Page 13: Big Data Architectural Patterns

Why Stream Storage?

• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Streaming MapReduce
• Parallel consumption

[Diagram: records numbered 1–4 in four colors are distributed over shard 1 / partition 1 and shard 2 / partition 2 of a DynamoDB stream, Amazon Kinesis stream, or Kafka topic, keeping their order. Consumer 1 reports count of red = 4 and count of violet = 4; Consumer 2 reports count of blue = 4 and count of green = 4.]
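The shard diagram can be illustrated with a small in-memory sketch. It is not the Kinesis, Kafka, or DynamoDB Streams API: the `shard_for`, `produce`, and `consume` helpers and the MD5-modulo mapping are simplifying assumptions chosen to show how a stable hash of the partition key keeps each key's records in order on one shard while shards are consumed in parallel.

```python
import hashlib
from collections import Counter, defaultdict

NUM_SHARDS = 2  # two shards/partitions, as on the slide


def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard with a stable hash (simplified)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def produce(records):
    """Append records to per-shard logs, preserving arrival order."""
    shards = defaultdict(list)
    for color, seq in records:
        shards[shard_for(color)].append((color, seq))
    return shards


def consume(shard_log):
    """One consumer per shard: count records per key (the 'reduce' step)."""
    return Counter(color for color, _ in shard_log)


# Hypothetical input: four colors, each emitting records 1..4, interleaved.
colors = ["red", "violet", "blue", "green"]
records = [(color, seq) for seq in range(1, 5) for color in colors]

shards = produce(records)
counts = {shard_id: consume(log) for shard_id, log in shards.items()}
for shard_id, counter in sorted(counts.items()):
    print(f"consumer for shard {shard_id}: {dict(counter)}")
```

Which colors land on which shard depends on the hash, so the split may differ from the slide; what holds is that every color is counted on exactly one shard and its records stay in sequence.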

Page 14: Big Data Architectural Patterns

What About Queues & Pub/Sub?

• Decouple producers & consumers/subscribers
• Persistent buffer
• Collect multiple streams
• No client ordering
• No parallel consumption for Amazon SQS
• Amazon SNS can route to multiple queues or Lambda functions
• No streaming MapReduce

[Diagram: producers send to an Amazon SQS queue read by consumers; producers publish to an Amazon SNS topic, whose subscribers are an Amazon SQS queue and an AWS Lambda function]
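The difference between a queue and a pub/sub topic can be sketched in memory. The `Topic` class and the queue names below are hypothetical stand-ins, not the SNS or SQS API: a topic copies each message to every subscriber, while a queue hands each message to a single consumer.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a pub/sub topic (hypothetical, not the SNS API)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, deliver):
        self.subscribers.append(deliver)

    def publish(self, message):
        # Every subscriber receives its own copy of the message.
        for deliver in self.subscribers:
            deliver(message)


billing_queue = deque()    # stands in for one queue
audit_queue = deque()      # stands in for a second queue
handled_by_function = []   # stands in for a function invocation

topic = Topic()
topic.subscribe(billing_queue.append)
topic.subscribe(audit_queue.append)
topic.subscribe(lambda msg: handled_by_function.append(msg.upper()))

for event in ["order-1", "order-2"]:
    topic.publish(event)

# A queue hands each message to a single consumer: once taken, it is gone.
first = billing_queue.popleft()
print(first, list(billing_queue), list(audit_queue), handled_by_function)
```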

Page 15: Big Data Architectural Patterns

What Stream Storage should I use?

| | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS |
|---|---|---|---|---|---|
| AWS managed service | Yes | Yes | Yes | No | Yes |
| Guaranteed ordering | Yes | Yes | Yes | Yes | No |
| Delivery | exactly-once | at-least-once | exactly-once | at-least-once | at-least-once |
| Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days |
| Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ |
| Scale / throughput | No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic |
| Parallel clients | Yes | Yes | No | Yes | No |
| Stream MapReduce | Yes | Yes | N/A | Yes | N/A |
| Record/object size | 400 KB | 1 MB | Amazon Redshift row size | Configurable | 256 KB |
| Cost | Higher (table cost) | Low | Low | Low (+admin) | Low-medium |

Data temperature: Hot to Warm
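As a rough illustration of reading the comparison above, the yes/no rows can be encoded as data and filtered by requirement. The `STREAM_STORES` dictionary and the `candidates` helper are hypothetical names; the values are copied from the slide's table and reflect the services as of 2016.

```python
# Rows copied from the comparison table above (True = "Yes").
STREAM_STORES = {
    "Amazon DynamoDB Streams": {"managed": True, "ordering": True, "parallel_clients": True},
    "Amazon Kinesis Streams": {"managed": True, "ordering": True, "parallel_clients": True},
    "Amazon Kinesis Firehose": {"managed": True, "ordering": True, "parallel_clients": False},
    "Apache Kafka": {"managed": False, "ordering": True, "parallel_clients": True},
    "Amazon SQS": {"managed": True, "ordering": False, "parallel_clients": False},
}


def candidates(**required):
    """Return the stores whose table entries match every required feature."""
    return [
        name
        for name, features in STREAM_STORES.items()
        if all(features[key] == value for key, value in required.items())
    ]


managed_ordered_parallel = candidates(managed=True, ordering=True, parallel_clients=True)
ordered_parallel = candidates(ordering=True, parallel_clients=True)
print(managed_ordered_parallel)
print(ordered_parallel)
```

Dropping the managed-service requirement adds Apache Kafka to the result, which matches the table's "Low (+admin)" cost note.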

Page 16: Big Data Architectural Patterns

File Storage

[Diagram: COLLECT and STORE stages, now with Amazon S3 as the File store, Amazon SQS as the Message store, and Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams as Stream stores (hot)]

Page 17: Big Data Architectural Patterns

Why is Amazon S3 Good for Big Data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Multiple distinct (Spark, Hive, Presto) clusters can use the same data
• Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policy
• Secure – SSL, client/server-side encryption at rest
• Low cost

Page 18: Big Data Architectural Patterns

What about HDFS & Amazon Glacier?

• Use HDFS for very frequently accessed (hot) data

• Use Amazon S3 Standard for frequently accessed data

• Use Amazon S3 Standard – IA for infrequently accessed data

• Use Amazon Glacier for archiving cold data
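The Standard → Standard-IA → Glacier tiering can be expressed as an S3 lifecycle rule. The sketch below only builds the configuration dictionary in the shape boto3's `put_bucket_lifecycle_configuration` accepts; the `tiering_rule` helper, the `logs/` prefix, the day counts, and the bucket name are assumptions for illustration, not values from the talk.

```python
def tiering_rule(prefix, ia_after_days, glacier_after_days):
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "ID": "tier-warm-then-cold",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


# Hypothetical values: the prefix and day counts are assumptions.
lifecycle_configuration = {"Rules": [tiering_rule("logs/", 30, 90)]}

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )
print(lifecycle_configuration)
```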

Page 19: Big Data Architectural Patterns

Cache, database, search

[Diagram: COLLECT and STORE stages, with the STORE stage extended by Amazon Elasticsearch Service (Search), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), and Amazon ElastiCache (Cache), alongside Amazon S3 (File), Amazon SQS (Message), and the stream stores Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams (hot)]

Page 20: Big Data Architectural Patterns

Database Anti-pattern

[Diagram: a data tier consisting of a single database tier]

Page 21: Big Data Architectural Patterns

Best Practice - Use the Right Tool for the Job

Data tier

• Search: Amazon Elasticsearch Service
• Cache: Amazon ElastiCache (Redis, Memcached)
• SQL: Amazon Aurora, MySQL, PostgreSQL, Oracle, SQL Server, MariaDB
• NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB

[Diagram label: Database tier]

Page 22: Big Data Architectural Patterns

Materialized Views

Amazon Elasticsearch Service

KCL

Page 23: Big Data Architectural Patterns

What Data Store Should I Use?

Data structure → Fixed schema, JSON, key-value

Access patterns → Store data in the format you will access it

Data / access characteristics → Hot, warm, cold

Cost → Right cost

Page 24: Big Data Architectural Patterns

Data Structure and Access Patterns

| Access Patterns | What to use? |
|---|---|
| Put/Get (key, value) | Cache, NoSQL |
| Simple relationships → 1:N, M:N | NoSQL |
| Cross table joins, transaction, SQL | SQL |
| Faceting, search, free text | Search |

| Data Structure | What to use? |
|---|---|
| Fixed schema | SQL, NoSQL |
| Schema-free (JSON) | NoSQL, Search |
| (Key, value) | Cache, NoSQL |
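One way to read the two tables together is to intersect their recommendations. The sketch below is a minimal illustration; the dictionary names and the `suggest_store` helper are hypothetical, and the entries are copied from the slide.

```python
ACCESS_PATTERN_TO_STORE = {
    "put/get (key, value)": ["Cache", "NoSQL"],
    "simple relationships (1:N, M:N)": ["NoSQL"],
    "cross table joins, transaction, SQL": ["SQL"],
    "faceting, search, free text": ["Search"],
}

DATA_STRUCTURE_TO_STORE = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free (JSON)": ["NoSQL", "Search"],
    "(key, value)": ["Cache", "NoSQL"],
}


def suggest_store(access_pattern, data_structure):
    """Store types recommended by both tables, in access-pattern order."""
    by_structure = DATA_STRUCTURE_TO_STORE[data_structure]
    return [s for s in ACCESS_PATTERN_TO_STORE[access_pattern] if s in by_structure]


search_case = suggest_store("faceting, search, free text", "schema-free (JSON)")
kv_case = suggest_store("put/get (key, value)", "(key, value)")
conflict_case = suggest_store("cross table joins, transaction, SQL", "schema-free (JSON)")
print(search_case, kv_case, conflict_case)
```

An empty result, as in the last call, signals that the access pattern and the data structure pull toward different store types, so one of them would have to change or the data be kept in two stores.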

Page 25: Big Data Architectural Patterns

What Is the Temperature of Your Data / Access ?

Page 26: Big Data Architectural Patterns

Data / Access Characteristics: Hot, Warm, Cold

| | Hot | Warm | Cold |
|---|---|---|---|
| Volume | MB–GB | GB–TB | PB |
| Item size | B–KB | KB–MB | KB–TB |
| Latency | ms | ms, sec | min, hrs |
| Durability | Low – high | High | Very high |
| Request rate | Very high | High | Low |
| Cost/GB | $$-$ | $-¢¢ | ¢ |

Page 27: Big Data Architectural Patterns

[Diagram: Cache, NoSQL, SQL, Search, and Amazon Glacier placed by data temperature (hot data, warm data, cold data) and by structure (low to high). From hot to cold: request rate high → low, cost/GB high → low, latency low → high, data volume low → high.]

Page 28: Big Data Architectural Patterns

What Data Store Should I Use?

| | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon Elasticsearch | Amazon S3 | Amazon Glacier |
|---|---|---|---|---|---|---|
| Average latency | ms | ms | ms, sec | ms, sec | ms, sec, min (~ size) | hrs |
| Typical data stored | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | MB–PB (no limit) | GB–PB (no limit) |
| Typical item size | B-KB | KB (400 KB max) | KB (64 KB max) | KB (2 GB max) | KB-TB (5 TB max) | GB (40 TB max) |
| Request rate | High – very high | Very high (no limit) | High | High | Low – high (no limit) | Very low |
| Storage cost GB/month | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢/10 |
| Durability | Low - moderate | Very high | Very high | High | Very high | Very high |
| Availability | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ |

Data temperature (left to right): hot data, warm data, cold data

Page 29: Big Data Architectural Patterns

Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?

“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

| Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
|---|---|---|---|
| 300 | 2,048 | 1,483 | 777,600,000 |
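The figures in this table follow from the request rate and object size. The short calculation below reproduces them, assuming a 30-day month and binary gigabytes (2^30 bytes); both assumptions are inferred from the numbers rather than stated on the slide, and `monthly_volume` is a hypothetical helper name.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month
BYTES_PER_GB = 2 ** 30                 # assumes binary gigabytes


def monthly_volume(writes_per_sec, object_bytes):
    """Return (objects per month, total GB per month) for a steady write rate."""
    objects = writes_per_sec * SECONDS_PER_MONTH
    total_gb = objects * object_bytes / BYTES_PER_GB
    return objects, total_gb


objects_2kb, gb_2kb = monthly_volume(300, 2048)
objects_32kb, gb_32kb = monthly_volume(300, 32768)

print(f"{objects_2kb:,} objects, {gb_2kb:,.0f} GB/month")    # 2 KB objects
print(f"{objects_32kb:,} objects, {gb_32kb:,.0f} GB/month")  # 32 KB objects
```

The second call uses the 32,768-byte object size from the later scenario slide: the object count stays the same, while the stored volume grows sixteenfold.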

Page 30: Big Data Architectural Patterns

Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?

https://calculator.s3.amazonaws.com/index.html

Simple Monthly Calculator

Page 31: Big Data Architectural Patterns

| Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
|---|---|---|---|
| 300 | 2,048 | 1,483 | 777,600,000 |

Amazon S3 or Amazon DynamoDB?

Page 32: Big Data Architectural Patterns

| | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
|---|---|---|---|---|
| Scenario 1 | 300 | 2,048 | 1,483 | 777,600,000 |
| Scenario 2 | 300 | 32,768 | 23,730 | 777,600,000 |

Recommendation labels on the slide: use Amazon S3, use Amazon DynamoDB

Page 33: Big Data Architectural Patterns

Process / analyze

[Diagram: COLLECT → STORE → PROCESS / ANALYZE]
• COLLECT: mobile apps, web apps, devices, messaging, sensors & IoT platforms (AWS IoT), data centers (AWS Direct Connect), AWS Import/Export Snowball, logging (Amazon CloudWatch, AWS CloudTrail); data arrives as RECORDS, DOCUMENTS, FILES, MESSAGES, and STREAMS
• STORE: Amazon Elasticsearch Service (Search), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (Cache), Amazon S3 (File), Amazon SQS (Message), and Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams (Stream); stores are marked hot to warm

Page 34: Big Data Architectural Patterns

PROCESS / ANALYZE

Page 35: Big Data Architectural Patterns

Process / Analyze

• Batch – Minutes or hours on cold data
    • Daily/weekly/monthly reports

• Interactive – Seconds on warm/cold data
    • Self-service dashboards

• Messaging – Milliseconds or seconds on hot data
    • Message/event buffering

• Streaming – Milliseconds or seconds on hot data
    • Billing/fraud alerts, 1 minute metrics

Page 36: Big Data Architectural Patterns

What Analytics Technology Should I Use?

| | Amazon Redshift | Amazon EMR: Presto | Amazon EMR: Spark | Amazon EMR: Hive |
|---|---|---|---|---|
| Query latency | Low | Low | Low | High |
| Durability | High | High | High | High |
| Data volume | 1.6 PB max | ~Nodes | ~Nodes | ~Nodes |
| AWS managed | Yes | Yes | Yes | Yes |
| Storage | Native | HDFS / S3 | HDFS / S3 | HDFS / S3 |
| SQL compatibility | High | High | Low (SparkSQL) | Medium (HQL) |

Page 37: Big Data Architectural Patterns

Predictions via Machine Learning

ML gives computers the ability to learn without being explicitly programmed

Machine learning algorithms:

Supervised learning ← “teach” program
- Classification ← Is this transaction fraud? (yes/no)
- Regression ← Customer life-time value?

Unsupervised learning ← Let it learn by itself
- Clustering ← Market segmentation
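As a minimal illustration of supervised learning, the regression case can be sketched with ordinary least squares on a single feature. This is not Amazon ML or Spark ML: the `fit_line` helper and the training data (months as a customer versus total spend) are hypothetical, chosen only to show "teaching" a model from labeled examples and then predicting an unseen value.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical labeled examples: months as a customer -> total spend.
months = [1, 2, 3, 4]
spend = [12.0, 19.0, 31.0, 38.0]

intercept, slope = fit_line(months, spend)

# Predict spend for a customer at month 6, which is not in the training data.
predicted = intercept + slope * 6
print(f"spend ≈ {intercept:.1f} + {slope:.1f} × months; month 6 → {predicted:.1f}")
```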

Page 38: Big Data Architectural Patterns

Tools and Frameworks

Machine Learning
• Amazon ML, Amazon EMR (Spark ML)

Interactive
• Amazon Redshift, Amazon EMR (Presto, Spark)

Batch
• Amazon EMR (MapReduce, Hive, Pig, Spark)

Messaging
• Amazon SQS application on Amazon EC2

Streaming
• Micro-batch: Spark Streaming, KCL
• Real-time: Amazon Kinesis Analytics, Storm, AWS Lambda, KCL

[Diagram: PROCESS / ANALYZE stage, ranging from fast to slow, with Amazon Machine Learning (ML); Amazon Redshift and Presto on Amazon EMR (Interactive); Amazon EMR (Batch); Amazon SQS apps on Amazon EC2 (Message); Amazon Kinesis Analytics, KCL apps on Amazon EC2, and AWS Lambda (Stream)]

Page 39: Big Data Architectural Patterns

What Streaming / Messaging Technology Should I Use?

| | Spark Streaming | Apache Storm | Kinesis KCL Application | AWS Lambda | Amazon SQS Apps |
|---|---|---|---|---|---|
| Scale | ~ Nodes | ~ Nodes | ~ Nodes | Automatic | ~ Nodes |
| Micro-batch or real-time | Micro-batch | Real-time | Near-real-time | Near-real-time | Near-real-time |
| AWS managed service | Yes (EMR) | No (EC2) | No (KCL + EC2 + Auto Scaling) | Yes | No (EC2 + Auto Scaling) |
| Scalability | No limits, ~ nodes | No limits, ~ nodes | No limits, ~ nodes | No limits | No limits |
| Availability | Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ |
| Programming languages | Java, Python, Scala | Any language via Thrift | Java, via MultiLangDaemon (.NET, Python, Ruby, Node.js) | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, …) |

Page 40: Big Data Architectural Patterns

What About ETL?

https://aws.amazon.com/big-data/partner-solutions/

[Diagram: ETL shown with the STORE and PROCESS / ANALYZE stages]

Page 41: Big Data Architectural Patterns

[Diagram: COLLECT → STORE → PROCESS / ANALYZE → CONSUME, with ETL added]
• COLLECT: mobile apps, web apps, devices, messaging, sensors & IoT platforms (AWS IoT), data centers (AWS Direct Connect), AWS Import/Export Snowball, logging (Amazon CloudWatch, AWS CloudTrail); data arrives as RECORDS, DOCUMENTS, FILES, MESSAGES, and STREAMS
• STORE (hot to warm): Amazon Elasticsearch Service (Search), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (Cache), Amazon S3 (File), Amazon SQS (Message), and Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams (Stream)
• PROCESS / ANALYZE (fast to slow): Amazon Machine Learning (ML); Amazon Redshift and Presto on Amazon EMR (Interactive); Amazon EMR (Batch); Amazon SQS apps on Amazon EC2 (Message); Amazon Kinesis Analytics, KCL apps on Amazon EC2, and AWS Lambda (Stream)

Page 42: Big Data Architectural Patterns

CONSUME

Page 43: Big Data Architectural Patterns

[Diagram: the CONSUME stage, shown with COLLECT, ETL, STORE, and PROCESS / ANALYZE]

• Applications & API: apps & services
• Analysis and visualization: Amazon QuickSight (business users)
• Notebooks and IDE (data scientists, developers)

Page 44: Big Data Architectural Patterns

Putting It All Together

Page 45: Big Data Architectural Patterns

Reference architecture

[Diagram: COLLECT → STORE → PROCESS / ANALYZE → CONSUME, with ETL]

• COLLECT: mobile apps, web apps, devices, messaging, sensors & IoT platforms (AWS IoT), data centers (AWS Direct Connect), AWS Import/Export Snowball, logging (Amazon CloudWatch, AWS CloudTrail); data arrives as RECORDS, DOCUMENTS, FILES, MESSAGES, and STREAMS
• STORE (hot to warm): Amazon Elasticsearch Service (Search), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (Cache), Amazon S3 (File), Amazon SQS (Queue), and Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams (Stream)
• PROCESS / ANALYZE (fast to slow): Amazon Machine Learning (ML); Amazon Redshift and Presto on Amazon EMR (Interactive); Amazon EMR (Batch); Amazon SQS apps on Amazon EC2 (Message); Amazon Kinesis Analytics, KCL apps on Amazon EC2, and AWS Lambda (Stream)
• CONSUME: Amazon QuickSight (analysis & visualization), apps & services (API), notebooks, IDE

Page 46: Big Data Architectural Patterns

AWS Summit
Sanoma Big Data Migration Case
Sander Kieft

Page 47: Big Data Architectural Patterns

Sanoma, Publishing and Learning company

27 May 2016, AWS Summit – Sanoma Big Data Migration

• 2 Finnish newspapers
• Over 100 magazines in the Netherlands, Belgium and Finland
• 7 TV channels in Finland and the Netherlands, incl. on-demand platforms
• 200+ websites
• 100 mobile applications on various mobile platforms
• 30+ learning applications

Page 48: Big Data Architectural Patterns

Sanoma, Big Data use cases

§ Users of Sanoma’s websites, mobile applications, online TV products generate large volumes of data
§ We use this data to improve our products for our users and our advertisers
§ Use cases:

Dashboards and Reporting
• Reporting on various data sources
• Self Service Data Access
• Editorial Guidance

Product Improvements
• A/B testing
• Recommenders
• Search optimizations
• Adaptive learning/tutoring

Advertising optimization
• (Re)Targeting
• Attribution
• Auction price optimization

Page 49: Big Data Architectural Patterns

Starting point

[Diagram: Sanoma's starting platform, spanning the Sanoma data center and the AWS cloud]
Components shown: Sources; Collecting; Kafka; Stream processing (Storm); Serving; Redis / Druid.io; Compute & Storage (Cloudera Dist. Hadoop); Hive; ETL (Jenkins & Jython); EDW; QlikView Reporting; R Studio; Hadoop User Environment (HUE); S3

Page 50: Big Data Architectural Patterns
Page 51: Big Data Architectural Patterns

60+ NODES

Page 52: Big Data Architectural Patterns

720 TB CAPACITY

Page 53: Big Data Architectural Patterns

475 TB DATA STORED (GROSS)

Page 54: Big Data Architectural Patterns

180 TB DATA STORED (NET)

Page 55: Big Data Architectural Patterns

150 GB DAILY GROWTH

Page 56: Big Data Architectural Patterns

50+ DATA SOURCES

Page 57: Big Data Architectural Patterns

200+ DATA PROCESSES

Page 58: Big Data Architectural Patterns

3000+ DAILY HADOOP JOBS

Page 59: Big Data Architectural Patterns

200+ DASHBOARDS

Page 60: Big Data Architectural Patterns

275+ AVG MONTHLY DASHBOARD USERS

Page 61: Big Data Architectural Patterns

Challenges

Positives
§ Users have been steadily increasing
§ Demand for:
– more real-time processing
– faster data availability
– higher availability (SLA)
– quicker availability of new versions
– specialized hardware (GPUs/SSDs)
– quicker experiments (fail fast)

Negatives
§ Data center network support end of life
§ Outsourcing of own operations team
§ Version upgrades harder to manage
§ Poor job isolation between test, dev, prod and interactive, ETL and data science workloads
§ Higher level of security and access control

Page 62: Big Data Architectural Patterns

Amazon Migration Scenarios

§ Three scenarios evaluated for moving

S: EC2, EBS, S3
§ All services on EC2
§ Only source data on S3

M: S3, EMR, EC2, EBS
§ All data on S3
§ EMR for Hadoop
§ EC2 only for utility services not provided by EMR

L: S3, EMR, EC2, Redshift, EBS
§ All data on S3
§ EMR for Hadoop
§ Interactive querying workload is moved to Redshift instead of Hive

Page 63: Big Data Architectural Patterns

Amazon Migration Scenarios (continued)

The same three scenarios (S, M, L) as on the previous page, with one added note:

§ Easier to leverage spot pricing, due to data on S3

Page 64: Big Data Architectural Patterns

Target architecture

[Diagram: Sanoma's target platform, spanning the Sanoma data center and the AWS cloud]
Components shown: Sources; Collecting; Kafka; Stream processing (Storm); Serving; Redis / Druid.io; ETL (Jenkins & Jython); EDW; QlikView Reporting; R Studio; Amazon RDS; S3 buckets (Sources, Work, Hive data warehouse); Amazon EMR with EMR Cluster 1 (core instances, task instances, task Spot Instances), Hive, HUE, and Zeppelin

Page 65: Big Data Architectural Patterns

Conclusion & Learnings

§ We’re almost done with the migration. Running in parallel now.
§ Setup solves our challenges, some still require work
§ Take your time testing and validating your setup
– If you have time, rethink whole setup
– No time, move as-is first and then start optimizing

§ EMR
– Check data formats! RC vs ORC/Parquet
– Start jobs from master node
– Break up long-running jobs into shorter, independent ones
– Spot pricing & pre-empted nodes with Spark
– HUE and Zeppelin metadata on RDS
– Research EMRFS behavior for your use case

§ S3
– Bucket structure impacts performance
– Set up access control, human error will occur
– Uploading data takes time. Start early!
– Check Snowball or new upload service

Page 66: Big Data Architectural Patterns
Page 67: Big Data Architectural Patterns

Design Patterns

Page 68: Big Data Architectural Patterns

Primitive: Multi-Stage Decoupled “Data Bus”

• Multiple stages
• Storage decouples multiple processing stages

[Diagram: Store → Process → Store → Process]
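The store → process → store → process chain can be sketched with plain Python lists standing in for the stores. The stage functions `parse` and `aggregate` and the sample records are hypothetical; the point is only that each stage reads from one store and writes to the next, so stages can be rerun, replaced, or joined by new readers independently.

```python
from collections import Counter


def parse(raw_store):
    """Stage 1: read raw lines, write structured records to a new store."""
    return [dict(zip(("user", "action"), line.split(","))) for line in raw_store]


def aggregate(parsed_store):
    """Stage 2: read structured records, write per-action counts."""
    return Counter(record["action"] for record in parsed_store)


# Store 1: raw, append-only input (hypothetical sample data).
raw_store = ["alice,click", "bob,view", "alice,view"]

parsed_store = parse(raw_store)     # Process, then Store 2
answers = aggregate(parsed_store)   # Process, then Store 3 (answers)

# Because stage 1's output is stored, another stage can reuse it independently.
users = sorted({record["user"] for record in parsed_store})
print(dict(answers), users)
```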

Page 69: Big Data Architectural Patterns

Primitive: Multiple Stream Processing Applications Can Read from Amazon Kinesis

[Diagram: Amazon Kinesis (store) is read by two processing applications, AWS Lambda and the Amazon Kinesis S3 Connector; the stores shown after processing are Amazon DynamoDB and Amazon S3]
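Several applications can read the same stream because each one tracks its own position in an append-only log. The sketch below is an in-memory stand-in, not the Kinesis API or the KCL: the `Stream` and `Consumer` classes, the dictionary standing in for a key-value table, and the list standing in for an object store are all hypothetical.

```python
class Stream:
    """In-memory append-only log (a stand-in, not the Kinesis API)."""

    def __init__(self):
        self._records = []

    def put(self, record):
        self._records.append(record)

    def read_from(self, position):
        return self._records[position:]


class Consumer:
    """Each application tracks its own position (checkpoint) in the stream."""

    def __init__(self, stream, handler):
        self.stream = stream
        self.handler = handler
        self.checkpoint = 0

    def poll(self):
        batch = self.stream.read_from(self.checkpoint)
        for record in batch:
            self.handler(record)
        self.checkpoint += len(batch)


stream = Stream()
latest_by_key = {}  # stands in for a key-value table
archive = []        # stands in for an object store

updater = Consumer(stream, lambda r: latest_by_key.update({r["key"]: r["value"]}))
archiver = Consumer(stream, archive.append)

stream.put({"key": "a", "value": 1})
stream.put({"key": "a", "value": 2})
updater.poll()    # reads both records
stream.put({"key": "b", "value": 3})
updater.poll()    # reads only the new record
archiver.poll()   # reads all three, unaffected by the updater
print(latest_by_key, len(archive))
```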

Page 70: Big Data Architectural Patterns

Primitive: Analysis Frameworks Could Read from Multiple Data Stores

[Diagram: store and process stages. Components shown: Amazon Kinesis, AWS Lambda, Amazon Kinesis S3 Connector, Amazon S3, Amazon DynamoDB, and Amazon EMR running Spark Streaming and Spark SQL]

Page 71: Big Data Architectural Patterns

Data Temperature vs. Processing Speed

[Diagram: data stores arranged by data temperature (hot to cold) and processing frameworks by processing speed (fast to slow), leading to answers]
• Data stores: Amazon Kinesis, Apache Kafka, Amazon DynamoDB, Amazon S3
• Processing: Spark Streaming, Apache Storm, AWS Lambda, KCL apps, native apps, Amazon Redshift, Hive, Spark, Presto

Page 72: Big Data Architectural Patterns

Real-time Analytics

[Diagram: store and process stages for real-time analytics]
• Stream stores: Apache Kafka, Amazon Kinesis, DynamoDB Streams
• Processing: KCL, AWS Lambda, Spark Streaming, Apache Storm, Amazon EMR, Amazon ML
• Outputs: Amazon SNS, Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES
• Output labels: notifications, alert, app state, real-time prediction, KPI

Page 73: Big Data Architectural Patterns

Message / Event Processing

[Diagram: store and process stages for message and event processing]
• Input: messages / events into Amazon SQS, including an Amazon SQS priority queue
• Processing: Amazon SQS apps in an Auto Scaling group
• Outputs: Amazon SNS (subscribers), Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES
• Output labels: publish, app state, KPI

Page 74: Big Data Architectural Patterns

Interactive & Batch Analytics

[Diagram: store, process, and consume stages]
• Store: Amazon S3, Amazon Kinesis Firehose
• Batch: Amazon EMR (Hive, Pig, Spark)
• Interactive: Amazon Redshift, Amazon EMR (Presto, Spark)
• Amazon ML: batch prediction, real-time prediction
• Consume

Page 75: Big Data Architectural Patterns

Lambda Architecture

[Diagram: a data stream feeds a batch layer and a speed layer; both deliver answers through a serving layer to applications]
• Data stream: Amazon Kinesis
• Batch layer: Amazon Kinesis S3 Connector, Amazon S3, Amazon Redshift, Amazon EMR (Presto, Hive, Pig, Spark)
• Speed layer: KCL, AWS Lambda, Storm, Spark Streaming on Amazon EMR
• Serving layer: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES
• Also shown: Amazon ML
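The three layers can be sketched over a small in-memory event log. This is a simplified illustration under stated assumptions: the event fields, the `LAST_BATCH_RUN` cutoff, and the helper names are hypothetical, and a real speed layer would update its view incrementally as events arrive rather than rescanning the log.

```python
from collections import Counter


def batch_view(master_dataset, cutoff):
    """Batch layer: recompute counts from the immutable log up to `cutoff`."""
    return Counter(e["page"] for e in master_dataset if e["ts"] <= cutoff)


def speed_view(master_dataset, cutoff):
    """Speed layer: counts for events newer than the last batch run."""
    return Counter(e["page"] for e in master_dataset if e["ts"] > cutoff)


def serve(batch, speed, page):
    """Serving layer: merge both views to answer a query."""
    return batch[page] + speed[page]


# Hypothetical append-only event log; `ts` is an illustrative timestamp.
events = [
    {"ts": 1, "page": "home"},
    {"ts": 2, "page": "news"},
    {"ts": 3, "page": "home"},
    {"ts": 4, "page": "home"},  # arrived after the last batch run
]
LAST_BATCH_RUN = 3

batch = batch_view(events, LAST_BATCH_RUN)
speed = speed_view(events, LAST_BATCH_RUN)
home_views = serve(batch, speed, "home")
print(f"home views: {home_views}")
```

Because the log is append-only, the batch view can always be rebuilt from scratch, and the speed view only needs to cover the gap since the last batch run.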

Page 76: Big Data Architectural Patterns

Summary

Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers

Use the right tool for the job
• Data structure, latency, throughput, access patterns

Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer

Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin

Big data ≠ big cost

Page 77: Big Data Architectural Patterns

Thank you!

aws.amazon.com/big-data