big data architectural patterns

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Arie Leeuwesteijn, Solutions Architect, AWS

May 2016

Big Data Architectural Patterns and Best Practices on AWS

Agenda

• Big data challenges
• How to simplify big data processing
• What technologies should you use?
    • Why?
    • How?
• Reference architecture
• Sanoma use case
• Design patterns

Ever Increasing Big Data

Volume

Velocity

Variety

Big Data Evolution

Batch processing

Stream processing

Machine learning

Plethora of Tools

• Amazon Glacier
• S3
• DynamoDB
• RDS
• EMR
• Amazon Redshift
• Data Pipeline
• Amazon Kinesis
• Kinesis-enabled app
• Lambda
• ML
• SQS
• ElastiCache
• DynamoDB Streams
• Amazon Elasticsearch Service

Big Data Challenges

Is there a reference architecture?

What tools should I use?

How?

Why?

Architectural Principles

• Decoupled “data bus”
    • Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job
    • Data structure, latency, throughput, access patterns
• Use Lambda architecture ideas
    • Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
    • Scalable/elastic, available, reliable, secure, no/low admin
• Big data ≠ big cost

Simplify Big Data Processing

COLLECT STORE PROCESS/ANALYZE CONSUME

Time to answer (latency)

Throughput

Cost

COLLECT

Types of Data

Database records

Search documents

Log files

Messaging events

Devices / sensors / IoT stream

[Diagram: COLLECT → STORE. Sources: applications (mobile apps, web apps, data centers via AWS Direct Connect), logging (Amazon CloudWatch, AWS CloudTrail), transport (AWS Import/Export Snowball), messaging, and IoT (devices, sensors & IoT platforms via AWS IoT). Data types: RECORDS, DOCUMENTS, FILES, MESSAGES, STREAMS. Storage types: database, search, file store, queue, stream storage.]

Message & Stream Storage

Amazon SQS
• Managed message queue service

Apache Kafka
• High-throughput distributed messaging system

Amazon Kinesis Streams
• Managed stream storage + processing

Amazon Kinesis Firehose
• Managed data delivery

Amazon DynamoDB
• Managed NoSQL database
• Tables can be stream-enabled


Why Stream Storage?

• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Streaming MapReduce
• Parallel consumption

[Diagram: producers write ordered records (1, 2, 3, 4 per color) into a DynamoDB stream, Amazon Kinesis stream, or Kafka topic with two shards/partitions (shard 1 / partition 1, shard 2 / partition 2). Consumer 1 reports count of red = 4 and count of violet = 4; Consumer 2 reports count of blue = 4 and count of green = 4.]
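The per-shard ordering and parallel counting described here can be sketched in a few lines. This is a minimal in-memory illustration, not the Kinesis or Kafka API: the names `shard_for`, `produce`, and `consume`, the choice of MD5 as the hash, and the two-shard setup are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from hashlib import md5

NUM_SHARDS = 2  # assumption: two shards/partitions, as in the slide


def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard with a stable hash (hypothetical helper)."""
    digest = md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def produce(records, num_shards: int = NUM_SHARDS):
    """Append (key, sequence) records to per-shard logs in arrival order."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, num_shards)].append((key, seq))
    return shards


def consume(shard_log):
    """One consumer per shard: count the records seen for each key."""
    return Counter(key for key, _ in shard_log)


# Four colors, each emitting records 1..4, interleaved on the producer side.
COLORS = ("red", "violet", "blue", "green")
records = [(color, seq) for seq in range(1, 5) for color in COLORS]

shards = produce(records)
counts = {shard_id: consume(log) for shard_id, log in shards.items()}
```

Because every record with the same key lands in the same shard, each consumer sees a given color's records in order and can count them without coordinating with other consumers. Which color ends up in which shard depends on the hash, so it may differ from the slide.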

What About Queues & Pub/Sub?

• Decouple producers & consumers/subscribers
• Persistent buffer
• Collect multiple streams
• No client ordering
• No parallel consumption for Amazon SQS
    • Amazon SNS can route to multiple queues or Lambda functions
• No streaming MapReduce

[Diagram: producers write to an Amazon SQS queue that is read by consumers; producers publish to an Amazon SNS topic whose subscribers are an AWS Lambda function and an Amazon SQS queue.]

What Stream Storage Should I Use?

| Attribute | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS |
| AWS managed service | Yes | Yes | Yes | No | Yes |
| Guaranteed ordering | Yes | Yes | Yes | Yes | No |
| Delivery | Exactly-once | At-least-once | Exactly-once | At-least-once | At-least-once |
| Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days |
| Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ |
| Scale / throughput | No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic |
| Parallel clients | Yes | Yes | No | Yes | No |
| Stream MapReduce | Yes | Yes | N/A | Yes | N/A |
| Record/object size | 400 KB | 1 MB | Amazon Redshift row size | Configurable | 256 KB |
| Cost | Higher (table cost) | Low | Low | Low (+admin) | Low-medium |

[Diagram: the COLLECT → STORE reference diagram with hot and warm annotations, adding Amazon S3 as the file store alongside stream storage (Apache Kafka) and message storage (Amazon SQS).]

File Storage

Why is Amazon S3 Good for Big Data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Multiple distinct (Spark, Hive, Presto) clusters can use the same data
• Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policy
• Secure – SSL, client/server-side encryption at rest
• Low cost

What about HDFS & Amazon Glacier?

• Use HDFS for very frequently accessed (hot) data

• Use Amazon S3 Standard for frequently accessed data

• Use Amazon S3 Standard – IA for infrequently accessed data

• Use Amazon Glacier for archiving cold data
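The Standard → Standard – IA → Glacier tiering can be expressed as an S3 lifecycle rule. The sketch below only builds a dictionary shaped like the lifecycle configuration the S3 API accepts and makes no AWS call. The helper name `tiering_rule`, the rule ID, the `logs/` prefix, and the 30/90-day thresholds are illustrative assumptions, not recommendations from the talk.

```python
def tiering_rule(prefix: str, ia_after_days: int, glacier_after_days: int) -> dict:
    """Build a lifecycle configuration that moves objects to colder tiers.

    Hypothetical helper: returns a plain dict and does not contact AWS.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": "tiering-example",        # assumed rule name
                "Filter": {"Prefix": prefix},   # objects the rule applies to
                "Status": "Enabled",
                "Transitions": [
                    # Infrequently accessed data: Standard -> Standard-IA
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    # Cold data: archive to Glacier
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = tiering_rule("logs/", ia_after_days=30, glacier_after_days=90)
```

A dictionary of this shape could then be passed to an SDK call that sets the bucket's lifecycle configuration. The day thresholds should come from observed access patterns rather than from this example.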

Cache, database, search

[Diagram: the COLLECT → STORE reference diagram, extended with Amazon ElastiCache (cache), Amazon DynamoDB (NoSQL), Amazon RDS (SQL), and Amazon Elasticsearch Service (search), alongside Amazon S3 (file), Amazon SQS (message), and Apache Kafka (stream).]

Database Anti-pattern

[Diagram: a single database tier serving as the entire data tier.]

Best Practice - Use the Right Tool for the Job

Data tier:

• Search
• Cache: Amazon ElastiCache (Redis, Memcached)
• SQL: Amazon Aurora, MySQL, PostgreSQL, Oracle, SQL Server, MariaDB
• NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB

Materialized Views


KCL

What Data Store Should I Use?

Data structure → Fixed schema, JSON, key-value

Access patterns → Store data in the format you will access it

Data / access characteristics → Hot, warm, cold

Cost → Right cost

Data Structure and Access Patterns

| Access pattern | What to use? |
| Put/Get (key, value) | Cache, NoSQL |
| Simple relationships → 1:N, M:N | NoSQL |
| Cross table joins, transaction, SQL | SQL |
| Faceting, search, free text | Search |

| Data structure | What to use? |
| Fixed schema | SQL, NoSQL |
| Schema-free (JSON) | NoSQL, Search |
| (Key, value) | Cache, NoSQL |
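The two lookups above can be combined: a store type is a candidate only when it fits both the data structure and the access pattern. The sketch below encodes the slide's pairings as dictionaries. The key names and the helper `candidate_stores` are assumptions introduced for illustration.

```python
# Pairings copied from the slide; the dictionary keys are hypothetical labels.
ACCESS_PATTERN_TO_STORE = {
    "put_get_key_value": ("Cache", "NoSQL"),
    "simple_relationships": ("NoSQL",),          # 1:N, M:N
    "joins_transactions_sql": ("SQL",),
    "faceting_search_free_text": ("Search",),
}

DATA_STRUCTURE_TO_STORE = {
    "fixed_schema": ("SQL", "NoSQL"),
    "schema_free_json": ("NoSQL", "Search"),
    "key_value": ("Cache", "NoSQL"),
}


def candidate_stores(data_structure: str, access_pattern: str) -> list:
    """Return store types that suit both the data structure and the access pattern."""
    by_structure = DATA_STRUCTURE_TO_STORE[data_structure]
    by_access = ACCESS_PATTERN_TO_STORE[access_pattern]
    return [store for store in by_structure if store in by_access]


relational = candidate_stores("fixed_schema", "joins_transactions_sql")
documents = candidate_stores("schema_free_json", "faceting_search_free_text")
lookups = candidate_stores("key_value", "put_get_key_value")
mismatch = candidate_stores("fixed_schema", "faceting_search_free_text")
```

An empty result, as in the last call, signals that no single store type fits both needs. In that case the data would typically be kept in more than one store, each serving its own access pattern.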

What Is the Temperature of Your Data / Access ?

| Attribute | Hot | Warm | Cold |
| Volume | MB–GB | GB–TB | PB |
| Item size | B–KB | KB–MB | KB–TB |
| Latency | ms | ms, sec | min, hrs |
| Durability | Low – high | High | Very high |
| Request rate | Very high | High | Low |
| Cost/GB | $$-$ | $-¢¢ | ¢ |

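Data / Access Characteristics: Hot, Warm, Cold

[Diagram: data stores (cache, NoSQL, SQL, search, Amazon Glacier) placed along a hot data → warm data → cold data axis. From hot to cold: request rate high → low, cost/GB high → low, latency low → high, data volume low → high. A second axis shows structure, low → high.]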

What Data Store Should I Use?

| Attribute | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon Elasticsearch | Amazon S3 | Amazon Glacier |
| Average latency | ms | ms | ms, sec | ms, sec | ms, sec, min (~ size) | hrs |
| Typical data stored | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | MB–PB (no limit) | GB–PB (no limit) |
| Typical item size | B-KB | KB (400 KB max) | KB (64 KB max) | KB (2 GB max) | KB-TB (5 TB max) | GB (40 TB max) |
| Request rate | High – very high | Very high (no limit) | High | High | Low – high (no limit) | Very low |
| Storage cost GB/month | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢/10 |
| Durability | Low - moderate | Very high | Very high | High | Very high | Very high |
| Availability | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ |

Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?

“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

| Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
| 300 | 2,048 | 1,483 | 777,600,000 |

Simple Monthly Calculator: https://calculator.s3.amazonaws.com/index.html

Amazon S3 or Amazon DynamoDB?

| Scenario | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
| Scenario 1 | 300 | 2,048 | 1,483 | 777,600,000 |
| Scenario 2 | 300 | 32,768 | 23,730 | 777,600,000 |
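The monthly figures in both scenarios follow directly from the request rate and the object size. The sketch below reproduces them, assuming a 30-day month and 1 GB = 1024³ bytes. Both are assumptions that happen to match the slide's numbers, and the helper name `monthly_volume` is hypothetical. It does not estimate cost, since prices change over time.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30          # assumption: 30-day month
BYTES_PER_GB = 1024 ** 3     # assumption: binary gigabytes


def monthly_volume(writes_per_sec: int, object_size_bytes: int):
    """Return (objects per month, total size in GB per month)."""
    objects = writes_per_sec * SECONDS_PER_DAY * DAYS_PER_MONTH
    total_gb = objects * object_size_bytes / BYTES_PER_GB
    return objects, total_gb


objects_1, gb_1 = monthly_volume(300, 2_048)    # Scenario 1: 2 KB objects
objects_2, gb_2 = monthly_volume(300, 32_768)   # Scenario 2: 32 KB objects

print(f"Scenario 1: {objects_1:,} objects, {gb_1:,.0f} GB/month")
print(f"Scenario 2: {objects_2:,} objects, {gb_2:,.0f} GB/month")
```

The request count is identical in both scenarios; only the stored volume grows with object size. Feeding these numbers into a pricing calculator shows how the trade-off between per-request and per-GB charges shifts between the two services.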

Amazon S3

Amazon DynamoDB

use

use

[Diagram: the reference diagram so far, COLLECT → STORE → PROCESS / ANALYZE, with stream (Apache Kafka), message (Amazon SQS), file (Amazon S3), cache (Amazon ElastiCache), NoSQL (Amazon DynamoDB), SQL (Amazon RDS), and search stores annotated hot to warm.]

Process / Analyze

• Batch – minutes or hours on cold data
    • Daily/weekly/monthly reports
• Interactive – seconds on warm/cold data
    • Self-service dashboards
• Messaging – milliseconds or seconds on hot data
    • Message/event buffering
• Streaming – milliseconds or seconds on hot data
    • Billing/fraud alerts, 1-minute metrics

What Analytics Technology Should I Use?

| Attribute | Amazon Redshift | Amazon EMR: Presto | Amazon EMR: Spark | Amazon EMR: Hive |
| Query latency | Low | Low | Low | High |
| Durability | High | High | High | High |
| Data volume | 1.6 PB max | ~ Nodes | ~ Nodes | ~ Nodes |
| AWS managed | Yes | Yes | Yes | Yes |
| Storage | Native | HDFS / S3 | HDFS / S3 | HDFS / S3 |
| SQL compatibility | High | High | Low (SparkSQL) | Medium (HQL) |

Predictions via Machine Learning

ML gives computers the ability to learn without being explicitly programmed

Machine learning algorithms:

• Supervised learning ← “teach” program
    • Classification ← Is this transaction fraud? (yes/no)
    • Regression ← Customer lifetime value?
• Unsupervised learning ← Let it learn by itself
    • Clustering ← Market segmentation

Tools and Frameworks

Machine Learning
• Amazon ML, Amazon EMR (Spark ML)

Interactive
• Amazon Redshift, Amazon EMR (Presto, Spark)

Batch
• Amazon EMR (MapReduce, Hive, Pig, Spark)

Messaging
• Amazon SQS application on Amazon EC2

Streaming
• Micro-batch: Spark Streaming, KCL
• Real-time: Amazon Kinesis Analytics, Storm, AWS Lambda, KCL
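The difference between micro-batch and real-time (per-record) processing can be shown with a toy stream. This is a simplified sketch, not Spark Streaming or KCL code. It groups records by count, whereas real micro-batch engines usually group by time interval, and the helper `micro_batches` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a stream into small fixed-size batches (count-based for simplicity)."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = range(1, 8)  # seven records: 1..7

# Real-time style: each record is handled as soon as it arrives.
per_record = [event * 2 for event in events]

# Micro-batch style: records are buffered and handled one batch at a time.
per_batch = [sum(batch) for batch in micro_batches(events, batch_size=3)]
```

Batching amortizes per-invocation overhead at the cost of added latency, because a record waits until its batch is complete before it is processed.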

[Diagram: PROCESS / ANALYZE tools annotated fast to slow: streaming (Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda), message (Amazon SQS apps on Amazon EC2), interactive (Amazon Redshift, Presto on Amazon EMR), batch (Amazon EMR), and ML (Amazon Machine Learning).]

What Streaming / Messaging Technology Should I Use?

| Attribute | Spark Streaming | Apache Storm | Kinesis KCL application | AWS Lambda | Amazon SQS apps |
| Scale | ~ Nodes | ~ Nodes | ~ Nodes | Automatic | ~ Nodes |
| Micro-batch or real-time | Micro-batch | Real-time | Near-real-time | Near-real-time | Near-real-time |
| AWS managed service | Yes (EMR) | No (EC2) | No (KCL + EC2 + Auto Scaling) | Yes | No (EC2 + Auto Scaling) |
| Scalability | No limits ~ nodes | No limits ~ nodes | No limits ~ nodes | No limits | No limits |
| Availability | Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ |
| Programming languages | Java, Python, Scala | Any language via Thrift | Java, via MultiLangDaemon (.NET, Python, Ruby, Node.js) | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, …) |

What About ETL?

https://aws.amazon.com/big-data/partner-solutions/

[Diagram: the COLLECT → STORE → PROCESS / ANALYZE → CONSUME reference diagram with an ETL step added alongside the STORE and PROCESS / ANALYZE stages.]

CONSUME

• Applications & API: apps & services
• Analysis and visualization: Amazon QuickSight (business users)
• Notebooks and IDE (data scientists, developers)

Putting It All Together

Reference architecture

[Diagram: the complete pipeline.
COLLECT: mobile apps, web apps, devices (AWS IoT), logging (Amazon CloudWatch, AWS CloudTrail), messaging; data arrives as RECORDS, DOCUMENTS, FILES, MESSAGES, and STREAMS.
STORE: stream (Apache Kafka), queue (Amazon SQS), file (Amazon S3), cache (Amazon ElastiCache), NoSQL (Amazon DynamoDB), SQL (Amazon RDS), search; annotated hot to warm.
ETL.
PROCESS / ANALYZE: streaming (Amazon KCL apps, AWS Lambda), message (Amazon SQS apps on Amazon EC2), interactive (Amazon Redshift, Presto), batch (Amazon EMR), ML; annotated fast to slow.
CONSUME: apps & services (API), analysis & visualization (Amazon QuickSight), notebooks, IDE.]

AWS Summit: Sanoma Big Data Migration Case
Sander Kieft

Sanoma, Publishing and Learning company

27 May 2016, AWS Summit: Sanoma Big Data Migration

• 2 Finnish newspapers
• Over 100 magazines in The Netherlands, Belgium and Finland
• 7 TV channels in Finland and The Netherlands, incl. on-demand platforms
• 200+ websites
• 100 mobile applications on various mobile platforms
• 30+ learning applications

Sanoma, Big Data use cases

§ Users of Sanoma’s websites, mobile applications, online TV products generate large volumes of data
§ We use this data to improve our products for our users and our advertisers
§ Use cases:

Dashboards and Reporting
• Reporting on various data sources
• Self Service Data Access
• Editorial Guidance

Product Improvements
• A/B testing
• Recommenders
• Search optimizations
• Adaptive learning/tutoring

Advertising optimization
• (Re)Targeting
• Attribution
• Auction price optimization

Starting point


[Diagram: starting-point architecture, split between the Sanoma data center and the AWS cloud. Components: sources, Kafka (collecting), Storm (stream processing), Redis / Druid.io (serving), compute & storage on Cloudera Dist. Hadoop, Hive, Hadoop User Environment (HUE), ETL (Jenkins & Jython), EDW, QlikView reporting, R Studio, and S3.]

• 60+ nodes
• 720 TB capacity
• 475 TB data stored (gross)
• 180 TB data stored (net)
• 150 GB daily growth
• 50+ data sources
• 200+ data processes
• 3000+ daily Hadoop jobs
• 200+ dashboards
• 275+ avg monthly dashboard users

Challenges

Positives
§ Users have been steadily increasing
§ Demand for ..
    – more real time processing
    – faster data availability
    – higher availability (SLA)
    – quicker availability of new versions
    – specialized hardware (GPUs/SSDs)
    – quicker experiments (Fail Fast)

Negatives
§ Data Center network support end of life
§ Outsourcing of own operations team
§ Version upgrades harder to manage
§ Poor job isolation between test, dev, prod and interactive, ETL and data science workloads
§ Higher level of security and access control


Amazon Migration Scenarios

§ Three scenarios evaluated for moving

S: EC2, EBS, S3
§ All services on EC2
§ Only source data on S3

M: S3, EMR, EC2, EBS
§ All data on S3
§ EMR for Hadoop
§ EC2 only for utility services not provided by EMR

L: S3, EMR, EC2, Redshift, EBS
§ All data on S3
§ EMR for Hadoop
§ Interactive querying workload is moved to Redshift instead of Hive

Easier to leverage spot pricing, due to data on S3


Target architecture


[Diagram: target architecture, split between the Sanoma data center and the AWS cloud. Components: sources, EDW, ETL (Jenkins & Jython), QlikView reporting, R Studio, Amazon RDS, S3 buckets (Sources, Work, Hive Data warehouse), Amazon EMR (EMR Cluster 1 with core instances, task instances, and task Spot Instances; HUE, Zeppelin, Hive), Kafka (collecting), Storm (stream processing), and Redis / Druid.io (serving).]

Conclusion & Learnings

§ We’re almost done with the migration. Running in parallel now.
§ Setup solves our challenges, some still require work
§ Take your time testing and validating your setup
    – If you have time, rethink whole setup
    – No time, move as-is first and then start optimizing

§ EMR
    – Check data formats! RC vs ORC/Parquet
    – Start jobs from master node
    – Break up long running jobs, shorter independent
    – Spot pricing & pre-empted nodes w/ Spark
    – HUE and Zeppelin metadata on RDS
    – Research EMRFS behavior for your use case
§ S3
    – Bucket structure impacts performance
    – Set up access control, human error will occur
    – Uploading data takes time. Start early!
        § Check Snowball or new upload service

Design Patterns

Primitive: Multi-Stage Decoupled “Data Bus”

• Multiple stages
• Storage decouples multiple processing stages

[Diagram: Store → Process → Store → Process]
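The store → process → store → process chain can be sketched with in-memory stand-ins. This is an illustration only: the `Store` class and the `run_stage` helper are hypothetical and take the place of durable services such as a stream, a queue, or an object store.

```python
from collections import deque


class Store:
    """In-memory stand-in for a durable store between processing stages."""

    def __init__(self):
        self._items = deque()

    def put(self, item):
        self._items.append(item)

    def drain(self):
        """Yield and remove items in arrival order."""
        while self._items:
            yield self._items.popleft()


def run_stage(source: Store, sink: Store, transform) -> None:
    """A processing stage knows only its input store and its output store."""
    for item in source.drain():
        sink.put(transform(item))


raw, cleaned, aggregated = Store(), Store(), Store()

for line in [" 3 ", "4", " 5"]:
    raw.put(line)

# Stage 1: parse raw text. Stage 2: square each number.
run_stage(raw, cleaned, lambda text: int(text.strip()))
run_stage(cleaned, aggregated, lambda number: number * number)

result = list(aggregated.drain())
```

Because each stage talks only to stores, a stage can be replaced, scaled, or rerun without changing its neighbors. That independence is the decoupling the slide refers to.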

Primitive: Multiple Stream Processing Applications Can Read from Amazon Kinesis

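[Diagram: one Amazon Kinesis stream (store) read by several processing applications: AWS Lambda, the Amazon Kinesis S3 Connector, and Amazon EMR, with Amazon DynamoDB and Amazon S3 as downstream stores.]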

Primitive: Analysis Frameworks Could Read from Multiple Data Stores

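[Diagram: stores (Amazon Kinesis, Amazon S3, Amazon DynamoDB) and processing components (AWS Lambda, Spark Streaming, the Amazon Kinesis S3 Connector, Spark SQL), with the analysis frameworks reading from more than one store.]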

Data Temperature vs. Processing Speed

[Diagram: processing speed (fast → slow) plotted against data temperature (hot → cold). Stores: Amazon Kinesis, Apache Kafka, Amazon DynamoDB, Amazon S3. Processing: Spark Streaming, Apache Storm, AWS Lambda, KCL apps, native apps, Amazon Redshift, Spark, Presto, Hive. Data goes in; answers come out.]

Real-time Analytics

[Diagram: stores (Amazon Kinesis, Apache Kafka, DynamoDB Streams) feed stream processors (KCL, AWS Lambda, Spark Streaming, Apache Storm, Amazon EMR). Outputs: alerts and notifications via Amazon SNS, real-time prediction via Amazon ML, and app state and KPIs in Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, and Amazon ES.]

Message / Event Processing

[Diagram: messages / events enter Amazon SQS, including a priority queue, and are processed by Amazon SQS apps in an Auto Scaling group. Outputs: publish to Amazon SNS subscribers, and app state and KPIs in Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, and Amazon ES.]

Interactive & Batch Analytics

[Diagram: Amazon Kinesis Firehose and Amazon S3 on the store side. Batch processing: Amazon EMR (Hive, Pig, Spark). Interactive processing: Amazon Redshift and Amazon EMR (Presto, Spark). Amazon ML provides batch prediction and real-time prediction. Results go to the consume stage.]

Lambda Architecture

[Diagram: a data stream in Amazon Kinesis feeds three layers. Batch layer: the Amazon Kinesis S3 Connector, Amazon S3, Amazon Redshift, and Amazon EMR (Presto, Hive, Pig, Spark). Speed layer: KCL, AWS Lambda, Storm, Spark Streaming on Amazon EMR, and Amazon ML. Serving layer: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, and Amazon ES. Each layer returns answers to applications.]
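The interplay of the batch, speed, and serving layers can be sketched with a page-view counter. This is a minimal in-memory illustration of the Lambda architecture idea, not code for any of the services named in the diagram. The names `batch_view`, `SpeedLayer`, and `serve`, and the page-view events, are assumptions for the example.

```python
from collections import Counter

master_log = []  # immutable, append-only log: events are added, never changed


def batch_view(events) -> Counter:
    """Batch layer: recompute the view from the full log (slow but complete)."""
    return Counter(event["page"] for event in events)


class SpeedLayer:
    """Speed layer: update a small view incrementally as events arrive."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event) -> None:
        self.view[event["page"]] += 1


def serve(batch: Counter, speed: Counter, page: str) -> int:
    """Serving layer: merge the batch view with the recent speed view."""
    return batch[page] + speed[page]


speed = SpeedLayer()

# Three events arrive and are covered by the next batch run.
for page in ["home", "news", "home"]:
    master_log.append({"page": page})
batch = batch_view(master_log)

# Two more events arrive after the batch run; only the speed layer has seen them.
for page in ["home", "sport"]:
    event = {"page": page}
    master_log.append(event)
    speed.on_event(event)

home_views = serve(batch, speed.view, "home")
sport_views = serve(batch, speed.view, "sport")
```

When the next batch run covers the newer events, the speed view for that period is discarded. Errors in the incremental path are therefore corrected by recomputation from the log.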

Summary

• Decoupled “data bus”
    • Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job
    • Data structure, latency, throughput, access patterns
• Use Lambda architecture ideas
    • Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
    • Scalable/elastic, available, reliable, secure, no/low admin
• Big data ≠ big cost

Thank you!

aws.amazon.com/big-data


Data & Analytics