
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Arie Leeuwesteijn, Solutions Architect, AWS

May 2016

Big Data Architectural Patterns and Best Practices on AWS

Agenda

• Big data challenges
• How to simplify big data processing
• What technologies should you use?
  • Why?
  • How?
• Reference architecture
• Sanoma use case
• Design patterns

Ever Increasing Big Data

Volume

Velocity

Variety

Big Data Evolution

Batch processing

Stream processing

Machine learning

Plethora of Tools

Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, Kinesis-enabled apps, AWS Lambda, Amazon ML, Amazon SQS, Amazon ElastiCache, DynamoDB Streams, Amazon Elasticsearch Service

Big Data Challenges

Is there a reference architecture?

What tools should I use?

How?

Why?

Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers

Use the right tool for the job
• Data structure, latency, throughput, access patterns

Use Lambda architecture ideas
• Immutable (append-only) log; batch/speed/serving layers

Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin

Big data ≠ big cost

Architectural Principles

Simplify Big Data Processing

COLLECT → STORE → PROCESS/ANALYZE → CONSUME

Time to answer (latency), throughput, cost

COLLECT

Types of Data

Database records

Search documents

Log files

Messaging events

Devices / sensors / IoT stream

[Diagram: COLLECT → STORE. Devices, sensors & IoT platforms (AWS IoT) emit streams into stream storage; mobile apps, web apps, and data centers (AWS Direct Connect) write database records; applications ship files via AWS Import/Export Snowball; logging flows through Amazon CloudWatch and AWS CloudTrail into search and file stores; messaging events land on a message queue.]

Amazon SQS
• Managed message queue service

Apache Kafka
• High-throughput distributed messaging system

Amazon Kinesis Streams
• Managed stream storage + processing

Amazon Kinesis Firehose
• Managed data delivery

Amazon DynamoDB
• Managed NoSQL database
• Tables can be stream-enabled

Message & Stream Storage

[Diagram: the COLLECT → STORE view again, highlighting stream and message storage — Amazon Kinesis Firehose, Amazon Kinesis Streams, Apache Kafka, Amazon DynamoDB Streams, and Amazon SQS — as the landing zone between producers and the stores.]

Why Stream Storage?

• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Streaming MapReduce
• Parallel consumption

[Diagram: ordered records 1–4 per key flow through shard 1 / partition 1 and shard 2 / partition 2 of a DynamoDB stream, Amazon Kinesis stream, or Kafka topic; Consumer 1 counts red = 4 and violet = 4 while Consumer 2 counts blue = 4 and green = 4, in parallel.]
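The shard diagram can be sketched in plain Python: records are partitioned by key, each key's records stay in order on exactly one shard, and one consumer per shard aggregates in parallel. This is a minimal illustration of the partitioning idea only, not the Kinesis, Kafka, or DynamoDB Streams APIs; the color keys mirror the diagram.

```python
from collections import defaultdict

def put_records(records, num_shards):
    """Partition (key, value) records: each key maps to exactly one shard,
    so per-key arrival order is preserved within that shard."""
    shards = defaultdict(list)
    for key, value in records:
        shards[hash(key) % num_shards].append((key, value))
    return shards

def consume(shard):
    """One consumer per shard counts records per key, in arrival order."""
    counts = defaultdict(int)
    for key, _ in shard:
        counts[key] += 1
    return dict(counts)

# Four ordered records per color, as in the diagram.
records = [(color, seq) for seq in (1, 2, 3, 4)
           for color in ("red", "violet", "blue", "green")]
shards = put_records(records, num_shards=2)
results = [consume(shard) for shard in shards.values()]
```

Because a key never spans shards, each consumer's per-key counts are complete on their own — which is what makes the parallel "streaming MapReduce" pattern work.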

What About Queues & Pub/Sub?
• Decouple producers & consumers/subscribers
• Persistent buffer
• Collect multiple streams
• No client ordering
• No parallel consumption for Amazon SQS
• Amazon SNS can route to multiple queues or λ functions
• No streaming MapReduce

[Diagram: producers publish to an Amazon SNS topic, which fans out to Amazon SQS queues and AWS Lambda functions consumed by subscribers.]

What Stream Storage Should I Use?

                      DynamoDB       Kinesis        Kinesis        Apache         Amazon
                      Streams        Streams        Firehose       Kafka          SQS
AWS managed service   Yes            Yes            Yes            No             Yes
Guaranteed ordering   Yes            Yes            Yes            Yes            No
Delivery              exactly-once   at-least-once  exactly-once   at-least-once  at-least-once
Data retention        24 hours       7 days         N/A            Configurable   14 days
Availability          3 AZ           3 AZ           3 AZ           Configurable   3 AZ
Scale / throughput    No limit,      No limit,      No limit,      No limit,      No limit,
                      ~ table IOPS   ~ shards       automatic      ~ nodes        automatic
Parallel clients      Yes            Yes            No             Yes            No
Stream MapReduce      Yes            Yes            N/A            Yes            N/A
Record/object size    400 KB         1 MB           Amazon         Configurable   256 KB
                                                    Redshift
                                                    row size
Cost                  Higher         Low            Low            Low (+admin)   Low–medium
                      (table cost)

Hot / Warm

[Diagram: the COLLECT → STORE view, adding file storage — Amazon S3 for files and logs — alongside the hot stream stores (Amazon Kinesis Streams, Amazon Kinesis Firehose, Apache Kafka, Amazon DynamoDB Streams) and Amazon SQS messages.]

File Storage

Why is Amazon S3 Good for Big Data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Multiple distinct clusters (Spark, Hive, Presto) can use the same data
• Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policy
• Secure – SSL, client/server-side encryption at rest
• Low cost
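The tiered-storage bullet corresponds to an S3 lifecycle configuration. A sketch of one as a Python dict follows; the `logs/` prefix and the day thresholds are illustrative assumptions, not recommendations, while the dict shape matches the S3 lifecycle configuration document.

```python
# Illustrative lifecycle document: objects under logs/ move from Standard
# to Standard-IA after 30 days and to Glacier after 90, then expire at 365.
# Prefix and day counts are assumptions chosen for the example.
lifecycle = {
    "Rules": [{
        "ID": "tier-then-archive",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
    }]
}
```

With boto3 this would be applied via `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle)`.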

What about HDFS & Amazon Glacier?

• Use HDFS for very frequently accessed (hot) data

• Use Amazon S3 Standard for frequently accessed data

• Use Amazon S3 Standard – IA for infrequently accessed data

• Use Amazon Glacier for archiving cold data

Cache, database, search

[Diagram: the COLLECT → STORE view, now adding the warm tier — Amazon Elasticsearch Service (search), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (cache), and Amazon S3 (file) — alongside the hot stream stores and Amazon SQS.]

Database Anti-pattern

[Diagram: everything in the data tier forced through a single database tier.]

Best Practice - Use the Right Tool for the Job

Data tier:
• Search – Amazon Elasticsearch Service
• Cache – Amazon ElastiCache (Redis, Memcached)
• SQL – Amazon Aurora, MySQL, PostgreSQL, Oracle, SQL Server, MariaDB
• NoSQL – Amazon DynamoDB, Cassandra, HBase, MongoDB

Materialized Views

[Diagram: a KCL application projects data from the database tier into Amazon Elasticsearch Service as a materialized view.]

What Data Store Should I Use?

Data structure → Fixed schema, JSON, key-value

Access patterns → Store data in the format you will access it

Data / access characteristics → Hot, warm, cold

Cost → Right cost

Data Structure and Access Patterns

Access pattern                          What to use?
Put/Get (key, value)                    Cache, NoSQL
Simple relationships → 1:N, M:N         NoSQL
Cross-table joins, transactions, SQL    SQL
Faceting, search, free text             Search

Data structure                          What to use?
Fixed schema                            SQL, NoSQL
Schema-free (JSON)                      NoSQL, Search
(Key, value)                            Cache, NoSQL
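The two tables combine as an intersection: a candidate store should fit both the data structure and the access pattern. A hypothetical helper makes that concrete — the category names are mine, not from the slides.

```python
def suggest_store(structure, access):
    """Intersect the slide's two decision tables (illustrative only)."""
    by_access = {
        "key-value": ["Cache", "NoSQL"],
        "simple-relationships": ["NoSQL"],            # 1:N, M:N
        "joins-transactions-sql": ["SQL"],
        "faceting-free-text": ["Search"],
    }
    by_structure = {
        "fixed-schema": ["SQL", "NoSQL"],
        "schema-free-json": ["NoSQL", "Search"],
        "key-value": ["Cache", "NoSQL"],
    }
    # A store must satisfy both dimensions.
    return [s for s in by_structure[structure] if s in by_access[access]]
```

For example, schema-free JSON with simple 1:N relationships narrows to NoSQL, while a fixed schema with cross-table joins narrows to SQL.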

What Is the Temperature of Your Data / Access ?

                Hot         Warm       Cold
Volume          MB–GB       GB–TB      PB
Item size       B–KB        KB–MB      KB–TB
Latency         ms          ms, sec    min, hrs
Durability      Low – high  High       Very high
Request rate    Very high   High       Low
Cost/GB         $$–$        $–¢¢       ¢

Data / Access Characteristics: Hot, Warm, Cold

[Diagram: from hot data to cold data, request rate and cost/GB run from high to low while latency and data volume run from low to high; cache and NoSQL serve hot data, search and SQL serve warm data, Amazon S3 and Amazon Glacier serve cold data.]

                  Amazon        Amazon        Amazon        Amazon         Amazon S3     Amazon
                  ElastiCache   DynamoDB      RDS/Aurora    Elasticsearch                Glacier
Average latency   ms            ms            ms, sec       ms, sec        ms, sec, min  hrs
                                                                           (~ size)
Typical data      GB            GB–TB         GB–TB         GB–TB          MB–PB         GB–PB
stored                          (no limit)    (64 TB max)                  (no limit)    (no limit)
Typical item      B–KB          KB            KB            KB             KB–TB         GB
size                            (400 KB max)  (64 KB max)   (2 GB max)     (5 TB max)    (40 TB max)
Request rate      High –        Very high     High          High           Low – high    Very low
                  very high     (no limit)                                 (no limit)
Storage cost      $$            ¢¢            ¢¢            ¢¢             ¢             ¢/10
(GB/month)
Durability        Low –         Very high     Very high     High           Very high     Very high
                  moderate
Availability      High, 2 AZ    Very high,    Very high,    High, 2 AZ     Very high,    Very high,
                                3 AZ          3 AZ                         3 AZ          3 AZ

Hot data  ←———————————————————————————————————————————→  Cold data


Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?

“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

Request rate     Object size   Total size    Objects
(writes/sec)     (bytes)       (GB/month)    per month
300              2,048         1,483         777,600,000

Simple Monthly Calculator: https://calculator.s3.amazonaws.com/index.html

Amazon S3 or Amazon DynamoDB?

             Request rate    Object size   Total size    Objects        Suggested
             (writes/sec)    (bytes)       (GB/month)    per month      store
Scenario 1   300             2,048         1,483         777,600,000    Amazon DynamoDB
Scenario 2   300             32,768        23,730        777,600,000    Amazon S3

COLLECT → STORE → PROCESS / ANALYZE

[Diagram: the process/analyze stage is added, consuming from the hot stream stores (Amazon Kinesis Streams, Amazon Kinesis Firehose, Apache Kafka, Amazon DynamoDB Streams, Amazon SQS) and the warm stores (Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon ElastiCache, Amazon Elasticsearch Service).]

Process / Analyze

• Batch – minutes or hours on cold data
  • Daily/weekly/monthly reports
• Interactive – seconds on warm/cold data
  • Self-service dashboards
• Messaging – milliseconds or seconds on hot data
  • Message/event buffering
• Streaming – milliseconds or seconds on hot data
  • Billing/fraud alerts, 1-minute metrics

What Analytics Technology Should I Use?

                   Amazon       Amazon EMR
                   Redshift     Presto       Spark           Hive
Query latency      Low          Low          Low             High
Durability         High         High         High            High
Data volume        1.6 PB max   ~ nodes      ~ nodes         ~ nodes
AWS managed        Yes          Yes          Yes             Yes
Storage            Native       HDFS / S3    HDFS / S3       HDFS / S3
SQL compatibility  High         High         Low (SparkSQL)  Medium (HQL)

Predictions via Machine Learning

ML gives computers the ability to learn without being explicitly programmed.

Machine learning algorithms:

Supervised learning ← “teach” the program
• Classification ← Is this transaction fraud? (yes/no)
• Regression ← Customer lifetime value?

Unsupervised learning ← let it learn by itself
• Clustering ← Market segmentation

Tools and Frameworks

Machine learning
• Amazon ML, Amazon EMR (Spark ML)

Interactive
• Amazon Redshift, Amazon EMR (Presto, Spark)

Batch
• Amazon EMR (MapReduce, Hive, Pig, Spark)

Messaging
• Amazon SQS application on Amazon EC2

Streaming
• Micro-batch: Spark Streaming, KCL
• Real-time: Amazon Kinesis Analytics, Storm, AWS Lambda, KCL
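The micro-batch approaches above (Spark Streaming, KCL-driven batches) differ from record-at-a-time systems by aggregating per fixed interval. A minimal plain-Python sketch of the windowing idea — not any framework's API — using hypothetical timestamped events:

```python
from collections import defaultdict

def micro_batches(events, window_secs):
    """Group (timestamp, key) events into fixed windows and count per key,
    the way a micro-batch engine emits one aggregate per interval."""
    batches = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        batches[ts // window_secs][key] += 1
    return {w: dict(counts) for w, counts in sorted(batches.items())}

events = [(0, "click"), (3, "click"), (7, "view"), (12, "click")]
result = micro_batches(events, window_secs=5)
# windows: 0 -> {click: 2}, 1 -> {view: 1}, 2 -> {click: 1}
```

The window size is the latency/throughput knob: the 1-minute metrics mentioned earlier correspond to `window_secs=60`, while true real-time systems process each record as it arrives.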

[Diagram: PROCESS / ANALYZE — fast hot-path processing via streaming (Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda) and messaging (Amazon SQS apps on Amazon EC2), beside interactive and batch engines (Amazon Redshift, Amazon EMR with Presto) and Amazon Machine Learning.]

What Streaming / Messaging Technology Should I Use?

                 Spark          Apache        Kinesis KCL        AWS             Amazon
                 Streaming      Storm         application        Lambda          SQS apps
Scale            ~ nodes        ~ nodes       ~ nodes            Automatic       ~ nodes
Micro-batch or   Micro-batch    Real-time     Near-real-time     Near-real-time  Near-real-time
real-time
AWS managed      Yes (EMR)      No (EC2)      No (KCL + EC2 +    Yes             No (EC2 +
service                                       Auto Scaling)                      Auto Scaling)
Scalability      No limits,     No limits,    No limits,         No limits       No limits
                 ~ nodes        ~ nodes       ~ nodes
Availability     Single AZ      Configurable  Multi-AZ           Multi-AZ        Multi-AZ
Programming      Java, Python,  Any language  Java, via Multi-   Node.js, Java,  AWS SDK languages
languages        Scala          via Thrift    LangDaemon (.NET,  Python          (Java, .NET,
                                              Python, Ruby,                      Python, …)
                                              Node.js)

What About ETL?

https://aws.amazon.com/big-data/partner-solutions/

[Diagram: ETL partner tools sit between STORE and PROCESS / ANALYZE; the reference architecture now reads COLLECT → STORE → PROCESS/ANALYZE → CONSUME, with the same collection sources, stream/message/file stores, and processing services as before.]

CONSUME

Amazon QuickSight

Apps & Services


Applications & API

Analysis and visualization

Notebooks

IDE

Business users

Data scientist, developers


Putting It All Together

Reference architecture

[Diagram: the complete pipeline. COLLECT — devices, sensors & IoT platforms (AWS IoT), mobile and web apps, data centers (AWS Direct Connect), AWS Import/Export Snowball, logging (Amazon CloudWatch, AWS CloudTrail). STORE — Amazon Kinesis Streams, Amazon Kinesis Firehose, Apache Kafka, Amazon DynamoDB Streams, Amazon SQS (hot); Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon ElastiCache, Amazon Elasticsearch Service (warm). PROCESS / ANALYZE — Amazon Kinesis Analytics, KCL apps, AWS Lambda, Amazon SQS apps on Amazon EC2 (streaming/messaging); Amazon Redshift, Amazon EMR with Presto (interactive/batch); Amazon Machine Learning. ETL runs alongside. CONSUME — Amazon QuickSight, apps & services, notebooks, and IDEs for business users, data scientists, and developers.]

AWS Summit – Sanoma Big Data Migration Case
Sander Kieft

Sanoma, Publishing and Learning company

27 May 2016, AWS Summit – Sanoma Big Data Migration

§ 2 Finnish newspapers

§ Over 100 magazines in the Netherlands, Belgium and Finland

§ 7 TV channels in Finland and the Netherlands, incl. on-demand platforms

§ 200+ websites

§ 100 mobile applications on various mobile platforms

§ 30+ learning applications

Sanoma, Big Data use cases

§ Users of Sanoma’s websites, mobile applications, and online TV products generate large volumes of data
§ We use this data to improve our products for our users and our advertisers
§ Use cases:

Dashboards and reporting
• Reporting on various data sources
• Self-service data access
• Editorial guidance

Product improvements
• A/B testing
• Recommenders
• Search optimizations
• Adaptive learning/tutoring

Advertising optimization
• (Re)targeting
• Attribution
• Auction price optimization

Starting point

[Diagram: everything in the Sanoma data center — sources feed ETL (Jenkins & Jython) and stream processing (Storm, Kafka); compute & storage on Cloudera Distribution Hadoop with Hive and the Hadoop User Environment (HUE); serving via EDW, QlikView reporting, R Studio, and Redis/Druid.io; Amazon S3 sits in the AWS cloud.]

60+     nodes
720 TB  capacity
475 TB  data stored (gross)
180 TB  data stored (net)
150 GB  daily growth
50+     data sources
200+    data processes
3000+   daily Hadoop jobs
200+    dashboards
275+    avg. monthly dashboard users

Challenges

Positives
§ Users have been steadily increasing
§ Demand for:
  – more real-time processing
  – faster data availability
  – higher availability (SLA)
  – quicker availability of new versions
  – specialized hardware (GPUs/SSDs)
  – quicker experiments (fail fast)

Negatives
§ Data center network support end of life
§ Outsourcing of own operations team
§ Version upgrades harder to manage
§ Poor job isolation between test, dev, prod and interactive, ETL, and data science workloads
§ Higher level of security and access control

Amazon Migration Scenarios

§ Three scenarios evaluated for moving

S — EC2 + EBS + S3
§ All services on EC2
§ Only source data on S3

M — S3 + EMR + EC2 + EBS
§ All data on S3
§ EMR for Hadoop
§ EC2 only for utility services not provided by EMR

L — S3 + EMR + EC2 + Redshift + EBS
§ All data on S3
§ EMR for Hadoop
§ Interactive querying workload moved to Redshift instead of Hive

Easier to leverage spot pricing, due to data on S3


Target architecture

[Diagram: sources, ETL (Jenkins & Jython), stream processing (Storm, Kafka), and Redis/Druid.io serving remain in the Sanoma data center; in the AWS cloud, S3 buckets (sources, work, Hive data warehouse) back an Amazon EMR cluster (core instances, task instances, task spot instances) running Hive, HUE, and Zeppelin, with Amazon RDS for metadata; EDW, QlikView reporting, and R Studio consume the results.]

Conclusion & Learnings

§ We’re almost done with the migration; running in parallel now
§ The setup solves our challenges; some still require work
§ Take your time testing and validating your setup
  – If you have time, rethink the whole setup
  – No time? Move as-is first and then start optimizing

§ EMR
  – Check data formats! RC vs. ORC/Parquet
  – Start jobs from the master node
  – Break up long-running jobs into shorter independent ones
  – Spot pricing & pre-empted nodes with Spark
  – HUE and Zeppelin metadata on RDS
  – Research EMR FS behavior for your use case

§ S3
  – Bucket structure impacts performance
  – Set up access control; human error will occur
  – Uploading data takes time. Start early!
  – Check Snowball or new upload service

Design Patterns

Primitive: Multi-Stage Decoupled “Data Bus”

• Multiple stages
• Storage decouples multiple processing stages

Store → Process → Store → Process
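The decoupling can be mimicked in-process: queues stand in for the stream/object stores, and each stage sees only the store before and after it. A toy sketch — `Queue` is a stand-in for Kinesis or S3, not an equivalent:

```python
from queue import Queue

# Two processing stages decoupled by storage:
# collect -> store -> process -> store -> process.
raw, cleaned = Queue(), Queue()

def ingest(records):
    for r in records:
        raw.put(r)                  # collection writes to the first store

def clean():
    while not raw.empty():
        # first processing stage reads store 1, writes store 2
        cleaned.put(raw.get().strip().lower())

def analyze():
    out = []
    while not cleaned.empty():      # second stage reads store 2
        out.append(cleaned.get())
    return sorted(out)

ingest(["  Alpha", "BETA "])
clean()
result = analyze()                  # ['alpha', 'beta']
```

Because stages share no state except the stores between them, each stage can be scaled, replaced, or replayed independently — the property the data-bus primitive is after.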

Primitive: Multiple Stream Processing Applications Can Read from Amazon Kinesis

[Diagram: an Amazon Kinesis stream fans out to AWS Lambda (writing to Amazon DynamoDB) and to the Amazon Kinesis S3 connector (writing to Amazon S3 for Amazon EMR).]

Primitive: Analysis Frameworks Could Read from Multiple Data Stores

[Diagram: Amazon Kinesis feeds AWS Lambda and the Amazon Kinesis S3 connector into Amazon DynamoDB and Amazon S3; Spark Streaming processes the stream, and Spark SQL reads from both stores.]

Data Temperature vs. Processing Speed

[Diagram: processing speed (fast → slow) against data temperature (hot → cold). Fast over hot data: Spark Streaming, Apache Storm, AWS Lambda, and KCL apps reading Amazon Kinesis and Apache Kafka. In between: Amazon Redshift, Hive, Spark, and Presto over Amazon DynamoDB and Amazon S3 data. Slow over cold data: Hive on Amazon EMR. Fast answers come from native apps, KCL apps, and AWS Lambda; slow answers from batch jobs.]

Real-time Analytics

[Diagram: Apache Kafka, Amazon Kinesis, and DynamoDB Streams feed KCL apps, AWS Lambda, Spark Streaming, and Apache Storm; results flow to Amazon SNS for notifications/alerts, Amazon ML for real-time predictions, and to app state and KPIs in Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, and Amazon ES.]

Message / Event Processing

[Diagram: messages/events land on Amazon SQS queues, including a priority queue; Amazon SQS apps in an Auto Scaling group and Amazon SNS subscribers process them, publishing app state and KPIs to Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, and Amazon ES.]

Interactive & Batch Analytics

[Diagram: Amazon Kinesis Firehose delivers the stream into Amazon S3; Amazon EMR (Hive, Pig, Spark) runs batch processing and feeds Amazon ML batch predictions, while Amazon Redshift and Amazon EMR (Presto, Spark) serve interactive queries; consumers read the results.]

Lambda Architecture

[Diagram: an Amazon Kinesis data stream feeds both layers. Batch layer — the Amazon Kinesis S3 connector writes the immutable log to Amazon S3, processed by Amazon EMR (Presto, Hive, Pig, Spark), Amazon Redshift, and Amazon ML for batch answers. Speed layer — KCL apps, AWS Lambda, Storm, or Spark Streaming on Amazon EMR compute fast answers. Serving layer — Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, and Amazon ES deliver answers to applications.]
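A minimal sketch of the layer responsibilities, assuming a simple per-user sum as the query: the batch view is recomputed from the full immutable log, the speed view covers only events not yet absorbed by a batch run, and the serving layer merges the two. Plain Python, illustrative only.

```python
def batch_view(log):
    """Recomputed from the full immutable log on each batch run."""
    view = {}
    for user, amount in log:
        view[user] = view.get(user, 0) + amount
    return view

def speed_view(recent):
    """Incremental totals for events the batch layer hasn't absorbed yet."""
    return batch_view(recent)   # same aggregation, over recent events only

def serve(batch, speed, user):
    """Serving layer: merge the slow-but-complete and fast-but-partial views."""
    return batch.get(user, 0) + speed.get(user, 0)

log = [("ann", 10), ("bob", 5), ("ann", 7)]   # already batch-processed
recent = [("ann", 3)]                          # still in the speed layer
answer = serve(batch_view(log), speed_view(recent), "ann")   # 20
```

The append-only log is what makes this safe: because batch results are always recomputable from scratch, a bug in the speed layer can be corrected simply by waiting for the next batch run.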

Summary

Decoupled “data bus”• Data → Store → Process → Store → Analyze → Answers

Use the right tool for the job• Data structure, latency, throughput, access patterns

Use Lambda architecture ideas• Immutable (append-only) log, batch/speed/serving layer

Leverage AWS managed services• Scalable/elastic, available, reliable, secure, no/low admin

Big data ≠ big cost

Thank you!
aws.amazon.com/big-data
