2016 utah cloud summit: big data architectural patterns and best practices on aws

Martin Schade, Solutions Architect, Amazon Web Services

January 2016

Big Data Architectural Patterns and Best Practices

on AWS

What to Expect from the Session

Big data challengesHow to simplify big data processingWhat technologies should you use?

• Why?• How?

Reference architectureDesign patterns

Ever Increasing Big Data

Volume

Velocity

Variety

Big Data Evolution

Batch

Report

Real-time

Alerts

Prediction

Forecast

Plethora of Tools

Amazon Glacier

S3 DynamoDB

RDS

EMR

Amazon Redshift

Data PipelineAmazon Kinesis Cassandra

CloudSearch

Kinesis-enabled

app

Lambda ML

SQS

ElastiCache

DynamoDBStreams

Is there a reference architecture ?What tools should I use ?

How ? Why ?

Architectural Principles

• Decoupled “data bus”• Data → Store → Process → Answers

• Use the right tool for the job• Data structure, latency, throughput, access patterns, cost

• Use Lambda architecture ideas• Immutable (append-only) log, batch/speed/serving layer

• Leverage AWS managed services• No/low admin

• Big data ≠ big cost

Simplify Big Data Processing

ingest /collect

store process /analyze

consume / visualize

data answers

Time to Answer (Latency)Throughput

Cost

Collect /Ingest

Types of Data

• Transactional• Database reads & writes (OLTP)• Cache

• Search• Logs• Streams

• File• Log files (/var/log)• Log collectors & frameworks

• Stream• Log records• Sensors & IoT data

Database

FileStorage

StreamStorage

A

iOS Android

Web Apps

Logstash

Logg

ing

IoT

Appl

icat

ions

Transactional Data

File Data

Stream Data

Mobile Apps

Search Data

Search

Collect StoreLo

ggin

gIo

T

Stream Storage

A

iOS Android

Web Apps

Logstash

Amazon RDS

Amazon DynamoDB

AmazonES

AmazonS3

ApacheKafka

AmazonGlacier

AmazonKinesis

AmazonDynamoDB

AmazonElastiCache

Sear

ch

SQL

NoS

QL

Ca

che

Stre

am

Stor

age

File

St

orag

e

Transactional Data

File Data

Stream Data

Mobile Apps

Search Data

Database

FileStorage

Search

Collect StoreLo

ggin

gIo

TAp

plic

atio

ns

Which stream storage should I use?AmazonKinesis

DynamoDB Streams

Amazon SQSAmazon SNS

Kafka

Managed Yes Yes Yes No

Ordering Yes Yes No Yes

Delivery at-least-once exactly-once at-least-once at-least-once

Lifetime 7 days 24 hours 14 days Configurable

Replication 3 AZ 3 AZ 3 AZ Configurable

Throughput No Limit No Limit No Limit ~ Nodes

Parallel Clients Yes Yes No (SQS) Yes

MapReduce Yes Yes No Yes

Record size 1MB 400KB 256KB Configurable

Cost Low Higher(table cost) Low-Medium Low (+admin)

FileStorage

A

iOS Android

Web Apps

Logstash

Amazon RDS

Amazon DynamoDB

AmazonES

AmazonS3

ApacheKafka

AmazonGlacier

AmazonKinesis

AmazonDynamoDB

AmazonElastiCache

Sear

ch

SQL

NoS

QL

Ca

che

Stre

am

Stor

age

File

St

orag

e

Transactional Data

File Data

Stream Data

Mobile Apps

Search Data

Database

Search

Collect StoreLo

ggin

gIo

TAp

plic

atio

ns

Why Is Amazon S3 Good for Big Data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • No need to run compute clusters for storage (unlike HDFS)• Can run transient Hadoop clusters & Amazon EC2 Spot instances• Multiple distinct (Spark, Hive, Presto) clusters can use the same data• Unlimited number of objects • Very high bandwidth – no aggregate throughput limit• Highly available – can tolerate AZ failure• Designed for 99.999999999% durability• Tired-storage (Standard, IA, Amazon Glacier) via life-cycle policy• Secure – SSL, client/server-side encryption at rest• Low cost

What about HDFS & Amazon Glacier?

• Use HDFS for very frequently accessed (hot) data

• Use Amazon S3 Standard for frequently accessed data

• Use Amazon S3 Standard – IA for infrequently accessed data

• Use Amazon Glacier for archiving cold data

Database + Search

Tier

A

iOS Android

Web Apps

Logstash

Amazon RDS

Amazon DynamoDB

AmazonES

AmazonS3

ApacheKafka

AmazonGlacier

AmazonKinesis

AmazonDynamoDB

AmazonElastiCache

Sear

ch

SQL

NoS

QL

Ca

che

Stre

am

Stor

age

File

St

orag

e

Transactional Data

File Data

Stream Data

Mobile Apps

Search Data

Collect Store

Database + Search Tier Anti-pattern

RDBMS

Database + Search Tier

Applications

Best Practice — Use the Right Tool for the Job

Data TierSearchAmazon

Elasticsearch Service

Amazon CloudSearch

CacheRedisMemcached

SQLAmazon AuroraMySQLPostgreSQLOracleSQL Server

NoSQLCassandraAmazon

DynamoDBHBaseMongoDB

Applications

Database + Search Tier

What Data Store Should I Use?

• Data structure → Fixed schema, JSON, key-value

• Access patterns → Store data in the format you will access it

• Data / access characteristics → Hot, warm, cold

• Cost → Right cost

Data Structure and Access PatternsAccess Patterns What to use?Put/Get (Key, Value) Cache, NoSQL

Simple relationships → 1:N, M:N NoSQL

Cross table joins, transaction, SQL SQL

Faceting, Search Search

Data Structure What to use?Fixed schema SQL, NoSQL

Schema-free (JSON) NoSQL, Search

(Key, Value) Cache, NoSQL

Process /Analyze

AnalyzeA

iOS Android

Web Apps

Logstash

Amazon RDS

Amazon DynamoDB

AmazonES

AmazonS3

ApacheKafka

AmazonGlacier

AmazonKinesis

AmazonDynamoDB

Amazon Redshift

Impala

Pig

Amazon ML

Streaming

AmazonKinesis

AWSLambda

Amaz

on E

last

ic

Map

Redu

ce

AmazonElastiCache

Sear

ch

SQL

NoS

QL

Ca

che

Stre

am

Proc

essi

ngBa

tch

Inte

ract

ive

Logg

ing

Stre

am

Stor

age

IoT

Appl

icat

ions

File

St

orag

e

Hot

Cold

WarmHot

Hot

ML

Transactional Data

File Data

Stream Data

Mobile Apps

Search Data

Collect Store Analyze

Process / AnalyzeAnalysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.

Examples• Interactive dashboards → Interactive analytics• Daily/weekly/monthly reports → Batch analytics• Billing/fraud alerts, 1 minute metrics → Real-time analytics• Sentiment analysis, prediction models → Machine learning

Analysis Tools and Frameworks

Machine Learning• Mahout, Spark ML, Amazon ML

Interactive Analytics• Amazon Redshift, Presto, Impala, Spark

Batch Processing• MapReduce, Hive, Pig, Spark

Stream Processing• Micro-batch: Spark Streaming, KCL, Hive, Pig• Real-time: Storm, AWS Lambda, KCL

Amazon Redshift

Impala

Pig

Amazon Machine Learning

Streaming

AmazonKinesis

AWSLambda

Amaz

on E

last

ic

Map

Redu

ce

Stre

am

Proc

essi

ngBa

tch

Inte

ract

ive

ML

Analyze

What Stream Processing Technology Should I Use?Spark Streaming Apache Storm Amazon Kinesis

Client LibraryAWS Lambda Amazon EMR (Hive,

Pig)

Scale / Throughput

~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes

Batch or Real-time

Real-time Real-time Real-time Real-time Batch

Manageability Yes (Amazon EMR) Do it yourself Amazon EC2 + Auto Scaling

AWS managed Yes (Amazon EMR)

Fault Tolerance Single AZ Configurable Multi-AZ Multi-AZ Single AZ

Programming languages

Java, Python, Scala

Any language via Thrift

Java, via MultiLangDaemon ( .Net, Python, Ruby, Node.js)

Node.js, Java Hive, Pig, Streaming languages

Query Latency Low High

(Low is better)

Low Low

What Data Processing Technology Should I Use?AmazonRedshift

Impala Presto Spark Hive

Query Latency

Low Low Low Low Medium (Tez) – High (MapReduce)

Durability High High High High High

Data Volume 2 PB Max

~Nodes ~Nodes ~Nodes ~Nodes

Managed Yes Yes (EMR) Yes (EMR) Yes (EMR) Yes (EMR)

Storage Native HDFS / S3A* HDFS / S3 HDFS / S3 HDFS / S3

SQL Compatibility

High Medium High Low (SparkSQL) Medium (HQL)

Query Latency Low High

(Low is better)

Low Low Medium

What about ETL?

Store Analyze

https://aws.amazon.com/big-data/partner-solutions/

ETL

Consume / Visualize

Collect Store Analyze Consume

A

iOS Android

Web Apps

Logstash

Amazon RDS

Amazon DynamoDB

AmazonES

AmazonS3

ApacheKafka

AmazonGlacier

AmazonKinesis

AmazonDynamoDB

Amazon Redshift

Impala

Pig

Amazon ML

Streaming

AmazonKinesis

AWSLambda

Amaz

on E

last

ic

Map

Redu

ce

AmazonElastiCache

Sear

ch

SQL

NoS

QL

Ca

che

Stre

am

Proc

essi

ngBa

tch

Inte

ract

ive

Logg

ing

Stre

am

Stor

age

IoT

Appl

icat

ions

File

St

orag

e

Anal

ysis

& V

isua

lizat

ion

Hot

Cold

WarmHot

Slow

Hot

ML

Fast

Fast

Transactional Data

File Data

Stream Data

Not

eboo

ks

Predictions

Apps & APIs

Mobile Apps

IDE

Search Data

ETL

Amazon QuickSight

Consume

• Predictions

• Analysis and Visualization

• Notebooks • IDE

• Applications & API

Consume

Anal

ysis

& V

isua

lizat

ion Amazon

QuickSight

Not

eboo

ks

Predictions

Apps & APIs

IDE

Store Analyze ConsumeETL

Business users

Data Scientist, Developers

Putting It All Together

Collect Store Analyze Consume

A

iOS Android

Web Apps

Logstash

Amazon RDS

Amazon DynamoDB

AmazonES

AmazonS3

ApacheKafka

AmazonGlacier

AmazonKinesis

AmazonDynamoDB

Amazon Redshift

Impala

Pig

Amazon ML

Streaming

AmazonKinesis

AWSLambda

Amaz

on E

last

ic

Map

Redu

ce

AmazonElastiCache

Sear

ch

SQL

NoS

QL

Ca

che

Stre

am

Proc

essi

ngBa

tch

Inte

ract

ive

Logg

ing

Stre

am

Stor

age

IoT

Appl

icat

ions

File

St

orag

e

Anal

ysis

& V

isua

lizat

ion

Hot

Cold

Warm

Hot

Slow

Hot

ML

Fast

Fast

Amazon QuickSight

Transactional Data

File Data

Stream Data

Not

eboo

ks

Predictions

Apps & APIs

Mobile Apps

IDE

Search Data

ETL

Reference Architecture

Thank you!Have Fun!

Design Patterns

Multi-Stage Decoupled “Data Bus”

• Multiple stages• Storage decoupled from processing

Store Process Store ProcessData Answers

processstore

Multiple Processing Applications (or Connectors) Can Read from or Write to Multiple Data Stores

Amazon Kinesis

AWS LambdaData Amazon

DynamoDB

Amazon Kinesis S3Connector

Amazon S3

processstore

Processing Frameworks (KCL, Storm, Hive, Spark, etc.) Could Read from Multiple Data Stores

Amazon Kinesis

AWS Lambda

Amazon S3Data Amazon

DynamoDB

Hive Spark

Answers

Storm

Answers

Amazon Kinesis S3Connector

processstore

Real-time Analytics

Producer ApacheKafka

KCL

AWS Lambda

SparkStreaming

Apache Storm

Amazon SNS

AmazonML

Notifications

AmazonElastiCache

(Redis)

AmazonDynamoDB

AmazonRDS

AmazonES

Alert

App state

Real-time Prediction

KPI

processstore

DynamoDB Streams

Amazon Kinesis

Interactive & Batch Analytics

Producer Amazon S3

Amazon EMR

Hive

Pig

Spark

AmazonML

processstore

Consume

Amazon Redshift

Amazon EMRPresto

Impala

Spark

Batch

Interactive

Batch Prediction

Real-time Prediction

Batch Layer

AmazonKinesis

data

processstore

Lambda Architecture

Amazon Kinesis S3 Connector

Amazon S3

Applications

Amazon Redshift

Amazon EMR

Presto

Hive

Pig

Spark answer

Speed Layer

answer

Serving Layer

AmazonElastiCache

AmazonDynamoDB

AmazonRDS

AmazonES

answer

AmazonML

KCL

AWS Lambda

Spark Streaming

Storm

Summary

• Build decoupled “data bus”• Data → Store ↔ Process → Answers

• Use the right tool for the job• Latency, throughput, access patterns

• Use Lambda architecture ideas• Immutable (append-only) log, batch/speed/serving layer

• Leverage AWS managed services• No/low admin

• Be cost conscious • Big data ≠ big cost

Remember to complete your evaluations!

Thank you!