AWS Webcast: Managing Big Data in the AWS Cloud (2014-09-24)
DESCRIPTION
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on the dimensions of your data source (structured or unstructured data, volume, item size, and transfer rates) and application considerations such as latency, cost, and durability. It will also share customer success stories and resources to help you get started.
TRANSCRIPT
Managing Big Data in the AWS Cloud
Siva Raghupathy
Principal Solutions Architect
Amazon Web Services
Agenda
• Big data challenges
• AWS big data portfolio
• Architectural considerations
• Customer success stories
• Resources to help you get started
• Q&A
Data Volume, Velocity, & Variety
• 4.4 zettabytes (ZB) of data exists in the digital universe today
– 1 ZB = 1 billion terabytes
• 450 billion transactions per day by 2020
• More unstructured data than structured data
(Chart: data volume growing from GB through TB, PB, and EB to ZB, 1990–2020)
Big Data
• Hourly server logs: how your systems were misbehaving an hour ago
• Weekly/monthly bill: what you spent this past billing cycle
• Daily customer-preferences report from your web site’s click stream: tells you what deal or ad to try next time
• Daily fraud reports: tells you if there was fraud yesterday
Real-time Big Data
• Real-time metrics: what just went wrong now
• Real-time spending alerts/caps: guaranteeing you can’t overspend
• Real-time analysis: tells you what to offer the current customer now
• Real-time detection: blocks fraudulent use now
Big Data: Best Served Fresh
Data Analysis Gap
(Chart: generated data vs. data available for analysis, 1990–2020; the gap between the two keeps widening)
Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
Big Data
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Time to results is key
• Hard to configure/manage

AWS Cloud
• Massive, virtually unlimited capacity
• Iterative, experimental style of infrastructure deployment/usage
• At its most efficient with highly variable workloads
• Parallel compute clusters from a single data source
• Managed services
AWS Big Data Portfolio
• Collect / Ingest: Amazon Kinesis, Amazon SQS, AWS Import/Export, AWS Direct Connect
• Store: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS
• Process / Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, AWS Data Pipeline
• Visualize / Report: partner BI and visualization tools
Ingest: The act of collecting and storing data
Why Data Ingest Tools?
• Data ingest tools convert many random streams of data into a smaller set of sequential streams
– Sequential streams are easier to process
– Easier to scale
– Easier to persist
Data Ingest Tools
• Facebook Scribe – data collector
• Apache Kafka – data collector
• Apache Flume – data movement and transformation
• Amazon Kinesis – data collector
Amazon Kinesis
• Real-time processing of streaming data
• High throughput
• Elastic
• Easy to use
• Connectors for EMR, S3, Redshift, DynamoDB
Amazon Kinesis Architecture
Kinesis Stream: Managed ability to capture and store data
• Streams are made of shards
• Each shard ingests up to 1 MB/sec and up to 1,000 TPS
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing shards
• Replay data inside the 24-hour window
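These per-shard limits make capacity planning simple arithmetic. A minimal sketch (Python; the limits are the ones quoted above, the workload figures are hypothetical):

    # Estimate the number of Kinesis shards needed for a workload, using the
    # per-shard limits above: 1 MB/s in, 1,000 records/s in, 2 MB/s out.
    import math

    def shards_needed(in_mb_per_sec, records_per_sec, out_mb_per_sec):
        return max(
            math.ceil(in_mb_per_sec / 1.0),      # ingest bandwidth limit
            math.ceil(records_per_sec / 1000.0),  # ingest record-rate limit
            math.ceil(out_mb_per_sec / 2.0),      # egress bandwidth limit
        )

    # Hypothetical workload: 5 MB/s in, 12,000 records/s, 8 MB/s out -> 12 shards
    print(shards_needed(5, 12000, 8))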
Simple PUT interface to store data in Kinesis
• Producers use a PUT call to store data in a stream: PutRecord {Data, PartitionKey, StreamName}
• A partition key is supplied by the producer and used to distribute PUTs across shards
• Kinesis MD5-hashes the supplied partition key to map each record to a shard’s hash key range
• A unique sequence number is returned to the producer upon a successful PUT call
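A minimal producer sketch using boto3, the AWS SDK for Python (the stream name and payload are hypothetical; any AWS SDK exposes the same PutRecord call):

    # Put one record into a Kinesis stream; names are hypothetical.
    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    record = {"user": "abc", "action": "click"}
    resp = kinesis.put_record(
        StreamName="example-stream",
        Data=json.dumps(record).encode("utf-8"),  # up to 1 MB per record
        PartitionKey="abc",                       # MD5-hashed to pick a shard
    )
    # A unique sequence number is returned on a successful PUT
    print(resp["ShardId"], resp["SequenceNumber"])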
Building Kinesis Processing Apps: Kinesis Client Library
Client library for fault-tolerant, at-least-once, continuous processing
• Java client library; source available on GitHub
• Build and deploy your app with the KCL on your EC2 instance(s)
• The KCL is the intermediary between your application and the stream
– Automatically starts a Kinesis worker for each shard
– Simplifies reading by abstracting individual shards
– Increases/decreases workers as the number of shards changes
– Checkpoints to keep track of a worker’s location in the stream; restarts workers if they fail
• Integrates with Auto Scaling groups to redistribute workers to new instances
Sending & Reading Data from Kinesis Streams
Sending (write):
• HTTP POST
• AWS SDK
• LOG4J
• Flume
• Fluentd
Reading (read):
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
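For readers not using the KCL, the raw Get* APIs look like this. A sketch with boto3 (stream name hypothetical); the KCL automates this per-shard loop plus checkpointing and worker rebalancing:

    # Read records from the oldest end of one shard's 24-hour window.
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    stream = "example-stream"  # hypothetical

    shard_id = kinesis.describe_stream(StreamName=stream)[
        "StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=stream,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest stored record
    )["ShardIterator"]

    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for rec in out["Records"]:
        print(rec["SequenceNumber"], rec["Data"])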
AWS Partners for Data Load and Transformation
• HParser, Big Data Edition
• Flume, Sqoop
Storage
(Chart: services positioned by data structure complexity vs. query structure complexity)
• Structured, simple query – NoSQL: Amazon DynamoDB; cache: Amazon ElastiCache (Memcached, Redis)
• Structured, complex query – SQL: Amazon RDS; data warehouse: Amazon Redshift; search: Amazon CloudSearch
• Unstructured, no query – cloud storage: Amazon S3, Amazon Glacier
• Unstructured, custom query – Hadoop/HDFS: Amazon Elastic MapReduce
Amazon S3
• Store anything
• Object storage
• Scalable
• Designed for 99.999999999% durability

Why is Amazon S3 good for Big Data?
• No limit on the number of objects
• Object size up to 5 TB
• Central data storage for all systems
• High bandwidth
• 99.999999999% durability
• Versioning, lifecycle policies
• Glacier integration
Amazon S3 Best Practices
• Use a random hash prefix for keys
• Ensure a random access pattern
• Use Amazon CloudFront for high-throughput GETs and PUTs
• Leverage the high durability, high throughput design of Amazon S3 for backup and as a common storage sink
– Durable sink between data services
– Supports de-coupling and asynchronous delivery
• Consider RRS for lower-cost, lower-durability storage of derivatives or copies
• Consider parallel threads and multipart upload for faster writes
• Consider parallel threads and range GET for faster reads
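A short sketch of two of these practices, the random hash prefix and multipart upload, using boto3 (bucket and file names are hypothetical):

    # Prefixing keys with a short hash spreads objects across S3's key space
    # for a flatter, more random access pattern. Names are hypothetical.
    import hashlib
    import boto3

    s3 = boto3.client("s3")

    def hashed_key(natural_key):
        prefix = hashlib.md5(natural_key.encode()).hexdigest()[:4]
        return "%s/%s" % (prefix, natural_key)

    # e.g. "logs/2014/09/24/host1.gz" -> "3f2a/logs/2014/09/24/host1.gz"
    key = hashed_key("logs/2014/09/24/host1.gz")

    # upload_file splits large objects into parallel multipart uploads for you
    s3.upload_file("host1.gz", "example-bucket", key)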
Aggregate All Data in S3, Surrounded by a Collection of the Right Tools
• EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline
• Spark Streaming, Cassandra, Storm
Amazon DynamoDB
• Fully-managed NoSQL database service
• Built on solid-state drives (SSDs)
• Consistent low-latency performance
• Any throughput rate
• No storage limits

DynamoDB Concepts
• Tables contain items; items are made of attributes
• Schema-less: schema is defined per attribute
DynamoDB: Access and Query Model
• Two primary key options
– Hash key: key lookups: "Give me the status for user abc"
– Composite key (hash with range): "Give me all the status updates for user abc that occurred within the past 24 hours"
• Support for multiple data types
– String, number, binary… or sets of strings, numbers, or binaries
• Supports both strong and eventual consistency
– Choose your consistency level when you make the API call
– Different parts of your app can make different choices
• Global secondary indexes
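The composite-key query from the slide, sketched with boto3 (table and attribute names are hypothetical):

    # Fetch all status updates for user "abc" from the past 24 hours.
    import time
    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("status-updates")

    day_ago = int(time.time()) - 24 * 3600
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq("abc")
                               & Key("timestamp").gte(day_ago),
        ConsistentRead=True,  # choose strong consistency per API call
    )
    for item in resp["Items"]:
        print(item)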
DynamoDB: High Availability and Durability
• Regional service
• Synchronous replication to three Availability Zones
• Writes acknowledged only when they are on disk in at least two Availability Zones
What does DynamoDB handle for me?
• Scaling without downtime
• Automatic sharding
• Security inspections, patches, upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration designed specifically for DynamoDB
• Performance tuning
…and a lot more
Amazon DynamoDB Best Practices
• Keep item size small
• Store metadata in Amazon DynamoDB and blobs in Amazon S3
• Use a table with a hash key for extremely high scale
• Use a hash-range key to model
– 1:N relationships
– Multi-tenancy
• Avoid hot keys and hot partitions
• Use a table per day, week, month, etc. for storing time-series data
• Use conditional updates
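A conditional-write sketch with boto3 (table and attribute names are hypothetical): the write succeeds only if no item with the same key exists, so concurrent writers cannot silently overwrite each other:

    # Conditional put: reject the write if the key is already present.
    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("status-updates")

    try:
        table.put_item(
            Item={"user_id": "abc", "timestamp": 1411574400, "status": "ok"},
            ConditionExpression="attribute_not_exists(user_id)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            print("item already exists; write rejected")
        else:
            raise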
Amazon RDS
• Relational databases
• Fully managed; zero admin
• MySQL, PostgreSQL, Oracle & SQL Server
Process and Analyze
Processing Frameworks
• Batch processing
– Take a large amount (>100 TB) of cold data and ask questions; takes hours to get answers back
– On AWS: Amazon EMR (Hadoop), Amazon Redshift
• Stream processing (real-time)
– Take a small amount of hot data and ask questions; takes a short amount of time to get your answer back
– Frameworks: Spark Streaming, Storm
Amazon Redshift
• Columnar data warehouse
• ANSI SQL compatible
• Massively parallel
• Petabyte scale
• Fully-managed
• Very cost-effective
Amazon Redshift architecture
• Leader node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
• Two hardware platforms
– DW1: HDD; scale from 2 TB to 1.6 PB
– DW2: SSD; scale from 160 GB to 256 TB
(Diagram: SQL clients and BI tools connect via JDBC/ODBC to the leader node, which coordinates the compute nodes (128 GB RAM, 16 TB disk, 16 cores each) over 10 GigE (HPC); ingestion, backup, and restore run against Amazon S3 and DynamoDB.)
Amazon Redshift Best Practices
• Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts
– Split your data into multiple files
– Use GZIP or LZOP compression
– Use a manifest file
• Choose a proper sort key
– Range or equality on the WHERE clause
• Choose a proper distribution key
– Join column, foreign key or largest dimension, GROUP BY column
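A minimal load sketch; Redshift speaks the PostgreSQL wire protocol, so any PostgreSQL driver works. Here psycopg2 runs a COPY with a manifest and GZIP (endpoint, credentials, table, and paths are hypothetical):

    # Load gzipped, pre-split files listed in a manifest from Amazon S3.
    import psycopg2

    conn = psycopg2.connect(
        host="example.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...",
    )
    cur = conn.cursor()
    cur.execute("""
        COPY events
        FROM 's3://example-bucket/events/load.manifest'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        MANIFEST GZIP DELIMITER '|'
    """)
    conn.commit()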
Amazon Elastic MapReduce
• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and spot pricing
• Tight integration with S3, DynamoDB, and Kinesis
How Does EMR Work?
1. Put the data into S3
2. Choose: Hadoop distribution, number of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
• You can easily resize the cluster
• You can launch parallel clusters using the same data in S3
• Use Spot nodes to save time and money
The Hadoop ecosystem works inside EMR
Amazon EMR Best Practices
• Balance transient vs. persistent clusters to get the best TCO
• Leverage Amazon S3 integration
– Consistent view for EMRFS
• Use compression (LZO is a good pick)
• Avoid small files (< 100 MB; s3distcp can help!)
• Size the cluster to suit each job
• Use EC2 Spot Instances
Amazon EMR Nodes and Size
• Tuning cluster size can be more efficient than tuning Hadoop code
• Use the m1 and c1 families for functional testing
• Use m3 and c3 xlarge and larger nodes for production workloads
• Use cc2/c3 for memory- and CPU-intensive jobs
• Use hs1, hi1, and i2 instances for HDFS workloads
• Prefer a smaller cluster of larger nodes
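A sketch of launching a transient cluster with boto3's EMR API (the names, instance counts, and release label are illustrative assumptions, not recommendations):

    # Launch a transient EMR cluster that shuts down when its steps finish.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    resp = emr.run_job_flow(
        Name="nightly-hive-job",
        ReleaseLabel="emr-4.0.0",
        Applications=[{"Name": "Hive"}],
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "c3.xlarge",
            "InstanceCount": 10,
            "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster
        },
        LogUri="s3://example-bucket/emr-logs/",
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(resp["JobFlowId"])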
Partners – Analytics (scientific, algorithmic, predictive, etc.)
Visualize
Partners - BI & Data Visualization
Putting All The AWS Data Tools Together & Architectural Considerations
One tool to rule them all
Data Characteristics: Hot, Warm, Cold

               Hot        | Warm      | Cold
Volume         MB–GB      | GB–TB     | PB
Item size      B–KB       | KB–MB     | KB–TB
Latency        ms         | ms, sec   | min, hrs
Durability     Low–High   | High      | Very High
Request rate   Very High  | High      | Low
Cost/GB        $$–$       | $–¢¢      | ¢
Service     | Avg. latency         | Data volume        | Item size        | Request rate             | Cost ($/GB/month) | Durability
ElastiCache | ms                   | GB                 | B–KB             | Very High                | $$                | Low–Moderate
DynamoDB    | ms                   | GB–TB (no limit)   | B–KB (64 KB max) | Very High                | ¢¢                | Very High
RDS         | ms, sec              | GB–TB (3 TB max)   | KB (~row size)   | High                     | ¢¢                | High
CloudSearch | ms, sec              | GB–TB              | KB (1 MB max)    | High                     | $                 | High
Redshift    | sec, min             | TB–PB (1.6 PB max) | KB (64 K max)    | Low                      | ¢                 | High
EMR (Hive)  | sec, min, hrs        | GB–PB (~nodes)     | KB–MB            | Low                      | ¢                 | High
S3          | ms, sec, min (~size) | GB–PB (no limit)   | KB–GB (5 TB max) | Low–Very High (no limit) | ¢                 | Very High
Glacier     | hrs                  | GB–PB (no limit)   | GB (40 TB max)   | Very Low (no limit)      | ¢                 | Very High
Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”
           | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
Scenario 1 | 300                       | 2,048               | 1,483                 | 777,600,000
Scenario 2 | 300                       | 32,768              | 23,730                | 777,600,000

DynamoDB or S3?
• Scenario 1 (many small objects): use Amazon DynamoDB
• Scenario 2 (larger objects, more storage): use Amazon S3
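The reasoning behind the recommendation is simple cost arithmetic: request-heavy, small-object workloads are dominated by S3's per-request charge, while DynamoDB costs scale with provisioned write throughput and storage. A back-of-the-envelope sketch (the unit prices are illustrative assumptions, not quoted rates; plug in current pricing before relying on the result):

    # Rough monthly cost model for the two scenarios above.
    S3_PUT_PER_1000 = 0.005   # $ per 1,000 PUT requests (assumed)
    S3_STORAGE_GB   = 0.030   # $ per GB-month (assumed)
    DDB_WRITE_HR    = 0.0065  # $ per 10 write units per hour (assumed)
    DDB_STORAGE_GB  = 0.25    # $ per GB-month (assumed)

    def monthly_cost(writes_per_sec, obj_kb, gb_per_month):
        objects = writes_per_sec * 30 * 24 * 3600          # writes per month
        s3 = (objects / 1000.0 * S3_PUT_PER_1000
              + gb_per_month * S3_STORAGE_GB)
        write_units = writes_per_sec * max(1, obj_kb)      # 1 write unit per KB
        ddb = (write_units / 10.0 * DDB_WRITE_HR * 720     # 720 hours/month
               + gb_per_month * DDB_STORAGE_GB)
        return s3, ddb

    print(monthly_cost(300, 2, 1483))    # small objects: DynamoDB is cheaper
    print(monthly_cost(300, 32, 23730))  # larger objects: S3 is cheaper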
Lambda Architecture
Putting it all together: de-coupled architecture
• Multi-tier data processing architecture
• Ingest & store de-coupled from processing
• Ingest tools write to multiple data stores
• Processing frameworks (Hadoop, Spark, etc.) read from data stores
• Consumers can decide which data store to read from depending on their data processing requirements
(Diagram: Kinesis/Kafka ingest into NoSQL/DynamoDB, Hadoop HDFS, and S3. Spark Streaming and Storm answer over hot data at low latency; Spark, Impala, and EMR/Hadoop cover warm data; EMR/Hadoop and Redshift cover cold data. Latency of answers rises as data temperature falls from hot to cold.)
Customer Use Cases
Yelp: autocomplete search, recommendations, automatic spelling corrections

A look at how it works: automatic spelling corrections
• Data analyzed using EMR: months of user history and common misspellings (e.g., Westen, Wistin, Westan, Whestin)
• Months of user search data: search terms, misspellings, final click-throughs
• Yelp web site log data goes into Amazon S3
• Amazon Elastic MapReduce spins up a 200-node Hadoop cluster
• All 200 nodes of the cluster simultaneously look for common misspellings (Westen, Wistin, Westan, …)
• A map of common misspellings and suggested corrections is loaded back into Amazon S3
• Then the cluster is shut down; Yelp only pays for the time they used it
Each of Yelp’s 80 engineers can do this whenever they have a big data problem: Yelp spins up over 250 Hadoop clusters per week in EMR.
Data Innovation Meets Action at Scale at NASDAQ OMX
• NASDAQ’s technology powers more than 70 marketplaces in 50 countries
• NASDAQ’s global platform can handle more than 1 million messages/second at a median speed of sub-55 microseconds
• NASDAQ owns and operates 26 markets, including 3 clearinghouses and 5 central securities depositories
• More than 5,500 structured products are tied to NASDAQ’s global indexes, with a notional value of at least $1 trillion
• NASDAQ powers 1 in 10 of the world’s securities transactions
NASDAQ’s Big Data Challenge
• Archiving market data
– A classic "Big Data" problem
• Power surveillance and business intelligence/analytics
• Minimize cost
– Not only infrastructure, but development/IT labor costs too
• Empower the business for self-service
(Chart: NASDAQ exchange daily peak messages, roughly 100M–600M per day; market data is big data. Charts courtesy of the Financial Information Forum)
NASDAQ’s Legacy Solution
• On-premises MPP DB
– Relatively expensive, finite storage
– Required periodic additional expenses to add more storage
– Ongoing IT (administrative) human costs
• Legacy BI tool
– Requires developer involvement for new data sources, reports, dashboards, etc.
New Solution: Amazon Redshift
• Cost effective
– Redshift is 43% of the cost of the legacy solution, assuming equal storage capacities
– Doesn’t include ongoing IT costs!
• Performance
– Outperforms NASDAQ’s legacy BI/DB solution
– Inserts 550K rows/second on a 2-node 8XL cluster
• Elastic
– NASDAQ can add additional capacity on demand; easy to grow their cluster

New Solution: Pentaho BI/ETL
• Amazon Redshift partner: http://aws.amazon.com/redshift/partners/pentaho/
• Self service
– Tools empower BI users to integrate new data sources and create their own analytics, dashboards, and reports without requiring development involvement
• Cost effective
Net Result
• New solution is cheaper, faster, and offers capabilities that NASDAQ didn’t have before
– Empowers NASDAQ’s business users to explore data like they never could before
– Reduces IT and development as bottlenecks
– Margin improvement (expense reduction and supports business decisions to grow revenue)
NEXT STEPS
AWS is here to help
Solution Architects
Professional Services
Premium Support
AWS Partner Network (APN)
aws.amazon.com/partners/competencies/big-data
Partner with an AWS Big Data expert
Big Data Case Studies
Learn from other AWS customers
aws.amazon.com/solutions/case-studies/big-data
AWS Marketplace
AWS Online Software Store
aws.amazon.com/marketplace
Shop the big data category
AWS Public Data Sets
Free access to big data sets
aws.amazon.com/publicdatasets
AWS Grants Program
AWS in Education
aws.amazon.com/grants
AWS Big Data Test Drives
APN Partner-provided labs
aws.amazon.com/testdrive/bigdata
AWS Training & Events
Webinars, Bootcamps, and Self-Paced Labs
aws.amazon.com/training
aws.amazon.com/events
Big Data on AWS
Course on Big Data
aws.amazon.com/training/course-descriptions/bigdata
reinvent.awsevents.com
aws.amazon.com/big-data
Thank You!