Capturing & Processing Real-Time Data on AWS

Post on 09-Dec-2016


© 2015 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Agenda

•  Real-Time Analytics
    •  Data Ingestion
    •  Data Processing
        •  Architecture
        •  AWS Lambda
•  Customer Implementations

Real-Time Analytics

Real-time Ingest
•  Highly scalable
•  Durable
•  Elastic
•  Replayable reads

Continuous Processing FX
•  Load-balancing incoming streams
•  Fault tolerance, checkpoint / replay
•  Elastic
•  Enable multiple apps to process in parallel

Continuous data flow

Low end-to-end latency

Continuous, real-time workloads

Data Ingestion

Global top-10

foo-analysis.com

Starting simple...

[Diagram labels: foo-analysis.com, Elastic Beanstalk, Global top-10]

Distributing the workload…

[Diagram labels: foo-analysis.com, Elastic Beanstalk, Local top-10 (×3), Global top-10]
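The "local top-10, then global top-10" idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the code behind the diagram: the function names, the MD5-based routing, and the worker count are assumptions. Routing each key to exactly one worker is what makes the merged result exact; if the same key could land on several workers, merging local top-10 lists would only be an approximation.

```python
import hashlib
import heapq
from collections import Counter


def worker_for(key, num_workers):
    """Route a key to a worker with a stable hash so each key has exactly one owner."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_workers


def local_top_n(events, n=10):
    """Count one worker's events and return its n most frequent (key, count) pairs."""
    return Counter(events).most_common(n)


def global_top_n(local_results, n=10):
    """Merge the per-worker lists; exact because no key appears on two workers."""
    merged = [pair for result in local_results for pair in result]
    return heapq.nlargest(n, merged, key=lambda pair: pair[1])


def distributed_top_n(events, num_workers=3, n=10):
    """Partition events by key, compute local top-n per worker, then merge."""
    buckets = [[] for _ in range(num_workers)]
    for key in events:
        buckets[worker_for(key, num_workers)].append(key)
    return global_top_n([local_top_n(bucket, n) for bucket in buckets], n)
```

In a real deployment each bucket would be a separate process or instance; here they are plain lists so the sketch stays self-contained.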

Or using an Elastic Data Broker…

[Diagram labels: foo-analysis.com, Elastic Beanstalk, Kinesis, Stream, Shard, Data Record, Partition Key, Sequence Number (14, 17, 18, 21, 23), Worker, My top-10, Global top-10]
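How a partition key ends up on a shard can be sketched as follows. Kinesis hashes the partition key with MD5 into a 128-bit integer and delivers the record to the shard whose hash key range contains that value. The equal-sized ranges and the helper names below are assumptions for illustration; a real stream's ranges depend on how its shards were created, split, or merged.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields 128-bit values


def shard_ranges(num_shards):
    """Split the hash key space into contiguous, equal-sized ranges (an assumption)."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def hash_key(partition_key):
    """Map a partition key to its 128-bit hash key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash key outside all shard ranges")
```

Because the mapping is deterministic, all records with the same partition key go to the same shard, which is what lets one worker keep an ordered view of that key.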

Amazon Kinesis – Managed Stream

[Diagram labels: Data Sources (×5); AWS Endpoint; Availability Zone (×3); Shard 1, Shard 2, … Shard N; App. 1 to App. 4: [Data Archive], [Metric Extraction], [Sliding Window Analysis], [Machine Learning]; S3, DynamoDB, Redshift, EMR]

Amazon Kinesis – Common Data Broker

Amazon Kinesis – Distributed Streams

•  From batch to continuous processing
•  Scale shards elastically up or down without losing sequencing
•  Workers can replay records for up to 24 hours
•  Scale up to GB/sec without losing durability
    •  Records stored across multiple Availability Zones
•  Multiple parallel Kinesis apps output to anything…
    •  RDBMS, S3, in-house data warehouse, messaging, another stream, Java SDK, Python SDK, etc.
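The replay behaviour can be illustrated with a small in-memory model. This is a sketch, not the Kinesis or KCL API: the `Shard` and `Worker` classes are hypothetical, and only the 24-hour retention figure comes from the slide. A worker remembers the last sequence number it processed (its checkpoint), so a restarted worker can resume, or deliberately rewind, as long as the records are still retained.

```python
from dataclasses import dataclass, field

RETENTION_SECONDS = 24 * 60 * 60  # "up to 24 hours", as stated on the slide


@dataclass
class Shard:
    """Hypothetical in-memory shard: an append-only list of sequenced records."""
    records: list = field(default_factory=list)  # (sequence_number, arrival_time, data)
    next_seq: int = 1

    def put(self, data, now):
        seq = self.next_seq
        self.records.append((seq, now, data))
        self.next_seq += 1
        return seq

    def read_after(self, checkpoint, now):
        """Return retained (sequence_number, data) pairs above the checkpoint."""
        return [(seq, data) for seq, arrived, data in self.records
                if seq > checkpoint and now - arrived <= RETENTION_SECONDS]


@dataclass
class Worker:
    """Hypothetical worker that checkpoints after each record it handles."""
    checkpoint: int = 0

    def process(self, shard, now, handler):
        for seq, data in shard.read_after(self.checkpoint, now):
            handler(data)
            self.checkpoint = seq
```

Starting a second `Worker` with an older checkpoint replays everything after it, which is also how several applications can read the same stream independently.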

Data Processing

Batch

Micro Batch

Real Time

Emerging Architecture…

[Diagram labels: Data Streams, Streaming Analytics, Spark, Storm, KCL, Batch Analysis, DW, Hadoop, Deep Learning, Data Archive, Notifications & Alerts, Dashboards/visualizations, APIs]

Real-time: Event-based processing

[Diagram labels: Producer, Amazon Kinesis, Kinesis Storm Spout, Apache Storm, ElastiCache (Redis), Node.js, Client (D3)]

http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache
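The linked post builds its sliding window with Apache Storm; the core idea can be sketched independently of Storm in plain Python. The class below is an assumption-laden illustration (hypothetical names, timestamps assumed to arrive in non-decreasing order): it keeps only the events from the last `window_seconds` and maintains running counts so the current top keys are always available.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count keys seen during the most recent window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key), oldest first
        self.counts = Counter()

    def add(self, timestamp, key):
        """Record an event, then drop everything that fell out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._evict(timestamp)

    def _evict(self, now):
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n=10):
        """Return the n most frequent keys inside the current window."""
        return self.counts.most_common(n)
```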

Micro-Batches: Drip feeding the data

http://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
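"Drip feeding" can be read as grouping a continuous stream into small batches that close on either size or age, and loading each batch as a unit. The generator below is a minimal sketch of that grouping step only; the function name and the default thresholds are assumptions, and the actual load into a warehouse such as Redshift is deliberately left out.

```python
def micro_batches(records, max_records=500, max_seconds=60.0):
    """Group (timestamp, payload) pairs into batches that close on size or age.

    A batch is emitted when it holds max_records payloads, or when a record
    arrives max_seconds or more after the batch was opened.
    """
    batch, opened_at = [], None
    for timestamp, payload in records:
        if batch and timestamp - opened_at >= max_seconds:
            yield batch
            batch, opened_at = [], None
        if opened_at is None:
            opened_at = timestamp
        batch.append(payload)
        if len(batch) >= max_records:
            yield batch
            batch, opened_at = [], None
    if batch:
        yield batch  # flush whatever is left when the input ends
```

Smaller thresholds lower latency at the price of more load operations; larger ones do the opposite.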

 

Offline Analysis

Ad-hoc Analysis

 

Offline Batch: Hadoop for discovery

[Diagram labels: Producer, Amazon Kinesis, Kinesis Application, S3, EMR (Hive, Pig, Cascading, MapReduce)]

http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache

Batch

Micro Batch

Real Time

Putting it together…

[Diagram labels: Producer, Amazon Kinesis, KCL, Apache Storm, App, Client, S3, EMR, DynamoDB, Redshift, BI Tools]

AWS Lambda

An event-driven computing service for dynamic applications

“AWS Lambda functions can be triggered by data stream updates from Amazon Kinesis and Amazon DynamoDB. For instance, you can watch for a pattern, such as an address, and trigger an alert.”

A focus on functions, data and events

Cloud functions

S3 event notifications

DynamoDB Streams

Kinesis events

Custom events

Stream processing

Data triggers

Server-free back-end

IoT

Indexing & synchronization

Putting AWS Lambda to work

[Diagram labels: Photo bucket (S3), Extract Metadata Cloud Function, Metadata (DynamoDB), Trending Cloud Function, Trending (DynamoDB), Notify Cloud Function, SNS Push notification]

AWS Lambda for reactive computing

Processing Events from Kinesis

http://docs.aws.amazon.com/lambda/latest/dg/walkthrough-kinesis-events-adminuser.html
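A Lambda function attached to a Kinesis stream receives batches of records whose payloads are base64-encoded under `Records[i].kinesis.data`. The handler below is a minimal Python sketch of the "watch for a pattern and raise an alert" idea; it is not the code from the linked walkthrough. The `PATTERN` field name and the shape of the returned alerts are assumptions, and a real function would publish the alerts somewhere (for example to SNS) rather than return them.

```python
import base64
import json

PATTERN = "address"  # hypothetical field to watch for in each JSON payload


def handler(event, context):
    """Decode Kinesis records from the Lambda event and collect those matching PATTERN."""
    alerts = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        try:
            payload = json.loads(raw)
        except ValueError:
            continue  # skip records that are not valid JSON
        if isinstance(payload, dict) and PATTERN in payload:
            alerts.append({
                "sequenceNumber": record["kinesis"]["sequenceNumber"],
                "partitionKey": record["kinesis"]["partitionKey"],
                "value": payload[PATTERN],
            })
    return alerts
```

The handler can be exercised locally by building an event dictionary with base64-encoded JSON in each record's `data` field and calling `handler(event, None)`.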

Write millions of events from Kinesis into Elasticsearch with only 60 lines of code! https://gist.github.com/tylr/e8baf45c07ced23ef013

Customer deployments on AWS

GREE International – re:Invent 2014

•  GAM301 - Real-Time Game Analytics with Amazon Kinesis, Redshift, and DynamoDB

•  Session: https://www.youtube.com/watch?v=ElpWlj6yi44

•  Slides: http://www.slideshare.net/AmazonWebServices/gam301-realtime-game-analytics-with-amazon-kinesis-amazon-redshift-and-amazon-dynamodb-aws-reinvent-2014

Key Requirements for Analytics

Initial Requirements

•  Data collection & streaming to database
•  Zero data loss
•  Zero data corruption
•  Guaranteed data delivery

New Requirements

•  Near real-time data latency
•  Real-time ad-hoc analysis
•  Ease of adding consumers
•  Managed service

Data Collection

Source of Data
•  Mobile devices
•  Game servers
•  Ad networks

Data Sizes
•  Size of event ~ 1 KB
•  500M+ events/day
•  500 GB+/day & growing
•  JSON format
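The figures on this slide are consistent with each other, and they allow a rough sizing estimate. The sketch below reproduces the arithmetic; the per-shard write limits (1 MB/s and 1,000 records/s) are the documented Kinesis limits, while treating the load as perfectly even over the day is an assumption, so the shard count is a floor for average load rather than a recommendation for peak traffic.

```python
import math

EVENTS_PER_DAY = 500_000_000          # "500M+ events/day"
EVENT_SIZE_BYTES = 1_000              # "~ 1 KB" per event
SECONDS_PER_DAY = 86_400
SHARD_WRITE_BYTES_PER_SEC = 1_000_000   # per-shard write limit: 1 MB/s
SHARD_WRITE_RECORDS_PER_SEC = 1_000     # per-shard write limit: 1,000 records/s


def daily_volume_gb():
    """Total ingested data per day in gigabytes."""
    return EVENTS_PER_DAY * EVENT_SIZE_BYTES / 1e9


def average_events_per_second():
    """Average event rate, assuming traffic is spread evenly over the day."""
    return EVENTS_PER_DAY / SECONDS_PER_DAY


def min_shards_for_average_load():
    """Smallest shard count that satisfies both write limits at the average rate."""
    rate = average_events_per_second()
    by_bytes = math.ceil(rate * EVENT_SIZE_BYTES / SHARD_WRITE_BYTES_PER_SEC)
    by_records = math.ceil(rate / SHARD_WRITE_RECORDS_PER_SEC)
    return max(by_bytes, by_records)
```

With these inputs the daily volume works out to 500 GB, matching the slide, at an average of roughly 5,800 events per second.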

Architecture

SocialMetrix – re:Invent 2014

•  ARC202: Real-World Real-Time Analytics

•  Session: https://www.youtube.com/watch?v=NIa33ZwFa8E

•  Slides: http://www.slideshare.net/zer0/arc202-arc202-real-world-real-time-analytics20141109mhfinaledit

Drivers for architecture evolution

•  More customers, bigger customers

•  Add new features

•  Keep costs under control

Requirements at 4th iteration

•  Monitor millions of social media profiles

•  Make data accessible (exploration, PoC)

•  Improve UI response times

•  Testing our data pipelines

•  Reprocessing (faster)

Architecture

Cost over Architecture…

[Chart: Costs and Active Customers across architecture iterations #1 to #4]

THANK YOU !!! http://aws.amazon.com/big-data
