Big Data: Best Practices on AWS

Javier Ros, Solution Architect

June 2016

Agenda

Big Data challenges

Design Patterns on AWS

RavenPack. Big Data for Financial Applications

Shopping cart

Big Data Challenges

Volume

Velocity

Variety

Simplify Big Data Processing

data answers

Time to Answer (Latency)

Throughput

ingest / collect

store

process / analyze

consume / visualize

On-Demand Big Data Analytics

Young Huang. Director, Big Data Analytics.

“We were able to save about 90% over the EC2 on-demand cost”

Clickstream Analysis

Suneel Sajnani. Senior VP of Enterprise Technology

Kinesis and Spark to process more than 30TB per day

Event-driven Extract, Transform, Load (ETL)

Brian Filppu. Director of Business Intelligence

Kinesis, Lambda and EMR for 16 million events per day

Smart Applications

Joe Emison. Founder & Chief Technology Officer

“Amazon Machine Learning democratizes the process of building predictive models. It's easy and fast to use, and has machine-learning best practices encapsulated in the product, which lets us deliver results significantly faster than in the past.”

June 2, 2016

RavenPack

AWS Summit Madrid

Mapping the World’s Big Data for Financial Applications

Jose Luis Cruz ‒ Operations Manager, jlcruz@ravenpack.com

● What is RavenPack?

● Current Use Cases in the Cloud

● What’s Next?

ravenpack.com | info@ravenpack.com | AMERICAS Tel: (646) 277-7339 | EMEA-APAC Tel: +34 952 90 73 90

• RavenPack delivers big data analytics to financial professionals

• 80% of big data is unstructured

• Only 29% of decisions are based on big data.

RavenPack at a Glance

Top hedge funds and investment banks use RavenPack for trading and risk management

RavenPack processes hundreds of thousands of documents each day

We produce machine-readable analytics for each document in real time (<250 ms)

Archive of more than 300 million documents, spanning more than 20 years

RavenPack at a Glance

Classic Model

• 6 Servers, 19 KVM virtual machines

• Limited Storage - Expensive to Upgrade

• Multiple Points of Failure

Use Case: Realtime Classification

[Diagram: Collectors, RT Feed, Snapshots, Classifier]

Cloud Model using AWS

• CloudFormation to model the Stack

• Unlimited, Distributed Storage

• Easy redundancy, failover and backup

Use Case: Realtime Classification

[Diagram: Amazon CloudFormation, Amazon DynamoDB, Amazon S3, Amazon CloudSearch, Amazon Redshift, Amazon Kinesis, RT Feed, Snapshots, Classifiers, Collectors]

Classic Model

• Same Limited Set of Servers, Same RDBMS

• Can affect Realtime System, Backups

• Full archive, 4-6 Classifiers → 6 weeks!

Use Case: History Classification

[Diagram: RDBMS, Files, Classifiers]

Cloud Model using AWS

• Servers on Demand, Distributed Storage

• Independent of Realtime System

• Full archive, 100 Classifiers → from 6 weeks to 3 days!
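A rough back-of-the-envelope check of the figures above, as a sketch only. It assumes 7-day weeks and that both runs processed the same full archive; the efficiency calculation is an illustration, not a number reported in the talk.

```python
# Figures taken from the slides: 6 weeks with 4-6 classifiers,
# 3 days with 100 classifiers.
DAYS_BEFORE = 6 * 7   # assumption: weeks counted as 7 calendar days
DAYS_AFTER = 3

speedup = DAYS_BEFORE / DAYS_AFTER


def scaling_efficiency(old_workers: int, new_workers: int) -> float:
    """Observed speedup divided by the ideal (linear) speedup."""
    ideal_speedup = new_workers / old_workers
    return speedup / ideal_speedup


print(speedup)                                # 14.0
print(round(scaling_efficiency(4, 100), 2))   # 0.56
print(round(scaling_efficiency(6, 100), 2))   # 0.84
```

Under these assumptions the move to 100 classifiers gives a 14x speedup, which is somewhat below linear scaling. That is expected when coordination and storage access add overhead.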

Use Case: History Classification

[Diagram: Amazon CloudFormation, Amazon DynamoDB, Amazon Redshift, two Availability Zones, Classifiers, Coordinator]

• Structured BIG DATA available:

Consensus and estimates

Online purchases

Bank and credit card transactions

Satellite imagery

• Can improve current analytics or create new ones

• Challenges

Amount of data available

Mapping all those different datasets

• Solution: Kinesis + Redshift + EMR

Future: Incorporating Structured Data

[Diagram: Amazon Redshift, Amazon Kinesis]

Download a Custom “Slice” of Analytics Data

• Provide a Web-API and Web Service

• Let client specify parameters

Data Set and Time Range

Entities and Events

Filters

• Leverage Amazon Redshift and S3

• Compression and Multiple Output Formats
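The parameters a client might send for such a "slice" can be sketched as a small request object. Everything here is hypothetical: the field names, the allowed formats, and the S3 key layout are assumptions for illustration and are not taken from RavenPack's actual API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple

# Hypothetical set of output formats; the slides only say "multiple".
ALLOWED_FORMATS = ("csv", "json")


@dataclass(frozen=True)
class SliceRequest:
    """Client-specified parameters for one custom slice (illustrative only)."""
    dataset: str
    start: date
    end: date
    entities: Tuple[str, ...] = ()
    events: Tuple[str, ...] = ()
    output_format: str = "csv"
    compress: bool = True

    def validate(self) -> None:
        # Reject anything that is not a plain identifier before it gets
        # near a query or an object key.
        if not self.dataset.isidentifier():
            raise ValueError("dataset must be a simple identifier")
        if self.start > self.end:
            raise ValueError("start must not be after end")
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError("unsupported output format")


def output_key(request: SliceRequest) -> str:
    """Build a (hypothetical) S3 object key for the generated slice."""
    request.validate()
    suffix = ".gz" if request.compress else ""
    return (
        f"slices/{request.dataset}/"
        f"{request.start:%Y%m%d}-{request.end:%Y%m%d}."
        f"{request.output_format}{suffix}"
    )
```

Validating the request before touching Redshift or S3 keeps malformed or hostile input out of queries and object names.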

Future: Self-Service Data

[Diagram: Amazon Redshift, Amazon API Gateway, Amazon EC2, AWS Lambda]

• Let Clients upload Proprietary Content to the Amazon Virtual Private Cloud (VPC)

Internal documents / research

Email, Instant Messaging

CRM, bug tracking system

Client support call transcriptions

• Provision Computing and Storage Resources on a Per Project Basis

• View Private Analytics in Isolation or Alongside Standard RavenPack Analytic DataSets

• Everything Goes Away when Project Completes

Future: The RavenPack Cloud

[Diagram: Amazon DynamoDB, Amazon Redshift, Amazon CloudFormation, Amazon CloudSearch]

RavenPack International

Thanks for listening!

ravenpack.com | info@ravenpack.com | AMERICAS Tel: (646) 277-7339 | EMEA-APAC Tel: +44 (0) 782 783 8282

Jose Luis Cruz: jlcruz@ravenpack.com

Shopping Cart

http://amzn.to/BigDataSummit

Shopping cart

Business Metrics

Time to buy

Time to cancel

Number of sales

Sales per country

Architecture

[Diagram: client, mobile client, API Server, Cart event, Amazon Kinesis, Amazon S3, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight]

Customer events

{
  "type": "productAdded",
  "timestamp": 1462465948,
  "customer": 5,
  "cart": 203438,
  "product": 937293
}

{
  "type": "productRemoved",
  "timestamp": 1462465948,
  "customer": 5,
  "cart": 203438,
  "product": 937293
}

{
  "type": "cartBuy",
  "timestamp": 1462465948,
  "customer": 5,
  "cart": 203438,
  "productlist": [34, 253]
}

{
  "type": "cartDiscard",
  "timestamp": 1462465948,
  "customer": 5,
  "cart": 203438,
  "productlist": [2353, 1355, 1234]
}

Amazon Kinesis Firehose
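A minimal sketch of how an API server might publish these cart events to a Firehose delivery stream. The helper names and the stream name `cart-events` are hypothetical; the only AWS call used is the boto3 Firehose client's `put_record`. Each event is written as one JSON line because Firehose concatenates records as-is when delivering to S3.

```python
import json
import time


def make_event(event_type, customer, cart, **extra):
    """Build a cart event with the fields shown on the slide."""
    event = {
        "type": event_type,
        "timestamp": int(time.time()),  # epoch seconds, as in the samples
        "customer": customer,
        "cart": cart,
    }
    event.update(extra)  # e.g. product=... or productlist=[...]
    return event


def encode_event(event):
    """Serialize one event as a newline-terminated JSON line."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def send_event(firehose_client, stream_name, event):
    """Send one event; firehose_client is expected to be boto3.client('firehose')."""
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_event(event)},
    )
```

With real credentials this would be called as `send_event(boto3.client('firehose'), 'cart-events', make_event('productAdded', 5, 203438, product=937293))`. Passing the client in as a parameter keeps the helpers testable without network access.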

Architecture

[Diagram: client, mobile client, API Server, Cart event, Amazon Kinesis, Amazon S3, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight, AWS Data Pipeline]

AWS Data Pipeline

Amazon Elastic MapReduce (EMR)

Pig script

-- Load one day of raw cart events (JSON) from S3.
DATA = LOAD 's3://shoppingcart-summit/streams/$inputdate/*'
    USING JsonLoader('type:chararray,timestamp:int,customer:int,cart:long,product:chararray,productlist:chararray');

-- Keep only records that have an event type.
DATA2 = FILTER DATA BY type is not null;

-- Build one summary row per cart.
CARTS = GROUP DATA2 BY cart;

CARTDATA = FOREACH CARTS {
    LOGIN = FILTER DATA2 BY type == 'login';
    ADDED = FILTER DATA2 BY type == 'productAdded';
    REMOVED = FILTER DATA2 BY type == 'productRemoved';
    BUY = FILTER DATA2 BY type == 'cartBuy';
    -- Note: IsEmpty(BUY) is true when the cart has no cartBuy event.
    GENERATE MAX(DATA2.customer) AS customer, group AS cart,
        MAX(DATA2.timestamp)-MIN(DATA2.timestamp) AS duration, IsEmpty(BUY) AS buy,
        COUNT_STAR(ADDED) AS added, COUNT_STAR(REMOVED) AS removed,
        MAX(DATA2.timestamp)-MAX(ADDED.timestamp) AS thinking,
        MIN(LOGIN.timestamp) AS timestamp, '\"\"';
}

-- Write comma-separated rows to S3 for loading into Redshift.
STORE CARTDATA INTO 's3://shoppingcart-summit/redshift/$inputdate/' USING PigStorage(',');
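For readers less familiar with Pig, the per-cart aggregation in the script above can be sketched in plain Python. This is an illustration of the same idea rather than a port: the function name is hypothetical, the login timestamp column is omitted, and `bought` is true when a `cartBuy` event exists, which is the inverse of the script's `IsEmpty(BUY)` column.

```python
from collections import defaultdict


def summarize_carts(events):
    """Group events by cart and compute one summary row per cart."""
    by_cart = defaultdict(list)
    for event in events:
        if event.get("type") is not None:  # same filter as DATA2
            by_cart[event["cart"]].append(event)

    rows = []
    for cart, cart_events in by_cart.items():
        stamps = [e["timestamp"] for e in cart_events]
        added = [e["timestamp"] for e in cart_events if e["type"] == "productAdded"]
        removed = [e for e in cart_events if e["type"] == "productRemoved"]
        rows.append({
            "customer": max(e["customer"] for e in cart_events),
            "cart": cart,
            # Seconds between the first and last event of the cart.
            "duration": max(stamps) - min(stamps),
            "bought": any(e["type"] == "cartBuy" for e in cart_events),
            "added": len(added),
            "removed": len(removed),
            # Seconds between the last product added and the last event.
            "thinking": max(stamps) - max(added) if added else None,
        })
    return rows
```

Calling `summarize_carts` on a list of event dictionaries shaped like the customer events shown earlier yields one dictionary per cart, which corresponds to one CSV row in the Pig output.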

Amazon QuickSight

Machine learning and smart applications

Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available

Your data + machine learning = smart applications

Introducing Amazon Machine Learning

Easy to use, managed machine learning service built for developers

Robust, powerful machine learning technology based on Amazon’s internal systems

Create models using your data already stored in the AWS cloud

Deploy models to production in seconds

Train model

Evaluate and optimize

Retrieve predictions

Building smart applications with Amazon ML

- Create a Datasource object pointing to the processed shopping cart data

- Explore and understand your data

- Transform data and train your model

Train model

- Understand model quality

- Adjust model interpretation

Explore model quality

Fine-tune model interpretation

Train model

- Batch predictions

- Real-time predictions

Real-time predictions for interactive applications

Your application

Query for predictions with the Amazon ML real-time API

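import boto3

ml = boto3.client('machinelearning')

# Ask the real-time endpoint for a prediction on a single record.
prediction = ml.predict(
    MLModelId='ml-dZxbrDXAstA',
    Record={
        'customer': '4634',
        'cart': '13661535770434',
        # … remaining fields elided on the slide
    },
    PredictEndpoint='https://realtime.machinelearning.…',
)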
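A small sketch of how an application might wrap the real-time call above. Amazon ML expects every value in `Record` to be a string, so the helper converts them. The function names are hypothetical, and the response fields read here (`Prediction`, `predictedLabel`, `predictedScores`) are what the Predict API returns for a binary model as far as I recall; treat both the field names and the binary model type as assumptions to verify against the API reference.

```python
def to_record(features):
    """Convert a feature dict to the string-to-string map Amazon ML expects."""
    return {str(key): str(value) for key, value in features.items() if value is not None}


def predict_purchase(ml_client, model_id, endpoint, features):
    """Call the real-time endpoint and return (label, scores).

    ml_client is expected to be boto3.client('machinelearning').
    """
    response = ml_client.predict(
        MLModelId=model_id,
        Record=to_record(features),
        PredictEndpoint=endpoint,
    )
    prediction = response["Prediction"]
    return prediction.get("predictedLabel"), prediction.get("predictedScores", {})
```

Keeping the client as a parameter lets the application reuse one boto3 client across requests and makes the wrapper easy to exercise with a stub.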

http://amzn.to/BigDataSummit
