launching your big data project on aws - amazon s3 · amazon kinesis - stream processing on aws...

© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ganesh Raja Specialist Solutions Architect Data & Analytics Launching your Big Data Project on AWS


Post on 20-May-2020


Page 1

© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ganesh Raja

Specialist Solutions Architect – Data & Analytics

Launching your Big Data Project on AWS

Page 2

Traditionally, Analytics Used to Look Like This

OLTP ERP CRM LOB

Data Warehouse

Business Intelligence

• Relational data

• TBs–PBs scale

• Schema defined prior to data load

• Operational reporting and ad hoc

• Large initial CAPEX + $10K–$50K/TB/Year

Page 3

Data Lakes Extend the Traditional Approach

Data Warehouse

Business Intelligence

OLTP ERP CRM LOB

• Relational and non-relational data

• TBs–EBs scale

• Diverse analytical engines

• Low-cost storage & analytics

Devices Web Sensors Social

Big Data processing, real-time, Machine Learning

Data Lake

Page 4

Data Lakes and Analytics from AWS

Cost-effective

Scalable and durable

Secure

Open and comprehensive

Analytics

Machine Learning

Real-time Data Movement

On-premises Data Movement

Data Lake on AWS

Page 5

Amazon S3

Amazon Glacier

AWS Glue

Store Data in the Format You Want

Open and comprehensive

• Store data in the format you want:

• Text files like CSV

• Columnar formats like Apache Parquet and Apache ORC

• Logstash-style formats like Grok

• JSON (simple, nested), Avro

• And more…

CSV

ORC

Grok

Avro

Parquet

JSON

Page 6

Data Lakes from AWS

Data Lake on AWS

Cost-effective

Scalable and durable

Secure

Open and comprehensive

Analytics

Machine Learning

Real-time Data Movement

On-premises Data Movement

Page 7

AWS Provides Highest Levels of Security

Secure

Compliance

AWS Artifact

Amazon Inspector

AWS CloudHSM

Amazon Cognito

AWS CloudTrail

Security

Amazon GuardDuty

AWS Shield

AWS WAF

Amazon Macie

VPC

Encryption

AWS Certificate Manager

AWS Key Management Service

Encryption at rest

Encryption in transit

Bring your own keys, HSM support

Identity

AWS IAM

AWS SSO

Amazon Cloud Directory

AWS Directory Service

AWS Organizations

Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake

Page 8

Security: Machine Learning-Powered Security

Secure

• Machine learning to discover, classify, and protect data

• Continuously monitors data access for anomalies

• Generates alerts when it detects unauthorized access

• Recognizes PII or intellectual property

Amazon Macie

Page 9

Encryption: Data-at-Rest and in Motion

Secure

• Only cloud that offers three forms of encryption

• Server-side encryption

• Encryption with keys managed by the AWS Key Management Service

• Encryption with keys that customers manage

• Only cloud that encrypts data in transit when replicating across regions

• Data movement services can use the same Key Management Service

• SSL endpoints
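As a rough illustration of the three forms of encryption listed above, the sketch below builds the extra keyword arguments that an S3 `put_object` call in boto3 would take for each one. Mapping the slide's three forms onto the names "sse-s3", "sse-kms", and "sse-c" is an assumption, as are the function name and the example key alias; the slide itself does not name them.

```python
def encryption_args(mode, kms_key_id=None, customer_key=None):
    """Return extra keyword arguments for an S3 put_object call.

    mode is a hypothetical label for one of the three forms on the slide:
      "sse-s3"  - server-side encryption with S3-managed keys
      "sse-kms" - keys managed by AWS Key Management Service
      "sse-c"   - keys that the customer manages and supplies per request
    """
    if mode == "sse-s3":
        return {"ServerSideEncryption": "AES256"}
    if mode == "sse-kms":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:
            # Without an explicit key id, the account's default key is used.
            args["SSEKMSKeyId"] = kms_key_id
        return args
    if mode == "sse-c":
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 32-byte (256-bit) key")
        return {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": customer_key}
    raise ValueError(f"unknown mode: {mode}")
```

The returned dictionary would be unpacked into the upload call, for example `s3.put_object(Bucket=..., Key=..., Body=..., **encryption_args("sse-kms"))`.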

Page 10

Compliance: Virtually Every Regulatory Agency

Global

• CSA: Cloud Security Alliance Controls
• ISO 9001: Global Quality Standard
• ISO 27001: Security Management Controls
• ISO 27017: Cloud Specific Controls
• ISO 27018: Personal Data Protection
• PCI DSS Level 1: Payment Card Standards
• SOC 1: Audit Controls Report
• SOC 2: Security, Availability, & Confidentiality Report
• SOC 3: General Controls Report

United States

• CJIS: Criminal Justice Information Services
• DoD SRG: DoD Data Processing
• FedRAMP: Government Data Standards
• FERPA: Educational Privacy Act
• FIPS: Government Security Standards
• FISMA: Federal Information Security Management
• GxP: Quality Guidelines and Regulations
• FFIEC: Financial Institutions Regulation
• HIPAA: Protected Health Information
• ITAR: International Arms Regulations
• MPAA: Protected Media Content
• NIST: National Institute of Standards and Technology
• SEC Rule 17a-4(f): Financial Data Standards
• VPAT/Section 508: Accountability Standards

Asia Pacific

• FISC [Japan]: Financial Industry Information Systems
• IRAP [Australia]: Australian Security Standards
• K-ISMS [Korea]: Korean Information Security
• MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard
• My Number Act [Japan]: Personal Information Protection

Europe

• C5 [Germany]: Operational Security Attestation
• Cyber Essentials Plus [UK]: Cyber Threat Protection
• G-Cloud [UK]: UK Government Standards
• IT-Grundschutz [Germany]: Baseline Protection Methodology

Page 11

Data Lakes from AWS

Data Lake on AWS

Cost-effective

Scalable and durable

Secure

Open and comprehensive

Analytics

Machine Learning

Real-time Data Movement

On-premises Data Movement

Page 12

Any Scale

Scalable and durable

• S3 has trillions of objects and exabytes of data

• Built to store any amount of data

• Run analytic engines at largest scale by spinning up any amount of compute resources in minutes

• Runs on the world’s largest global cloud infrastructure

Page 13

Unmatched Durability and Availability

Scalable and durable

• Designed to deliver 99.999999999% durability

• Geographic redundancy & automatic replication

• Store data in multiple data centers across 3 AZs in a single region

• Seamlessly replicates data between any two regions
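To give the eleven-nines figure above some scale, here is a small arithmetic sketch. It assumes the figure is read as a per-object, per-year survival probability, so the expected annual loss rate is 1 in 10^11; the function names are illustrative only.

```python
from fractions import Fraction

# 99.999999999% durability read as a per-object, per-year figure:
# the complementary loss probability is exactly 1 in 10**11.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_annual_losses(object_count):
    """Expected number of objects lost per year, on average."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)
```

Under that reading, `years_per_lost_object(10_000_000)` works out to 10,000 years for a store of ten million objects.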

Page 14

Data Lakes from AWS

Data Lake on AWS

Lowest cost

Scalable and durable

Secure

Open and comprehensive

Analytics

Machine Learning

Real-time Data Movement

On-premises Data Movement

Page 15

Tiered Storage to Optimize Price/Performance

Lowest Cost

• Tiered storage to optimize price/performance

• S3 Standard

• S3 Standard—Infrequent Access

• S3 One Zone—Infrequent Access

• Amazon Glacier

• Migrate between tiers based on lifecycle policies

• Store data at $0.023/GB/month with S3

• Store data at $0.004/GB/month with Glacier
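The tiering bullets above can be sketched in two parts: a cost estimate using only the two per-GB prices quoted on the slide, and a lifecycle rule shaped like the dictionaries boto3's `put_bucket_lifecycle_configuration` accepts in its `Rules` list. The prefix, day thresholds, and rule ID scheme are hypothetical, and the prices are the 2018 figures from the slide rather than current ones.

```python
# Per-GB monthly prices quoted on the slide (2018 figures).
PRICE_PER_GB_MONTH = {"STANDARD": 0.023, "GLACIER": 0.004}


def monthly_storage_cost(gb_by_class):
    """Sum the monthly cost for a mapping of storage class to stored GB."""
    return sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())


def lifecycle_rule(prefix, days_to_ia, days_to_glacier):
    """Build one lifecycle rule that moves objects to colder tiers over time."""
    return {
        "ID": f"tier-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
            {"Days": days_to_glacier, "StorageClass": "GLACIER"},
        ],
    }
```

For example, keeping 100 GB in S3 Standard and 900 GB in Glacier comes to about $5.90 a month at these prices, versus about $23 for 1,000 GB kept entirely in S3 Standard.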

S3 Standard

S3 Standard Infrequent Access

S3 One Zone-IA

Glacier

Page 16

Data Lakes, Analytics, and ML Portfolio from AWS

Broadest, deepest set of analytic services

Amazon SageMaker

AWS Deep Learning AMIs

Amazon Rekognition

Amazon Lex

AWS DeepLens

Amazon Comprehend

Amazon Translate

Amazon Transcribe

Amazon Polly

Amazon Athena

Amazon EMR

Amazon Redshift

Amazon Elasticsearch Service

Amazon Kinesis

Amazon QuickSight

Analytics

Machine Learning

AWS Direct Connect

AWS Snowball

AWS Snowmobile

AWS Database Migration Service

AWS Storage Gateway

AWS IoT Core

Amazon Kinesis Data Firehose

Amazon Kinesis Data Streams

Amazon Kinesis Video Streams

Real-time Data Movement

On-premises Data Movement

Data Lake on AWS

Storage | Archival Storage | Data Catalog

Page 17

Data Sources

Files

Logs

Streams

Databases

Page 18

Data Sources - Databases

Amazon S3

Databases

Page 19

Change Data Capture – Database Logs

LOG_FILE_HDR_SIZE

OS_FILE_LOG_BLOCK_SIZE

FORMAT

CHECKSUM

LOG_CHECKPOINT_1

LOG_CHECKPOINT_2

Checkpoint_lsn

Checkpoint_no

Log.buf_size

LOG BLOCK

LOG_BLOCK_HDR_SIZE

Hdr_no

Flush_bit

Data_len

[…]

???

Tx001.log

Page 20

AWS Database Migration Service (DMS) lets you easily and securely migrate and/or replicate your databases and data warehouses to AWS

Database Migration Service

(Also good for ingestion!)

Page 21

DMS – Supported Data Sources

On-Premises or EC2: Oracle *, MS SQL Server *, MySQL (5.5+), MariaDB, PostgreSQL (9.4+), SAP Adaptive Server Enterprise, MongoDB (2.6.x, 3.x+), Db2 LUW

Amazon RDS: Oracle *, MS SQL Server *, MySQL (CDC on 5.6+), MariaDB, PostgreSQL (CDC on 9.4.9+, 9.5.4+), Amazon Aurora – MySQL

On Azure: Azure SQL Database (no CDC)

Page 22

DMS – Deployment

Amazon S3

Availability Zone Availability Zone

VPC subnet VPC subnet

Replication Master

Replication Slave

Page 23

Data Sources - Files

Amazon S3

Files

Page 24

Uploading to Amazon S3

• Amazon S3 supports both a single-part upload and a multi-part upload API

• The single-part upload supports objects up to 5 GB in size

• The multi-part upload supports objects up to 5 TB in size

• The multi-part upload also enables you to maximize your throughput by using parallel threads
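The limits above suggest planning an upload as a list of independent byte ranges. The sketch below splits an object into numbered parts that separate threads could upload in parallel. It reads the slide's GB and TB as binary units, and it assumes a 5 MiB minimum part size and a 10,000-part ceiling; these are S3 limits as best recalled here and should be checked against current documentation. All names are illustrative.

```python
MIB = 1024**2
GIB = 1024**3
TIB = 1024**4

MAX_SINGLE_PUT = 5 * GIB   # single-part limit from the slide
MAX_OBJECT = 5 * TIB       # multi-part limit from the slide
MIN_PART = 5 * MIB         # assumed minimum part size
MAX_PARTS = 10_000         # assumed maximum number of parts


def needs_multipart(object_size):
    """True when the object is too large for a single-part upload."""
    return object_size > MAX_SINGLE_PUT


def plan_parts(object_size, part_size=64 * MIB):
    """Return (part_number, offset, length) tuples covering the object."""
    if not 0 < object_size <= MAX_OBJECT:
        raise ValueError("object size out of range")
    if part_size < MIN_PART:
        raise ValueError("part size below the assumed minimum")
    # Grow the part size if needed so the plan stays within MAX_PARTS.
    smallest_allowed = -(-object_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)
    parts, offset, number = [], 0, 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Each tuple describes one independent range, so a thread pool can upload the parts concurrently and complete the upload once every part has been acknowledged.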

Page 25

S3 Transfer Acceleration

PUT requests go through the nearest AWS Edge Location

Data transits over the AWS private network rather than the Internet

AWS private network optimizes throughput and latency to the AWS Region

Data is not stored in the edge cache

Uploader

AWS edge location

S3 bucket
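Clients opt in to Transfer Acceleration by sending requests to a dedicated endpoint instead of the regular bucket endpoint. The sketch below builds that endpoint name from a bucket name; the bucket name is hypothetical, and the rule that accelerated buckets cannot contain periods is stated from memory and worth verifying.

```python
def accelerate_endpoint(bucket):
    """Return the Transfer Acceleration endpoint URL for a bucket."""
    if "." in bucket:
        # Accelerated endpoints rely on a single-label bucket hostname.
        raise ValueError("bucket names with periods cannot use acceleration")
    return f"https://{bucket}.s3-accelerate.amazonaws.com"
```

With boto3, the same effect is usually obtained through client configuration (`Config(s3={"use_accelerate_endpoint": True})`) rather than by building the URL by hand.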

Page 26

Data Sources - Streams

Amazon S3

Streams

Page 27

Amazon Kinesis - Stream Processing on AWS

Kinesis Streams

• Capture streaming data for downstream processing

• Allow multiple processors to read streams at their own rate

Kinesis Firehose

• Buffer records in a stream into a single output for more efficient storage

• Automatic flushing of buffer to S3, Elasticsearch, Redshift, or Splunk
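The buffering behaviour described for Kinesis Firehose can be pictured with a small in-memory stand-in. This is not the Firehose API: the class, its thresholds, and the `flush` callback are all hypothetical, and the time check only runs when a record arrives, whereas a real delivery stream also flushes on a timer.

```python
class BufferedSink:
    """Collect records and hand them off as one blob per flush."""

    def __init__(self, max_bytes, max_seconds, flush):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.flush = flush          # callable receiving the combined bytes
        self._buffer = []
        self._size = 0
        self._opened_at = None

    def put(self, record, now):
        """Add one record; `now` is a timestamp in seconds."""
        if not self._buffer:
            self._opened_at = now   # the buffer window starts with its first record
        self._buffer.append(record)
        self._size += len(record)
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        if too_big or too_old:
            self._flush()

    def _flush(self):
        self.flush(b"".join(self._buffer))
        self._buffer, self._size, self._opened_at = [], 0, None
```

Passing the clock in explicitly keeps the sketch deterministic; in a destination such as S3, each flush would become one object instead of one object per record.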

Kinesis Analytics

• Create time windows over streams and perform aggregate operations using SQL

• Join together multiple streams and output to new streams
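Kinesis Analytics expresses time windows in SQL; the Python sketch below only mimics the idea of a tumbling (fixed, non-overlapping) window with a count aggregate so the mechanics are visible. The event shape `(timestamp_seconds, key)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_counts(events, window_seconds):
    """Count events per (window start, key) over fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp maps to exactly one window, identified by its start.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)
```

For instance, with 60-second windows, events for the same key at seconds 0 and 30 land in the window starting at 0, while an event at second 61 lands in the window starting at 60.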

Page 28

Kinesis – How it works

Millions of sources producing 100s of terabytes per hour

Front end: authenticate, authorize

Durable, highly consistent storage replicates data across three AWS Availability Zones

Aggregate and archive to S3

Real-time dashboards and alarms

Machine learning algorithms or sliding window analytics

Aggregate analysis in Hadoop or a data warehouse

Ordered stream of events supports multiple readers
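The idea of one ordered stream serving several readers, each at its own pace, can be sketched as an append-only log with a separate read position per reader. This is an in-memory toy, not the Kinesis client API; the class, method names, and reader names are made up for illustration.

```python
class OrderedStream:
    """Append-only log where each reader keeps its own position."""

    def __init__(self):
        self._records = []
        self._positions = {}

    def put(self, data):
        """Append a record and return its sequence number."""
        self._records.append(data)
        return len(self._records) - 1

    def read(self, reader, limit):
        """Return up to `limit` unread records for this reader, in order."""
        start = self._positions.get(reader, 0)
        batch = self._records[start:start + limit]
        self._positions[reader] = start + len(batch)
        return batch
```

Because positions are tracked per reader, a slow archiver and a fast dashboard both see every event in the same order without holding each other back.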

Page 29

Data Sources - Logs

Amazon S3

Logs

Page 30

Logs

Collecting and Analyzing

• CloudWatch

• Amazon Kinesis

• Other Options

Page 31

Logs – CloudWatch Agent

EC2 Instances → CloudWatch Log Stream → AWS Lambda → Amazon S3
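One way the Lambda step in this pipeline might work is through a CloudWatch Logs subscription, which delivers log events to the function as a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The slide does not say which mechanism is used, so the subscription is an assumption; the function name is illustrative, and the write to S3 is left out.

```python
import base64
import gzip
import json


def decode_log_event(event):
    """Unpack a CloudWatch Logs subscription event.

    Returns (log_group, log_stream, messages) so the caller can decide
    where in the data lake the lines should be written.
    """
    compressed = base64.b64decode(event["awslogs"]["data"])
    document = json.loads(gzip.decompress(compressed))
    messages = [entry["message"] for entry in document.get("logEvents", [])]
    return document.get("logGroup"), document.get("logStream"), messages
```

A handler would call this on the incoming event and then write the joined messages to a key derived from the log group and stream.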

Page 32

Summary - Ingestion

s3://datalake/

/vendorfeeds

/vendorA

/vendorB

/clickstream

/orders

/vendors

/customers

/app_logs

/instance1

/instance2

/syslogs

/instance1

/instance2

/databases

/customers

/orders

/vendors

File Gateway

API Gateway

Kinesis Agent

DMS

Kinesis Firehose

Amazon S3

Files

Streams

Logs

Databases

Page 33

Data Lakes, Analytics, and ML Portfolio from AWS

Broadest, deepest set of analytic services

Amazon SageMaker

AWS Deep Learning AMIs

Amazon Rekognition

Amazon Lex

AWS DeepLens

Amazon Comprehend

Amazon Translate

Amazon Transcribe

Amazon Polly

Amazon Athena

Amazon EMR

Amazon Redshift

Amazon Elasticsearch Service

Amazon Kinesis

Amazon QuickSight

Analytics

Machine Learning

AWS Direct Connect

AWS Snowball

AWS Snowmobile

AWS Database Migration Service

AWS Storage Gateway

AWS IoT Core

Amazon Kinesis Data Firehose

Amazon Kinesis Data Streams

Amazon Kinesis Video Streams

Real-time Data Movement

On-premises Data Movement

Data Lake on AWS

Storage | Archival Storage | Data Catalog

Page 34

Amazon S3—The Data Lake

Security and Compliance

Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie

Flexible Management

Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention

Durability, Availability & Scalability

Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region

Query in Place

Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by 400%

Page 35

Amazon Glacier—Backup and Archive

Durability, Availability & Scalability

Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region

Secure

Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements

Retrieves data in minutes

Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes

Inexpensive

Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost

Page 36

Storing is Not Enough, Data Needs to Be Discoverable

CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data

Semi-structured, Unstructured

“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”

Gartner IT Glossary, 2018

https://www.gartner.com/it-glossary/dark-data

Page 37

AWS Glue: Data Catalog

Make data discoverable

• Automatically discovers data and stores schema

• Catalog makes data searchable, and available for ETL

• Catalog contains table and job definitions

• Computes statistics to make queries efficient
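To make "automatically discovers data and stores schema" concrete, the sketch below infers a column-to-type mapping from a small CSV sample. It is a simplified illustration of schema discovery in general, not how Glue crawlers are implemented; the type names and function names are chosen for the example.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that every sample value parses as."""
    for cast, name in ((int, "bigint"), (float, "double")):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Return {column: type} inferred from CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {column: infer_type([row[column] for row in rows]) for column in rows[0]}
```

A catalog entry would pair a schema like this with the data's location and format so that query engines can find and read it.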

Glue Data Catalog

Discover data and extract schema

Compliance

Page 38

Glue: Data Catalog – Queryable by Many Services

Glue Data Catalog

Glue ETL

Amazon Athena

Redshift Spectrum

EMR (Hadoop/Spark)

Page 39

AWS Glue: ETL Service

Make ETL scripting and deployment easy

• Automatically generates ETL code

• Code is customizable with Python and Spark

• Endpoints provided to edit, debug, test code

• Jobs are scheduled or event-based

• Serverless

Page 40

AWS Glue: Job Execution - Serverless

Auto-configure VPC and role-based access

Customers can specify the capacity that gets allocated to each job

Automatically scale resources (on post-GA roadmap)

You pay only for the resources you consume, and only while you consume them

There is no need to provision, configure, or manage servers

Customer VPC Customer VPC

Compute instances

Page 41

AWS Glue: Overall Flow

1. Crawl your raw data

2. Create your desired targets

3. Generate and prep your ETL

4. Execute your job

Page 42

Data Lakes, Analytics, and ML Portfolio from AWS

Broadest, deepest set of analytic services

Amazon SageMaker

AWS Deep Learning AMIs

Amazon Rekognition

Amazon Lex

AWS DeepLens

Amazon Comprehend

Amazon Translate

Amazon Transcribe

Amazon Polly

Amazon Athena

Amazon EMR

Amazon Redshift

Amazon Elasticsearch Service

Amazon Kinesis

Amazon QuickSight

Analytics

Machine Learning

AWS Direct Connect

AWS Snowball

AWS Snowmobile

AWS Database Migration Service

AWS Storage Gateway

AWS IoT Core

Amazon Kinesis Data Firehose

Amazon Kinesis Data Streams

Amazon Kinesis Video Streams

Real-time Data Movement

On-premises Data Movement

Data Lake on AWS

Storage | Archival Storage | Data Catalog

Page 43

Amazon Athena—Interactive Analysis

Interactive query service to analyze data in Amazon S3 using standard SQL

No infrastructure to set up or manage and no data to load

Ability to run SQL queries on data archived in Amazon Glacier (coming soon)

Query Instantly

Zero setup cost; just point to S3 and start querying

Open

ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types

Easy

Serverless: zero infrastructure, zero administration; integrated with QuickSight

Pay per query

Pay only for queries run; save 30–90% on per-query costs through compression
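The savings from compression follow from Athena charging by the amount of data each query scans. The arithmetic sketch below assumes a price of $5 per terabyte scanned and a decimal terabyte; neither figure appears on the slide, so treat both as placeholders to replace with current pricing.

```python
TB = 10**12  # assumed: decimal terabyte


def query_cost(bytes_scanned, price_per_tb=5.0):
    """Cost of one query given the bytes it scans (price is an assumption)."""
    return bytes_scanned / TB * price_per_tb


def compression_savings(raw_bytes, compressed_bytes):
    """Fraction of the per-query cost saved by scanning compressed data."""
    return 1 - query_cost(compressed_bytes) / query_cost(raw_bytes)
```

If compression shrinks the scanned data to 30% of its raw size, the query costs about 70% less, which sits inside the 30–90% range quoted on the slide.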

Page 44

Page 45

Page 46

Amazon Redshift—Data Warehousing

Fast at scale

Columnar storage technology to improve I/O efficiency and scale query performance

Secure

Audit everything; encrypt data end-to-end; extensive certification and compliance

Open file formats

Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3

Inexpensive

As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour

Page 47

Amazon Redshift Spectrum

Extend the data warehouse to exabytes of data in S3 data lake

S3 data lake

Redshift data

Redshift Spectrum query engine

• Exabyte-scale Redshift SQL queries against S3

• Join data across Redshift and S3

• Scale compute and storage separately

• Stable query performance and unlimited concurrency

• CSV, ORC, Grok, Avro, & Parquet data formats

• Pay only for the amount of data scanned

Page 48

Amazon EMR—Big Data Processing

Low cost

Flexible billing with per-second billing, EC2 spot, reserved instances and auto-scaling to reduce costs 50–80%

Easy

Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning

Latest versions

Updated with the latest open source frameworks within 30 days of release

Use S3 storage

Process data directly in the S3 data lake securely with high performance using the EMRFS connector

Data Lake


Page 49

Amazon Elasticsearch Service

Easy to Use

Fully managed; deploy production-ready clusters in minutes

Secure

Secure access with VPC to keep all traffic within AWS network

Open

Direct access to Elasticsearch open-source APIs; supports Logstash and Kibana

Available

Zone awareness replicates data between two AZs; automatically monitors & replaces failed nodes

Page 50

Amazon QuickSight

Easy

Empower everyone

Seamless connectivity

Fast analysis

Serverless

Page 51

One product for all of your users

QuickSight covers all of your users from casual data consumers, to dashboard creators, to power users and analysts that need self-serve analytics.

Explore

Give power users and analysts the freedom to do their own self-serve data discovery and analysis on governed data you control

Create

Create and publish rich, interactive dashboards to all of your users

Consume

With the new Reader Role, you can provide everyone in your organization with secure, easy access to interactive dashboards and reports, on any device

Page 52

Introducing Pay-per-Session pricing for Readers!

Pay-per-Session pricing for Readers starts at $0.30 per session up to a max of $5/user/month for unlimited sessions for data consumers that interact with published dashboards and reports.
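The pricing rule above is a per-session charge with a monthly cap, which can be written out as a short calculation. The sketch uses the two figures from the slide ($0.30 per session, $5 per user per month) and works in cents to avoid floating-point rounding; the function and parameter names are illustrative.

```python
def reader_monthly_cost(sessions, per_session_cents=30, cap_cents=500):
    """Monthly charge in dollars for one reader, capped at the monthly maximum."""
    return min(sessions * per_session_cents, cap_cents) / 100
```

A reader with 10 sessions in a month costs $3.00, while any reader with 17 or more sessions reaches the $5.00 cap.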

Page 53

Data Lakes, Analytics, and ML Portfolio from AWS

Broadest, deepest set of analytic services

Amazon SageMaker

AWS Deep Learning AMIs

Amazon Rekognition

Amazon Lex

AWS DeepLens

Amazon Comprehend

Amazon Translate

Amazon Transcribe

Amazon Polly

Amazon Athena

Amazon EMR

Amazon Redshift

Amazon Elasticsearch service

Amazon Kinesis

Amazon QuickSight

AnalyticsMachine Learning

AWS Direct Connect

AWS Snowball

AWS Snowmobile

AWS Database Migration Service

AWS Storage Gateway

AWS IoT Core

Amazon Kinesis Data Firehose

Amazon Kinesis Data Streams

Amazon Kinesis Video Streams

Real-time Data Movement

On-premises Data Movement

Data Lake on AWSStorage | Archival Storage | Data Catalog

Page 54: Launching your Big Data Project on AWS - Amazon S3 · Amazon Kinesis - Stream Processing on AWS Kinesis Streams Capture streaming data for downstream processing Allow multiple processors

AWS offers a range of tools to make AI/ML more accessible

Usability/simplicity: leverages AWS AI/ML expertise

Greater control: customer-specific models

Amazon AI/ML Services: Polly, Lex, Rekognition, Rekognition Video, Connect, Transcribe, Translate, Comprehend

Machine Learning Platforms: Amazon ML, Spark & EMR, Kinesis, Batch, ECS, SageMaker, DeepLens

Deep Learning Frameworks: Apache MXNet, TensorFlow, Caffe/Caffe2, Theano, Keras, Torch, Cognitive Toolkit

These solutions are underpinned by proven, scalable AWS products and services: AWS Greengrass, AWS IoT, AWS Lambda, Amazon EC2 (P2 and G2 GPUs), Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon EC2 (CPUs), Amazon EC2 (ENA)

Page 55: Launching your Big Data Project on AWS - Amazon S3 · Amazon Kinesis - Stream Processing on AWS Kinesis Streams Capture streaming data for downstream processing Allow multiple processors

Amazon SageMaker

The quickest and easiest way to get ML models from idea to production

NEW!

• Zero setup

• Flexible Model Training

• End-to-End Machine Learning Platform

• Pay by the second

Page 56: Launching your Big Data Project on AWS - Amazon S3 · Amazon Kinesis - Stream Processing on AWS Kinesis Streams Capture streaming data for downstream processing Allow multiple processors

These tools come together to form a full AI/ML stack

Services: Lex, Polly, Rekognition, Rekognition Video, Connect, Transcribe, Translate, Comprehend

Platforms: Amazon ML, ECS, Spark & EMR, Kinesis, Batch, SageMaker, DeepLens

ML Solutions Lab

Frameworks: Apache MXNet, Torch, Cognitive Toolkit, Keras, Theano, Caffe2 & Caffe, TensorFlow

AWS Deep Learning AMI

Infrastructure: GPU, Mobile, CPU, IoT

We support all major frameworks to provide our customers with the best tool for the job.

Page 57: Launching your Big Data Project on AWS - Amazon S3 · Amazon Kinesis - Stream Processing on AWS Kinesis Streams Capture streaming data for downstream processing Allow multiple processors

ML Solutions Lab lets you leverage Amazon expertise

Companies have numerous opportunities for Machine Learning

But lack ML expertise or scale

And are unable to unlock business potential

Amazon ML Lab provides the missing ML expertise: Brainstorming, Modeling, Teaching

Leverage Amazon experts with decades of ML experience with technologies like Amazon Echo, Amazon Alexa, Prime Air, and Amazon Go

Engage the ML Solutions Lab to harness the business value of your data

Page 58: Launching your Big Data Project on AWS - Amazon S3 · Amazon Kinesis - Stream Processing on AWS Kinesis Streams Capture streaming data for downstream processing Allow multiple processors

In Summary…

Processing & Analytics

Transactional & RDBMS: DynamoDB (NoSQL DB), Aurora (Relational Database)

BI & Data Visualization

Kinesis Streams & Firehose

Batch: EMR (Hadoop, Spark, Presto), Redshift (Data Warehouse), Athena (Query Service), AWS Batch

Predictive

Real-time: AWS Lambda, Apache Storm on EMR, Apache Flink on EMR, Spark Streaming on EMR, Elasticsearch Service, Kinesis Analytics, Kinesis Streams

ElastiCache, DAX

Page 59: Launching your Big Data Project on AWS - Amazon S3 · Amazon Kinesis - Stream Processing on AWS Kinesis Streams Capture streaming data for downstream processing Allow multiple processors

AWS Training Offer

Make your data-driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:

• Implement core AWS Big Data services according to best practices

• Design and maintain Big Data

• Leverage tools to automate data analysis

Certified Cloud Practitioner

Associate-level Certification

AWS Certified Big Data - Specialty

Who should attend

• Enterprise solutions architects

• Data scientists

• Big Data solutions architects

• Data analysts

Free AWS digital training: Foundational knowledge

Big Data on AWS – 3-day Classroom Training

Free AWS digital training:

Big Data Technology Fundamentals

Visit www.aws.training to find out more.

Page 60: Launching your Big Data Project on AWS - Amazon S3 · Amazon Kinesis - Stream Processing on AWS Kinesis Streams Capture streaming data for downstream processing Allow multiple processors

We hope you found it interesting! A kind reminder to complete the survey.

Let us know what you thought of today’s event and how we can improve the event experience for you in the future.

twitter.com/AWSCloud

facebook.com/AmazonWebServices

youtube.com/user/AmazonWebServices

slideshare.net/AmazonWebServices

twitch.tv/aws

Thank You For Attending the AWS Data Driven Decisions Webinar Series.