Full Stack Analytics on Amazon Web Services, by Ian Robinson at Big Data Spain 2017

Post on 17-Mar-2018


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Specialist Solutions Architect, Data and Analytics, EMEA

November 17th, 2017

Full Stack Analytics on AWS
Ian Robinson

Forces and Trends Prompting the Move to Cloud

Cost Optimization
• Licenses
• Hardware
• Data center and operations

Dark Data
• Prematurely discarding data

Agility
• Experimentation (data & tools)
• Democratised Access to Data
• Time-to-first-results
• Terminate failed experiments early

From BI to Data Science
• In-house data science
• From back office to product

Storage is the Gravity for Cloud Applications

Store all your data, for ever, at every stage of its lifecycle
Apply it using the right tool for the job

Storage is Job #1

Object Storage is Foundational

Standard: active data
Standard - Infrequent Access: infrequently accessed data
Amazon Glacier: archive data

Create

Delete

Events and Lifecycle Management
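The tiering on this slide (Standard, then Standard - Infrequent Access, then Amazon Glacier) can be expressed as an S3 lifecycle configuration. The following is a minimal sketch, not something shown in the talk: the rule naming scheme, the prefix, the bucket name, and the day thresholds are all illustrative assumptions. The dictionary follows the shape that boto3's put_bucket_lifecycle_configuration accepts.

```python
def build_lifecycle_configuration(prefix, ia_after_days=30, glacier_after_days=90,
                                  expire_after_days=None):
    """Build an S3 lifecycle configuration that tiers objects under `prefix`.

    The thresholds are hypothetical defaults chosen for illustration only.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Infrequent Access transition")

    rule_name = prefix.strip("/").replace("/", "-") or "all"
    rule = {
        "ID": f"tier-{rule_name}",               # hypothetical naming scheme
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


config = build_lifecycle_configuration("raw/events/")

# With boto3 installed and credentials configured, the configuration could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket",            # hypothetical bucket name
#       LifecycleConfiguration=config)
```

Keeping the configuration as plain data makes it easy to review and version alongside the rest of the data lake definition before it is applied.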

S3 as the Data Lake Fabric

• Unlimited number of objects and volume
• 99.99% availability
• 99.999999999% durability
• Versioning
• Tiered storage via lifecycle policies
• SSL, client/server-side encryption at rest
• Low cost (just over $2700/month for 100TB)
• Natively supported by big data frameworks (Spark, Hive, Presto, etc)
• Decouples storage and compute
• Run transient compute clusters (with Amazon EC2 Spot Instances)
• Multiple, heterogeneous clusters can use same data
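To make two of the figures above concrete, here is a small arithmetic sketch. It assumes the durability figure is an annual, per-object probability and that 1 TB is counted as 1024 GB; both are assumptions, and the function names are hypothetical. Because the slide says "just over" $2700, the implied price is a lower bound.

```python
from fractions import Fraction


def expected_annual_object_loss(num_objects, durability_percent):
    """Expected number of objects lost per year, using exact rational arithmetic.

    `durability_percent` is passed as a string so no floating-point rounding
    creeps into a number with eleven nines.
    """
    loss_probability = 1 - Fraction(durability_percent) / 100
    return num_objects * loss_probability


def implied_price_per_gb_month(monthly_cost_usd, terabytes, gb_per_tb=1024):
    """Per-GB monthly price implied by a total monthly cost."""
    return monthly_cost_usd / (terabytes * gb_per_tb)


losses = expected_annual_object_loss(10_000_000, "99.999999999")
print(f"Expected losses per year across 10 million objects: {float(losses)}")
print(f"Implied price: at least ${implied_price_per_gb_month(2700, 100):.4f} per GB-month")
```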

Database Migration Service

Automated Data Ingestion

Stream Events to S3 Using Kinesis Firehose
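As a rough illustration of the producer side, the sketch below turns application events into batches shaped for the Firehose PutRecordBatch API. It is an assumption-laden example rather than code from the talk: the event shape and the stream name are hypothetical, and the sketch enforces only the 500-records-per-call limit, not the per-call size limit. A newline is appended to each record because Firehose concatenates records as-is when writing objects to S3.

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_batches(events, batch_size=MAX_RECORDS_PER_BATCH):
    """Serialise events as newline-delimited JSON and group them into batches."""
    if not 1 <= batch_size <= MAX_RECORDS_PER_BATCH:
        raise ValueError("batch_size must be between 1 and 500")
    records = [
        {"Data": (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")}
        for event in events
    ]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


batches = to_firehose_batches({"id": n} for n in range(1200))

# With boto3 installed and credentials configured, each batch could be sent as:
#   firehose = boto3.client("firehose")
#   for batch in batches:
#       firehose.put_record_batch(
#           DeliveryStreamName="my-delivery-stream",   # hypothetical stream name
#           Records=batch)
```

A production sender would also inspect the response for failed records and retry them; that is omitted here.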

Write Database Changes to S3 with DMS

Full Load
<schema_name>/<table_name>/LOAD001.csv
<schema_name>/<table_name>/LOAD002.csv

Change Data Capture
<schema_name>/<table_name>/<time-stamp>.csv
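Downstream jobs often need to tell full-load files from change-data-capture files. The sketch below classifies S3 keys using the layout on this slide. The slide does not specify the time-stamp format, so the sketch treats any other CSV under <schema_name>/<table_name>/ as a CDC file; that rule, the function name, and the sample keys are assumptions.

```python
import re

# Keys of the form <schema_name>/<table_name>/LOAD<digits>.csv
FULL_LOAD_KEY = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/LOAD\d+\.csv$")
# Any other CSV directly under <schema_name>/<table_name>/ (assumed to be CDC output)
OTHER_CSV_KEY = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/[^/]+\.csv$")


def classify_dms_key(key):
    """Return (kind, schema, table) for a DMS output key, or None if it does not match."""
    match = FULL_LOAD_KEY.match(key)
    if match:
        return ("full_load", match["schema"], match["table"])
    match = OTHER_CSV_KEY.match(key)
    if match:
        return ("cdc", match["schema"], match["table"])
    return None


for key in ["sales/orders/LOAD001.csv", "sales/orders/20171117-101500.csv", "README.txt"]:
    print(key, "->", classify_dms_key(key))
```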

Scalable (secure, versioned, durable) storage
+ Immutable data at every stage of its lifecycle
+ Versioned schema and metadata
= Data discovery, lineage

Storage + Catalog

AWS Glue

• Data Catalog: discover and store metadata
• Job Authoring: auto-generated ETL code
• Job Execution: serverless scheduling and execution

Hive metastore-compatible, highly-available metadata repository:
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve
• Table definitions – usable by Redshift, Athena, Glue, EMR

Populate using Hive DDL, bulk import, or automatically through crawlers.

Glue Data Catalog
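Of the population options mentioned above, Hive DDL is the most explicit. As a hedged sketch, the helper below generates a CREATE EXTERNAL TABLE statement for comma-delimited files in S3. The database, table, column names, and bucket are hypothetical, and the helper does no escaping, so it is only suitable for trusted inputs.

```python
def hive_ddl_for_csv(database, table, columns, location):
    """Render Hive DDL for an external table over comma-delimited files.

    `columns` is a list of (name, hive_type) pairs, e.g. ("order_id", "int").
    """
    if not location.endswith("/"):
        location += "/"  # point at a folder-like prefix rather than a single object
    column_sql = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}';"
    )


ddl = hive_ddl_for_csv(
    "sales", "orders",
    [("order_id", "int"), ("customer", "string"), ("total", "double")],
    "s3://my-data-lake-bucket/sales/orders",   # hypothetical bucket and prefix
)
print(ddl)
```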

Crawler flow: enumerate S3 objects (file 1, file 2, … file N) → identify file type and parse files → semi-structured per-file schema (nested int, char, bool, array and struct fields) → semi-structured unified schema
• Custom classifiers: app log parser, metrics parser
• System classifiers: JSON parser, CSV parser, Apache log parser

Crawlers: Automatic Schema Inference
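The idea of deriving a per-file schema and then merging the results into one unified schema can be sketched in a few lines. This is an illustration of the concept only and makes no claim to be how AWS Glue crawlers work internally. The type names mirror the labels on the slide; falling back to char when two types conflict is an assumption.

```python
import json


def infer_schema(value):
    """Describe the type of a parsed JSON value as a nested structure."""
    if isinstance(value, bool):        # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "char"
    if isinstance(value, list):
        element = None
        for item in value:
            element = merge_schemas(element, infer_schema(item))
        return {"array": element}
    if isinstance(value, dict):
        return {"struct": {name: infer_schema(v) for name, v in value.items()}}
    return "null"


def merge_schemas(a, b):
    """Combine two schemas into one that can describe both."""
    if a is None or a == "null":
        return b
    if b is None or b == "null":
        return a
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "struct" in a and "struct" in b:
            fields = dict(a["struct"])
            for name, schema in b["struct"].items():
                fields[name] = merge_schemas(fields.get(name), schema)
            return {"struct": fields}
        if "array" in a and "array" in b:
            return {"array": merge_schemas(a["array"], b["array"])}
    return "char"  # assumed fallback when two types conflict


def unified_schema(files):
    """Merge the schemas of every JSON line in every file."""
    unified = None
    for lines in files:
        for line in lines:
            unified = merge_schemas(unified, infer_schema(json.loads(line)))
    return unified


file_1 = ['{"id": 1, "tags": ["a", "b"]}']
file_2 = ['{"id": 2, "active": true, "owner": {"name": "ana"}}']
print(unified_schema([file_1, file_2]))
```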

Amazon S3 (ObjectCreated, ObjectDeleted) → AWS Lambda (PutItem) → Metadata Index (Amazon DynamoDB)
Metadata Index (Update Stream) → AWS Lambda (Extract Search Fields, Update Index) → Search Index (Amazon Elasticsearch)

Indexing and Searching Using Metadata
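The first hop of this flow, from an S3 notification to a metadata index entry, might look like the sketch below. It covers only the translation step and keeps it free of AWS calls so it can be tested locally. The item attributes and the choice to delete the index entry on removal are assumptions, since the slide shows only PutItem. Note that S3 notifications report deletions with event names beginning with ObjectRemoved.

```python
from urllib.parse import unquote_plus


def index_actions(event):
    """Translate an S3 event notification into metadata index actions.

    Returns a list of (operation, item) tuples that a Lambda function could
    apply to a DynamoDB table.
    """
    actions = []
    for record in event.get("Records", []):
        event_name = record["eventName"]
        s3_info = record["s3"]
        item_key = {
            "bucket": s3_info["bucket"]["name"],
            # Object keys arrive URL-encoded in notifications
            "key": unquote_plus(s3_info["object"]["key"]),
        }
        if event_name.startswith("ObjectCreated"):
            item = dict(item_key,
                        size=s3_info["object"].get("size"),
                        event_time=record.get("eventTime"))
            actions.append(("PutItem", item))
        elif event_name.startswith("ObjectRemoved"):
            actions.append(("DeleteItem", item_key))   # assumed behaviour
    return actions


# Hand-written sample event, trimmed to the fields the sketch reads
sample_event = {"Records": [
    {"eventName": "ObjectCreated:Put", "eventTime": "2017-11-17T10:15:00.000Z",
     "s3": {"bucket": {"name": "my-data-lake-bucket"},
            "object": {"key": "raw/my+file.csv", "size": 1024}}},
    {"eventName": "ObjectRemoved:Delete",
     "s3": {"bucket": {"name": "my-data-lake-bucket"},
            "object": {"key": "raw/old.csv"}}},
]}
print(index_actions(sample_event))
```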

Security is Job #0

Data Access & Authorisation
Give your users easy and secure access

Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog

Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified

Identity and Access Management

• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with Amazon Security Token Service (Amazon STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
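As an example of the policy language applied to an S3 bucket, the sketch below builds a bucket policy granting one role read-only access to a single prefix. The bucket name, prefix, account ID, and role name are hypothetical placeholders, and the policy is a starting point for illustration rather than a recommended production policy.

```python
import json


def read_only_bucket_policy(bucket, prefix, role_arn):
    """Build a bucket policy letting `role_arn` list and read objects under `prefix`."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListingPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "AllowReadingObjects",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_bucket_policy(
    "my-data-lake-bucket",                               # hypothetical bucket
    "curated/sales/",
    "arn:aws:iam::123456789012:role/analyst",            # hypothetical role
)
print(json.dumps(policy, indent=2))
```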

IAM → Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, Amazon Athena

Service API Access

Security at the Data Level

Third Party Ecosystem Security Tools

Amazon S3

AWS CloudTrail

http://amzn.to/2tSimHj

Amazon Athena

Access Logging

API Logging

Access Log

Analytics

IAM

Amazon EMR

http://amzn.to/2si6RqS

Storage Level Support for Access Logging and Audit

Encryption Options

AWS Server-Side encryption
• AWS managed key infrastructure

AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services

AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2
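For the first two options, the choice shows up as request parameters when an object is written to S3. The sketch below maps an option name to the corresponding put_object arguments. The option labels "sse-s3" and "sse-kms", the helper name, and the key alias are assumptions made for this example; CloudHSM-backed setups are not covered.

```python
def encryption_arguments(option, kms_key_id=None):
    """Return the server-side encryption arguments for an S3 put_object call."""
    if option == "sse-s3":
        # Keys managed entirely by Amazon S3
        return {"ServerSideEncryption": "AES256"}
    if option == "sse-kms":
        # Keys managed in AWS Key Management Service
        arguments = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id is not None:
            arguments["SSEKMSKeyId"] = kms_key_id
        return arguments
    raise ValueError(f"unknown encryption option: {option!r}")


arguments = encryption_arguments("sse-kms", kms_key_id="alias/data-lake")  # hypothetical alias

# With boto3 installed and credentials configured, the arguments could be used as:
#   boto3.client("s3").put_object(
#       Bucket="my-data-lake-bucket",        # hypothetical bucket name
#       Key="raw/events/part-0001.json",
#       Body=b"...",
#       **arguments)
```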

Serverless Processing and Analytics

• Python code generated by AWS Glue

• Connect a notebook or IDE to AWS Glue

• Existing code brought into AWS Glue

Managed ETL with AWS Glue

• Schedule-based
• Event-based
• On demand

Job Execution with AWS Glue

Amazon Kinesis Analytics

• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms

SELECT STREAM author, count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#BigDataSpain%'
WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);

Amazon Kinesis Analytics – Simple SQL Interface
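To show what the sliding-window query computes, here is a plain-Python emulation over an in-memory list. It is a conceptual sketch, not how Kinesis Analytics executes the query. It assumes tweets arrive in timestamp order as (seconds, author, text) tuples, and that a row exactly one minute old is still inside the window.

```python
from collections import defaultdict, deque


def sliding_author_counts(tweets, hashtag="#BigDataSpain", window_seconds=60):
    """For each matching tweet, count that author's matching tweets in the preceding window."""
    recent = defaultdict(deque)   # author -> timestamps still inside the window
    results = []
    for timestamp, author, text in tweets:
        if hashtag not in text:   # plays the role of the WHERE ... LIKE filter
            continue
        window = recent[author]
        window.append(timestamp)
        # Evict timestamps that have fallen out of the one-minute range
        while window[0] < timestamp - window_seconds:
            window.popleft()
        results.append((author, len(window)))
    return results


sample_tweets = [
    (0, "ana", "Hello #BigDataSpain"),
    (30, "ana", "More from #BigDataSpain"),
    (45, "bob", "#BigDataSpain talk starting"),
    (50, "ana", "Lunch time"),
    (100, "ana", "Still at #BigDataSpain"),
]
print(sliding_author_counts(sample_tweets))
```

Each output row corresponds to one input row, which is the difference between a sliding window like this one and a tumbling window that emits one row per interval.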

Amazon Athena – Analyze Data in S3

• Interactive queries
• ANSI SQL
• No infrastructure or administration
• Zero spin up time
• Query data in its raw format
• AVRO, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability

Simple query editor with syntax highlighting and autocomplete

Data Catalog

Query History, Saved Queries, and Catalog Management

QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena

Amazon RDS

Amazon S3

Amazon Redshift

Amazon Athena

Using Amazon Athena with Amazon QuickSight

Building Smarter Applications

Add Machine Learning Capabilities

Amazon Machine Learning Service
• Batch and online predictions
• Train using data in S3, RDS and Redshift

Amazon EMR
• Comprehensive machine learning libraries (eg Spark MLlib, Anaconda)
• Provision analytics clusters in minutes, autoscale with data volume or query demand

Amazon AI Services

Amazon Polly – Lifelike Text-to-Speech
• 47 voices, 24 languages
• Low-latency, real time

Amazon Rekognition – Image Analysis
• Object and scene detection
• Facial analysis

Amazon Lex – Conversational Engine
• Speech and text recognition
• Enterprise connectors

Demographic Data

Facial Landmarks

Sentiment Expressed

Image Quality

Facial Analysis with Rekognition

Brightness: 25.84
Sharpness: 160

General Attributes
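The attribute groups on this slide map onto fields of the face details that Rekognition's DetectFaces operation returns. The sketch below condenses one such entry into a flat summary. The sample input is hand-written for illustration: only the brightness and sharpness values come from the slide, and everything else in it is made up.

```python
def summarize_face(face_detail):
    """Condense one entry of a DetectFaces `FaceDetails` list into a flat summary."""
    emotions = face_detail.get("Emotions", [])
    top_emotion = max(emotions, key=lambda e: e["Confidence"])["Type"] if emotions else None
    age = face_detail.get("AgeRange", {})
    quality = face_detail.get("Quality", {})
    return {
        "gender": face_detail.get("Gender", {}).get("Value"),       # demographic data
        "age_range": (age.get("Low"), age.get("High")),
        "top_emotion": top_emotion,                                  # sentiment expressed
        "landmark_count": len(face_detail.get("Landmarks", [])),     # facial landmarks
        "brightness": quality.get("Brightness"),                     # image quality
        "sharpness": quality.get("Sharpness"),
    }


# Hand-written sample entry; not real service output
sample_face = {
    "AgeRange": {"Low": 26, "High": 38},
    "Gender": {"Value": "Female", "Confidence": 99.9},
    "Emotions": [{"Type": "HAPPY", "Confidence": 92.1}, {"Type": "CALM", "Confidence": 5.3}],
    "Landmarks": [{"Type": "eyeLeft", "X": 0.41, "Y": 0.37},
                  {"Type": "eyeRight", "X": 0.58, "Y": 0.36}],
    "Quality": {"Brightness": 25.84, "Sharpness": 160},
}
print(summarize_face(sample_face))

# With boto3 installed and credentials configured, real face details could be fetched as:
#   response = boto3.client("rekognition").detect_faces(
#       Image={"S3Object": {"Bucket": "my-images", "Name": "face.jpg"}},  # hypothetical
#       Attributes=["ALL"])
#   summaries = [summarize_face(face) for face in response["FaceDetails"]]
```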

• Up to ~40k CUDA cores
• Pre-configured CUDA drivers
• Jupyter notebook with Python2, Python3, Anaconda
• CloudFormation Template
• AWS Marketplace – one-click deploy

AWS Deep Learning AMI

Kinesis Firehose

Athena Query Service
Glue
Machine Learning: predictive analytics
Amazon AI

Data Access & Authorisation
Give your users easy and secure access

Data Ingestion
Get your data into S3 quickly and securely

Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding

Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified

Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog

Thank You
Full Stack Analytics on AWS
