AWS March 2016 Webinar Series: Building Your Data Lake on AWS
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Meyers, Principal Solution Architect, AWS
March 2016
Replay of BDT317
Building Your Data Lake on AWS
Benefits of the Enterprise Data Warehouse
Self-documenting schema
Enforced data types
Ubiquitous and common security model
Simple tools to access, robust ecosystem
Transactionality
Benefits of Separation of Compute & Storage
All your data, without paying for unused cores
Independent cost attribution per dataset
Use the right tool for a job, at the right time
Increased durability without operations
Common model for data, without enforcing access method
Comparison of a Data Lake to an Enterprise Data Warehouse

Data Lake | Enterprise Data Warehouse
Complementary to the EDW (not a replacement) | Data lake can be a source for the EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time-consuming to introduce new content
Data science + prediction/advanced analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
The New Problem
Enterprise data warehouse ≠ EMR + S3
Which system has my data?
How can I do machine learning against the DW?
I built this in Hive, can we get it into the Finance reports?
These sources are giving different results…
But I implemented the algorithm in Anaconda…
Dive Into The Data Lake
[Diagram: the data lake (EMR + S3) ingests any data and provides data cleansing, a data catalogue, trend analysis, and machine learning. It loads cleansed data into the enterprise data warehouse, which offers structured analysis, common access tools, efficient aggregation, and structured business rules, and exports computed aggregates back to the lake.]
Components of a Data Lake
Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost

Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as a stream
• Facilitates low latency analytics
Storage & Streams
Catalogue & Search
Entitlements
API & UI
Components of a Data Lake
Catalogue
• Metadata lake
• Used for summary statistics and data classification management
Search
• Simplified access model for data discovery
Components of a Data Lake
Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions
Components of a Data Lake
API & User Interface
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected
Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost
Simple Storage Service
• Highly scalable object storage for the Internet
• 1 byte to 5 TB in size
• Designed for 99.999999999% durability, 99.99% availability
• Regional service, no single points of failure
• Server-side encryption
Data Storage Format
• Not all data formats are created equally
• Unstructured vs. semi-structured vs. structured
• Store a copy of raw input
• Data standardisation as a workflow following ingest
• Use a format that supports your data, rather than force your data into a format
• Consider how data will change over time
• Apply common compression
Consider Different Types of Data
Unstructured
• Store native file format (logs, dump files, whatever)
• Compress with a streaming codec (LZO, Snappy)

Semi-structured (JSON, XML files, etc.)
• Consider evolution ability of the data schema (Avro)
• Store the schema for the data as a file attribute (metadata/tag)

Structured
• Lots of data is CSV!
• Columnar storage (ORC, Parquet)
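"Apply common compression" can be tried directly with Python's standard gzip module (a stand-in here for the streaming codecs named above; the sample CSV rows are made up):

```python
import gzip

# Hypothetical raw CSV feed data, as it might arrive on ingest.
rows = ["id,date,value"] + [f"{i},2016-03-{i % 28 + 1:02d},{i * 1.5}" for i in range(10000)]
raw = "\n".join(rows).encode("utf-8")

# Compress once as part of the ingest workflow; repetitive,
# structured text like CSV compresses very well.
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")

# Decompression restores the exact original bytes.
assert gzip.decompress(compressed) == raw
```

LZO and Snappy trade compression ratio for speed and splittability, which matters for streaming ingest; gzip appears here only because it ships with the standard library.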
Where to Store Data
Amazon S3 storage uses a flat keyspace
Separate data storage by business unit, application, type, and time
Natural data partitioning is very useful
Paths should be self-documenting and intuitive
Changing prefix structure in future is hard/costly
Metadata Services
CRUD API
Query API
Analytics API
Systems of Reference
Return URLs: deep links to applications, file exchanges via S3 (RESTful file services), or manifests for Big Data analytics / HPC
Integration Layer
System-to-system via Amazon SNS/Amazon SQS; system-to-user via mobile push
Amazon Simple Workflow for high level system integration / orchestration
http://en.wikipedia.org/wiki/Resource-oriented_architecture
s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied}
Resource Oriented Architecture
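A key following this template can be assembled with a small helper; a sketch (the component values are hypothetical, and the #appliedSecurity fragment is omitted):

```python
from datetime import date

def data_lake_key(system: str, application: str, day: date,
                  resource: str, resource_id: str) -> str:
    """Build a self-documenting S3 key following the
    ${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID} template."""
    return f"s3://{system}/{application}/{day.isoformat()}/{resource}/{resource_id}"

key = data_lake_key("finance", "billing", date(2016, 3, 1), "invoice", "42")
print(key)  # s3://finance/billing/2016-03-01/invoice/42
```

Because changing the prefix structure later is hard and costly, centralising key construction in one helper keeps every writer consistent.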
Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as a stream
• Facilitates low latency analytics
Why Do Streams Matter?
• Latency between event & action
• Most BI systems target event-to-action latency of 1 hour
• Streaming analytics would expect event-to-action latency < 2 seconds
• Stream orientation simplifies architecture, but can increase operational complexity
• Increase in complexity needs to be justified by business value of reduced latency
Amazon Kinesis
• Managed service for real-time big data processing
• Create streams to produce & consume data
• Elastically add and remove shards for performance
• Use the Amazon Kinesis Client Library to process data
• Integration with S3, Amazon Redshift, and DynamoDB
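Kinesis routes each record by taking the MD5 hash of its partition key and mapping the resulting 128-bit integer into the hash key range owned by one shard. A stdlib-only sketch of that routing, assuming the shards split the hash key space evenly:

```python
import hashlib

def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index the way Kinesis does:
    MD5 of the key as a 128-bit integer, bucketed into equal
    contiguous hash key ranges (one range per shard)."""
    h = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    range_size = 2 ** 128 // num_shards
    return min(h // range_size, num_shards - 1)

# The same partition key always lands on the same shard, which is
# what preserves per-key ordering within a stream.
print(shard_for("device-17", 4), shard_for("device-17", 4))
```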
Amazon Kinesis Architecture
[Diagram: data sources send records through the AWS endpoint to a stream whose shards (Shard 1 through Shard N) are replicated across three Availability Zones. Consuming applications include App.1 (archive/ingestion to S3), App.2 (sliding window analysis), App.3 (data loading into Amazon Redshift), and App.4 (event processing systems), with DynamoDB also integrated.]
Streaming Storage Integration
[Diagram: analytics applications read & write file data in the object store (Amazon S3) and read & write to streams (Amazon Kinesis); streams are archived to the object store, and history can be replayed from it.]
Catalogue & Search
• Metadata lake
• Used for summary statistics and data classification management
• Simplified model for data discovery & governance
Building a Data Catalogue
• Aggregated information about your storage & streaming layer
• Storage service for metadata: ownership, data lineage
• Data abstraction layer: customer data = collection of prefixes
• Enabling data discovery
• API for use by entitlements service
Data Catalogue – Metadata Index
• Stores data about your Amazon S3 storage environment
• Total size & count of objects by prefix, data classification, refresh schedule, object version information
• Amazon S3 events processed by Lambda function
• DynamoDB metadata tables store required attributes
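The Lambda step above might reduce each S3 event to a DynamoDB item like the following; a sketch using a pure helper (the event shape follows the S3 notification format, while the item attribute names are assumptions, and the actual `put_item` call is left out):

```python
def s3_event_to_metadata_items(event: dict) -> list:
    """Turn S3 object-created notification records into metadata
    items keyed by prefix, ready to put into a DynamoDB table."""
    items = []
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        items.append({
            "bucket": record["s3"]["bucket"]["name"],
            # Group objects under their containing prefix.
            "prefix": key.rsplit("/", 1)[0] if "/" in key else "",
            "object_key": key,
            "size_bytes": record["s3"]["object"].get("size", 0),
            "event_time": record.get("eventTime"),
        })
    # Inside the Lambda handler, each item would then be written with
    # table.put_item(Item=item) against the metadata table.
    return items

event = {"Records": [{"eventTime": "2016-03-01T12:00:00Z",
                      "s3": {"bucket": {"name": "mydatalake"},
                             "object": {"key": "finance/billing/2016-03-01/invoice/42",
                                        "size": 1024}}}]}
print(s3_event_to_metadata_items(event))
```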
Amazon DynamoDB
• Provisioned throughput NoSQL database
• Fast, predictable, configurable performance
• Fully distributed, fault-tolerant HA architecture
• Integration with Amazon EMR & Hive
AWS Lambda
• Fully-managed event processor
• Node.js or Java, integrated AWS SDK
• Natively compile & install any Node.js modules
• Specify runtime RAM & timeout
• Automatically scaled to support event volume
• Events from Amazon S3, Amazon SNS, Amazon DynamoDB, Amazon Kinesis, & AWS Lambda
• Integrated CloudWatch logging
Data Catalogue – Search
Ingestion and pre-processing
Text processing (normalization)
• Tokenization
• Downcasing
• Stemming
• Stopword removal
• Synonym addition

Indexing
Matching
Ranking and relevance
• TF-IDF
• Additional criteria (rating, user behavior, freshness, etc.)

[Diagram: a processor builds a search index from any source (NoSQL, RDBMS, files).]
Features and Benefits
Easy to set up and operate
• AWS Management Console, SDK, CLI

Scalable
• Automatic scaling on data size and traffic

Reliable
• Automatic recovery of instances, multi-AZ, etc.

High performance
• Low latency and high throughput through in-memory caching

Fully managed
• No capacity guessing

Rich features
• Faceted search, suggestions, relevance ranking, geospatial search, multi-language support, etc.

Cost effective
• Pay as you go

Amazon CloudSearch & Elasticsearch
Data Catalogue – Building Search Index
Enable a DynamoDB Update Stream for the metadata index table
An additional AWS Lambda function reads the Update Stream and extracts index fields from the S3 object
The function updates the Amazon CloudSearch domain
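That second Lambda function might translate each stream record into a CloudSearch document-batch entry; a sketch (the record shape follows DynamoDB Streams, the index field names are assumptions, and the upload to the CloudSearch domain is omitted):

```python
def stream_record_to_search_doc(record: dict) -> dict:
    """Convert a DynamoDB stream INSERT/MODIFY record into a
    CloudSearch document-batch 'add' entry."""
    image = record["dynamodb"]["NewImage"]
    # Stream images wrap every value in a type descriptor, e.g. {"S": "..."};
    # only string attributes are indexed in this sketch.
    fields = {name: attr["S"] for name, attr in image.items() if "S" in attr}
    return {
        "type": "add",
        # Normalise the object key into a simple document ID.
        "id": fields["object_key"].replace("/", "_"),
        "fields": fields,
    }

record = {"eventName": "INSERT",
          "dynamodb": {"NewImage": {
              "object_key": {"S": "finance/billing/2016-03-01/invoice/42"},
              "classification": {"S": "internal"}}}}
doc = stream_record_to_search_doc(record)
print(doc["id"])  # finance_billing_2016-03-01_invoice_42
```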
Entitlements
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions
Identity & Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with the AWS Security Token Service (STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
IAM Policy Language
JSON documents
Can include variables which extract information from the request context
aws:CurrentTime | For date/time conditions
aws:EpochTime | The date in epoch or UNIX time, for use with date/time conditions
aws:TokenIssueTime | The date/time that temporary security credentials were issued, for use with date/time conditions
aws:principaltype | Whether the principal is an account, user, federated user, or assumed role
aws:SecureTransport | Boolean representing whether the request was sent using SSL
aws:SourceIp | The requester's IP address, for use with IP address conditions
aws:UserAgent | Information about the requester's client application, for use with string conditions
aws:userid | The unique ID for the current user
aws:username | The friendly name of the current user
IAM Policy Language
Example: Allow a user to access a private part of the data lake
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mydatalake"],
      "Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}
    },
    {
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]
    }
  ]
}
IAM Federation
IAM allows federation with Active Directory, and with OpenID providers such as Amazon, Facebook, and Google
AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS
[Diagram: on-premises users federate into IAM via AWS Directory Service's AD Connector, connected over AWS Direct Connect or a hardware VPN.]
Data Encryption
AWS CloudHSM
• Dedicated-tenancy SafeNet Luna SA HSM device
• Common Criteria EAL4+, NIST FIPS 140-2

AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services

AWS server-side encryption
• AWS managed key infrastructure
Entitlements – Access to Encryption Keys
[Diagram: a customer master key wraps per-object customer data keys; the ciphertext key is stored with the object (the S3 object "MyData" carries the ciphertext key in its metadata). A caller holding an IAM temporary credential from the Security Token Service has the data key unwrapped to its plaintext form and uses it to decrypt MyData.]
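The flow above is the envelope-encryption pattern: a master key wraps per-object data keys, and only the wrapped (ciphertext) key is stored with the object. A toy, stdlib-only sketch of the pattern; the SHA-256 XOR keystream below is a stand-in for real KMS/AES operations and is not secure:

```python
import hashlib
import os

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256-derived keystream.
    Illustration of the wrap/unwrap structure only, NOT real crypto."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

master_key = os.urandom(32)    # held by the "KMS" side, never stored with the data
data_key = os.urandom(32)      # plaintext data key, used once then discarded

ciphertext_key = keystream_xor(master_key, data_key)  # wrapped key, stored in object metadata
body = keystream_xor(data_key, b"my data")            # encrypted object body

# To read: unwrap the data key with the master key, then decrypt the body.
recovered_key = keystream_xor(master_key, ciphertext_key)
assert keystream_xor(recovered_key, body) == b"my data"
```

With real KMS, the wrap and unwrap steps would be GenerateDataKey and Decrypt calls, gated by the caller's temporary credentials.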
Secure Data Flow
[Diagram: users obtain temporary credentials from the Security Token Service via a token vending machine (TVM) on Elastic Beanstalk, governed by IAM. With those credentials they call the API Gateway to query the metadata index in DynamoDB, and access encrypted data in Amazon S3 under keys of the form s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID}.]
API & UI
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected
Data Lake API & UI
Exposes the Metadata API, search, and Amazon S3 storage services to customers
Can be based on TVM/STS Temporary Access for many services, and a bespoke API for Metadata
Drive all UI operations from API?
Introducing Amazon API Gateway
Host multiple versions and stages of APIs
Create and distribute API keys to developers
Leverage AWS Sigv4 to authorize access to APIs
Throttle and monitor requests to protect the backend
Leverages AWS Lambda
Additional Features
Managed cache to store API responses
Reduced latency and DDoS protection through Amazon CloudFront
SDK generation for iOS, Android, and JavaScript
Swagger support
Request / response data transformation and API mocking
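Request throttling of the kind API Gateway applies is commonly modelled as a token bucket; a minimal sketch of the idea (the rate and burst numbers are arbitrary):

```python
class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate` tokens/second."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, burst=5)
# Five requests at t=0 pass (the burst); the sixth is throttled.
results = [bucket.allow(0.0) for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```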
An API Call Flow
[Diagram: mobile apps, websites, and services call the API over the Internet through Amazon CloudFront to the API Gateway. The gateway serves responses from the API Gateway cache or forwards requests to AWS Lambda functions, endpoints on Amazon EC2, or any other publicly accessible endpoint, with Amazon CloudWatch monitoring throughout.]
API & UI Architecture
[Diagram: users interact with the UI (Elastic Beanstalk), which calls the API Gateway; the gateway invokes AWS Lambda functions over the metadata index, with IAM and a TVM on Elastic Beanstalk providing credentials.]
A Data Lake Is…
• A foundation of highly durable data storage and streaming of any type of data
• A metadata index and workflow which helps us categorise and govern data stored in the data lake
• A search index and workflow which enables data discovery
• A robust set of security controls: governance through technology, not policy
• An API and user interface that expose these features to internal and external users
Storage & Streams: Amazon Kinesis, Amazon S3, Amazon Glacier
Data Catalogue & Search: AWS Lambda maintaining the metadata index and the search index
API & UI: API Gateway and a UI on Elastic Beanstalk, serving users
Entitlements: IAM, the Security Token Service, a TVM on Elastic Beanstalk, and KMS protecting the encrypted data