AWS March 2016 Webinar Series: Building Your Data Lake on AWS
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Meyers, Principal Solution Architect, AWS
March 2016
Replay of BDT317
Building Your Data Lake on AWS
Benefits of the Enterprise Data Warehouse
Self-documenting schema
Enforced data types
Ubiquitous and common security model
Simple tools to access, robust ecosystem
Transactionality
Benefits of Separation of Compute & Storage
All your data, without paying for unused cores
Independent cost attribution per dataset
Use the right tool for a job, at the right time
Increased durability without operations
Common model for data, without enforcing access method
Comparison of a Data Lake to an Enterprise Data Warehouse

Data Lake | Enterprise Data Warehouse
Complementary to the EDW (not a replacement) | Data lake can be a source for the EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time-consuming to introduce new content
Data science + prediction/advanced analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
The New Problem
Enterprise data warehouse ≠ EMR + S3
Which system has my data?
How can I do machine learning against the DW?
I built this in Hive, can we get it into the Finance reports?
These sources are giving different results…
But I implemented the algorithm in Anaconda…
Dive Into The Data Lake
[Diagram: the data lake (EMR + S3) ingests any data and provides data cleansing, a data catalogue, trend analysis, and machine learning. It loads cleansed data into the enterprise data warehouse, which offers structured analysis, common access tools, efficient aggregation, and structured business rules, and exports computed aggregates back to the lake.]
Components of a Data Lake
Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost

Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as a stream
• Facilitates low latency analytics
Storage & Streams
Catalogue & Search
Entitlements
API & UI
Components of a Data Lake
Catalogue
• Metadata lake
• Used for summary statistics and data classification management
Search
• Simplified access model for data discovery
Components of a Data Lake
Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions
Components of a Data Lake
API & User Interface
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected
Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost
Simple Storage Service
• Highly scalable object storage for the Internet
• 1 byte to 5 TB in size
• Designed for 99.999999999% durability, 99.99% availability
• Regional service, no single points of failure
• Server-side encryption
Data Storage Format
• Not all data formats are created equally
• Unstructured vs. semi-structured vs. structured
• Store a copy of raw input
• Data standardisation as a workflow following ingest
• Use a format that supports your data, rather than force your data into a format
• Consider how data will change over time
• Apply common compression
Consider Different Types of Data
Unstructured
• Store native file format (logs, dump files, whatever)
• Compress with a streaming codec (LZO, Snappy)

Semi-structured (JSON, XML files, etc.)
• Consider evolution ability of the data schema (Avro)
• Store the schema for the data as a file attribute (metadata/tag)

Structured
• Lots of data is CSV!
• Columnar storage (ORC, Parquet)
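"Apply common compression" can be tried directly with Python's standard gzip module (a stand-in here for the streaming codecs named above; the sample CSV rows are made up):

```python
import gzip

# Hypothetical raw CSV feed data, as it might arrive on ingest.
rows = ["id,date,value"] + [f"{i},2016-03-{i % 28 + 1:02d},{i * 1.5}" for i in range(10000)]
raw = "\n".join(rows).encode("utf-8")

# Compress once as part of the ingest workflow; repetitive,
# structured text like CSV compresses very well.
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")

# Decompression restores the exact original bytes.
assert gzip.decompress(compressed) == raw
```

LZO and Snappy trade compression ratio for speed and splittability, which matters for streaming ingest; gzip appears here only because it ships with the standard library.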
Where to Store Data
Amazon S3 storage uses a flat keyspace
Separate data storage by business unit, application, type, and time
Natural data partitioning is very useful
Paths should be self-documenting and intuitive
Changing prefix structure in future is hard/costly
Metadata Services
CRUD API
Query API
Analytics API
Systems of Reference
Return URLs: deep links to applications, file exchanges via S3 (RESTful file services), or manifests for Big Data analytics / HPC
Integration Layer
System-to-system via Amazon SNS/Amazon SQS; system-to-user via mobile push
Amazon Simple Workflow for high level system integration / orchestration
http://en.wikipedia.org/wiki/Resource-oriented_architecture
s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied}
Resource Oriented Architecture
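A key following this template can be assembled with a small helper; a sketch (the component values are hypothetical, and the #appliedSecurity fragment is omitted):

```python
from datetime import date

def data_lake_key(system: str, application: str, day: date,
                  resource: str, resource_id: str) -> str:
    """Build a self-documenting S3 key following the
    ${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID} template."""
    return f"s3://{system}/{application}/{day.isoformat()}/{resource}/{resource_id}"

key = data_lake_key("finance", "billing", date(2016, 3, 1), "invoice", "42")
print(key)  # s3://finance/billing/2016-03-01/invoice/42
```

Because changing the prefix structure later is hard and costly, centralising key construction in one helper keeps every writer consistent.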
Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as a stream
• Facilitates low latency analytics
Why Do Streams Matter?
• Latency between event & action
• Most BI systems target event-to-action latency of 1 hour
• Streaming analytics would expect event-to-action latency < 2 seconds
• Stream orientation simplifies architecture, but can increase operational complexity
• Increase in complexity needs to be justified by business value of reduced latency
Amazon Kinesis
• Managed service for real-time big data processing
• Create streams to produce & consume data
• Elastically add and remove shards for performance
• Use the Amazon Kinesis Client Library to process data
• Integration with S3, Amazon Redshift, and DynamoDB
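Kinesis routes each record by taking the MD5 hash of its partition key and mapping the resulting 128-bit integer into the hash key range owned by one shard. A stdlib-only sketch of that routing, assuming the shards split the hash key space evenly:

```python
import hashlib

def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index the way Kinesis does:
    MD5 of the key as a 128-bit integer, bucketed into equal
    contiguous hash key ranges (one range per shard)."""
    h = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    range_size = 2 ** 128 // num_shards
    return min(h // range_size, num_shards - 1)

# The same partition key always lands on the same shard, which is
# what preserves per-key ordering within a stream.
print(shard_for("device-17", 4), shard_for("device-17", 4))
```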
Amazon Kinesis Architecture
[Diagram: data sources send records through the AWS endpoint to a stream whose shards (Shard 1 through Shard N) are replicated across three Availability Zones. Consuming applications include App.1 (archive/ingestion to S3), App.2 (sliding window analysis), App.3 (data loading into Amazon Redshift), and App.4 (event processing systems), with DynamoDB also integrated.]
Streaming Storage Integration
[Diagram: analytics applications read & write file data in the object store (Amazon S3) and read & write to streams (Amazon Kinesis); streams are archived to the object store, and history can be replayed from it.]
Catalogue & Search
• Metadata lake
• Used for summary statistics and data classification management
• Simplified model for data discovery & governance
Building a Data Catalogue
• Aggregated information about your storage & streaming layer
• Storage service for metadata: ownership, data lineage
• Data abstraction layer: customer data = collection of prefixes
• Enabling data discovery
• API for use by entitlements service
Data Catalogue – Metadata Index
• Stores data about your Amazon S3 storage environment
• Total size & count of objects by prefix, data classification, refresh schedule, object version information
• Amazon S3 events processed by Lambda function
• DynamoDB metadata tables store required attributes
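The Lambda step above might reduce each S3 event to a DynamoDB item like the following; a sketch using a pure helper (the event shape follows the S3 notification format, while the item attribute names are assumptions, and the actual `put_item` call is left out):

```python
def s3_event_to_metadata_items(event: dict) -> list:
    """Turn S3 object-created notification records into metadata
    items keyed by prefix, ready to put into a DynamoDB table."""
    items = []
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        items.append({
            "bucket": record["s3"]["bucket"]["name"],
            # Group objects under their containing prefix.
            "prefix": key.rsplit("/", 1)[0] if "/" in key else "",
            "object_key": key,
            "size_bytes": record["s3"]["object"].get("size", 0),
            "event_time": record.get("eventTime"),
        })
    # Inside the Lambda handler, each item would then be written with
    # table.put_item(Item=item) against the metadata table.
    return items

event = {"Records": [{"eventTime": "2016-03-01T12:00:00Z",
                      "s3": {"bucket": {"name": "mydatalake"},
                             "object": {"key": "finance/billing/2016-03-01/invoice/42",
                                        "size": 1024}}}]}
print(s3_event_to_metadata_items(event))
```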
Amazon DynamoDB
• Provisioned throughput NoSQL database
• Fast, predictable, configurable performance
• Fully distributed, fault-tolerant HA architecture
• Integration with Amazon EMR & Hive
AWS Lambda
• Fully-managed event processor
• Node.js or Java, integrated AWS SDK
• Natively compile & install any Node.js modules
• Specify runtime RAM & timeout
• Automatically scaled to support event volume
• Events from Amazon S3, Amazon SNS, Amazon DynamoDB, Amazon Kinesis, & AWS Lambda
• Integrated CloudWatch logging
Data Catalogue – Search
Ingestion and pre-processing
Text processing (normalization)
• Tokenization
• Downcasing
• Stemming
• Stopword removal
• Synonym addition

Indexing
Matching
Ranking and relevance
• TF-IDF
• Additional criteria (rating, user behavior, freshness, etc.)

[Diagram: a processor builds a search index from any source (NoSQL, RDBMS, files).]
Features and Benefits
Easy to set up and operate
• AWS Management Console, SDK, CLI

Scalable
• Automatic scaling on data size and traffic

Reliable
• Automatic recovery of instances, multi-AZ, etc.

High performance
• Low latency and high throughput through in-memory caching

Fully managed
• No capacity guessing

Rich features
• Faceted search, suggestions, relevance ranking, geospatial search, multi-language support, etc.

Cost effective
• Pay as you go

Amazon CloudSearch & Elasticsearch
Data Catalogue – Building Search Index
Enable a DynamoDB Update Stream for the metadata index table
An additional AWS Lambda function reads the Update Stream and extracts index fields from the S3 object
The function updates the Amazon CloudSearch domain
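That second Lambda function might translate each stream record into a CloudSearch document-batch entry; a sketch (the record shape follows DynamoDB Streams, the index field names are assumptions, and the upload to the CloudSearch domain is omitted):

```python
def stream_record_to_search_doc(record: dict) -> dict:
    """Convert a DynamoDB stream INSERT/MODIFY record into a
    CloudSearch document-batch 'add' entry."""
    image = record["dynamodb"]["NewImage"]
    # Stream images wrap every value in a type descriptor, e.g. {"S": "..."};
    # only string attributes are indexed in this sketch.
    fields = {name: attr["S"] for name, attr in image.items() if "S" in attr}
    return {
        "type": "add",
        # Normalise the object key into a simple document ID.
        "id": fields["object_key"].replace("/", "_"),
        "fields": fields,
    }

record = {"eventName": "INSERT",
          "dynamodb": {"NewImage": {
              "object_key": {"S": "finance/billing/2016-03-01/invoice/42"},
              "classification": {"S": "internal"}}}}
doc = stream_record_to_search_doc(record)
print(doc["id"])  # finance_billing_2016-03-01_invoice_42
```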
Entitlements
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions
Identity & Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with the AWS Security Token Service (STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
IAM Policy Language
JSON documents
Can include variables which extract information from the request context
aws:CurrentTime | For date/time conditions
aws:EpochTime | The date in epoch or UNIX time, for use with date/time conditions
aws:TokenIssueTime | The date/time that temporary security credentials were issued, for use with date/time conditions
aws:principaltype | Whether the principal is an account, user, federated user, or assumed role
aws:SecureTransport | Boolean representing whether the request was sent using SSL
aws:SourceIp | The requester's IP address, for use with IP address conditions
aws:UserAgent | Information about the requester's client application, for use with string conditions
aws:userid | The unique ID for the current user
aws:username | The friendly name of the current user
IAM Policy Language
Example: Allow a user to access a private part of the data lake
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mydatalake"],
      "Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}
    },
    {
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]
    }
  ]
}
IAM Federation
IAM allows federation with Active Directory, and with OpenID providers such as Amazon, Facebook, and Google
AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS
[Diagram: on-premises users federate into IAM via AWS Directory Service's AD Connector, connected over AWS Direct Connect or a hardware VPN.]
Data Encryption
AWS CloudHSM
• Dedicated-tenancy SafeNet Luna SA HSM device
• Common Criteria EAL4+, NIST FIPS 140-2

AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services

AWS server-side encryption
• AWS managed key infrastructure
Entitlements – Access to Encryption Keys
[Diagram: a customer master key wraps per-object customer data keys; the ciphertext key is stored with the object (the S3 object "MyData" carries the ciphertext key in its metadata). A caller holding an IAM temporary credential from the Security Token Service has the data key unwrapped to its plaintext form and uses it to decrypt MyData.]
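The flow above is the envelope-encryption pattern: a master key wraps per-object data keys, and only the wrapped (ciphertext) key is stored with the object. A toy, stdlib-only sketch of the pattern; the SHA-256 XOR keystream below is a stand-in for real KMS/AES operations and is not secure:

```python
import hashlib
import os

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256-derived keystream.
    Illustration of the wrap/unwrap structure only, NOT real crypto."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

master_key = os.urandom(32)    # held by the "KMS" side, never stored with the data
data_key = os.urandom(32)      # plaintext data key, used once then discarded

ciphertext_key = keystream_xor(master_key, data_key)  # wrapped key, stored in object metadata
body = keystream_xor(data_key, b"my data")            # encrypted object body

# To read: unwrap the data key with the master key, then decrypt the body.
recovered_key = keystream_xor(master_key, ciphertext_key)
assert keystream_xor(recovered_key, body) == b"my data"
```

With real KMS, the wrap and unwrap steps would be GenerateDataKey and Decrypt calls, gated by the caller's temporary credentials.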
Secure Data Flow
[Diagram: users obtain temporary credentials from the Security Token Service via a token vending machine (TVM) on Elastic Beanstalk, governed by IAM. With those credentials they call the API Gateway to query the metadata index in DynamoDB, and access encrypted data in Amazon S3 under keys of the form s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID}.]
API & UI
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected
Data Lake API & UI
Exposes the Metadata API, search, and Amazon S3 storage services to customers
Can be based on TVM/STS Temporary Access for many services, and a bespoke API for Metadata
Drive all UI operations from API?
Introducing Amazon API Gateway
Host multiple versions and stages of APIs
Create and distribute API keys to developers
Leverage AWS Sigv4 to authorize access to APIs
Throttle and monitor requests to protect the backend
Leverages AWS Lambda
Additional Features
Managed cache to store API responses
Reduced latency and DDoS protection through Amazon CloudFront
SDK generation for iOS, Android, and JavaScript
Swagger support
Request / response data transformation and API mocking
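Request throttling of the kind API Gateway applies is commonly modelled as a token bucket; a minimal sketch of the idea (the rate and burst numbers are arbitrary):

```python
class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate` tokens/second."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, burst=5)
# Five requests at t=0 pass (the burst); the sixth is throttled.
results = [bucket.allow(0.0) for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```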
An API Call Flow
[Diagram: mobile apps, websites, and services call the API over the Internet through Amazon CloudFront to the API Gateway. The gateway serves responses from the API Gateway cache or forwards requests to AWS Lambda functions, endpoints on Amazon EC2, or any other publicly accessible endpoint, with Amazon CloudWatch monitoring throughout.]
API & UI Architecture
[Diagram: users interact with the UI (Elastic Beanstalk), which calls the API Gateway; the gateway invokes AWS Lambda functions over the metadata index, with IAM and a TVM on Elastic Beanstalk providing credentials.]
A Data Lake Is…
• A foundation of highly durable data storage and streaming of any type of data
• A metadata index and workflow which helps us categorise and govern data stored in the data lake
• A search index and workflow which enables data discovery
• A robust set of security controls: governance through technology, not policy
• An API and user interface that expose these features to internal and external users
Storage & Streams: Amazon Kinesis, Amazon S3, Amazon Glacier
Data Catalogue & Search: AWS Lambda maintaining the metadata index and the search index
API & UI: API Gateway and a UI on Elastic Beanstalk, serving users
Entitlements: IAM, the Security Token Service, a TVM on Elastic Beanstalk, and KMS protecting the encrypted data