Building a Data Lake on AWS
![Page 1: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/1.jpg)
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steve Abraham, Solutions Architect
October 26, 2016
Building a Data Lake on AWS
![Page 2: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/2.jpg)
Evolution of “Data Lakes”
![Page 3: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/3.jpg)
Databases
Transactions
Data warehouse
Evolution of big data architecture
Extract, transform and load (ETL)
![Page 4: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/4.jpg)
Databases
Files
Transactions
Logs
Data warehouse
Evolution of big data architecture
ETL
ETL
![Page 5: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/5.jpg)
Databases
Files
Streams
Transactions
Logs
Events
Data warehouse
Evolution of big data architecture
? Hadoop ?
ETL
ETL
![Page 6: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/6.jpg)
Amazon Glacier
Amazon S3
Amazon DynamoDB
Amazon RDS
Amazon EMR
Amazon Redshift
AWS Data Pipeline
Amazon Kinesis
Amazon CloudSearch
Amazon Kinesis-enabled app
AWS Lambda
Amazon Machine Learning
Amazon SQS
Amazon ElastiCache
Amazon DynamoDB Streams
A growing ecosystem…
![Page 7: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/7.jpg)
Databases
Files
Streams
Transactions
Logs
Events
Data warehouse
Data Lake
The Genesis of “Data Lakes”
![Page 8: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/8.jpg)
What really is a “Data Lake”
![Page 9: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/9.jpg)
Components of a Data Lake
Collect & Store: A foundation of highly durable data storage and streaming of any type of data
Catalogue & Search: A search index and workflow which enables data discovery
Entitlements: A robust set of security controls (governance through technology, not policy)
API & UI: An API and user interface that expose these features to internal and external users
![Page 10: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/10.jpg)
Storage
High durability
Stores raw data from input sources
Support for any type of data
Low cost
![Page 11: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/11.jpg)
Data Lake – Hadoop (HDFS) as the Storage
Search
Access
Query
Process
Archive
![Page 12: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/12.jpg)
Transactions
Data Lake – Amazon S3 as the storage
Search
Access
Query
Process
Archive
Amazon RDS
Amazon DynamoDB
Amazon Elasticsearch Service
Amazon Glacier
Amazon S3
Amazon Redshift
Amazon Elastic MapReduce
Amazon Machine Learning
Amazon ElastiCache
![Page 13: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/13.jpg)
Metadata lake
Used for summary statistics and data classification management
Simplified model for data discovery & governance
Catalogue & search
![Page 14: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/14.jpg)
Catalogue & Search Architecture
![Page 15: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/15.jpg)
Encryption for data protection
Authentication & Authorization
Access Control & Restrictions
Entitlements
![Page 16: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/16.jpg)
Data Protection via Encryption
AWS CloudHSM: Dedicated Tenancy SafeNet Luna SA HSM Device; Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service: Automated key rotation & auditing; Integration with other AWS services
AWS server side encryption: AWS managed key infrastructure
![Page 17: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/17.jpg)
Entitlements – Access to Encryption Keys
Customer Master Key
Customer Data Keys
Ciphertext Key
Plaintext Key
IAM Temporary Credential
Security Token Service
MyData
MyData
S3
S3 Object…
Name: MyData
Key: Ciphertext Key…
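The diagram above suggests an envelope-encryption flow: a data key is generated under the customer master key, the data is encrypted with the plaintext key, and the ciphertext key is stored alongside the S3 object. A minimal sketch of that flow follows. It assumes `kms` and `s3` are boto3-style clients passed in by the caller; the metadata field name `ciphertext-key` and the XOR "cipher" are hypothetical placeholders (the XOR function is not secure and only marks where a real cipher such as AES-GCM would go).

```python
import base64


def xor_placeholder_cipher(key: bytes, data: bytes) -> bytes:
    """Placeholder only (NOT secure): marks where a real cipher would be used."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def put_encrypted_object(kms, s3, master_key_id, bucket, name, data,
                         encrypt=xor_placeholder_cipher):
    """Encrypt `data` with a fresh data key and store it with the wrapped key."""
    # Ask KMS for a data key generated under the customer master key.
    data_key = kms.generate_data_key(KeyId=master_key_id, KeySpec="AES_256")

    # Encrypt locally with the plaintext key; the plaintext key is never stored.
    body = encrypt(data_key["Plaintext"], data)

    # Keep only the ciphertext (wrapped) key next to the object.
    wrapped = base64.b64encode(data_key["CiphertextBlob"]).decode("ascii")
    s3.put_object(
        Bucket=bucket,
        Key=name,
        Body=body,
        Metadata={"ciphertext-key": wrapped},  # hypothetical metadata name
    )
    return body
```

To read the object back, a caller with permission on the master key would send the stored ciphertext key to KMS for decryption and then decrypt the body locally.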
![Page 18: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/18.jpg)
Exposes the data lake to customers
Programmatically query catalogue
Expose search API
Ensures that entitlements are respected
API & UI
![Page 19: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/19.jpg)
API & UI Architecture
API Gateway
UI - Elastic Beanstalk
AWS Lambda
Metadata Index
Users
IAM
TVM - Elastic Beanstalk
![Page 20: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/20.jpg)
Putting It All Together
![Page 21: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/21.jpg)
Amazon Kinesis
Amazon S3
Amazon Glacier
IAM
Encrypted Data
Security Token Service
AWS Lambda
Search Index
Metadata Index
API Gateway
Users
UI - Elastic Beanstalk
KMS
Collect & Store
Catalogue & Search
Entitlements & Access Controls
APIs & UI
![Page 22: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/22.jpg)
Amazon S3 - Foundation for your Data Lake
![Page 23: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/23.jpg)
Durable: Designed for 11 9s of durability
Available: Designed for 99.99% availability
High performance: Multipart upload; Range GET
Scalable: Store as much as you need; Scale storage and compute independently; No minimum usage commitments
Integrated: Amazon Elastic MapReduce; Amazon Redshift; Amazon DynamoDB
Easy to use: Simple REST API; AWS SDKs; Read-after-create consistency; Event Notification; Lifecycle policies
Why Amazon S3 for Data Lake?
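Range GET lets a client fetch a large object in parallel pieces. As a rough illustration, the helper below splits an object of known size into HTTP `Range` header values; the function name and the part size are illustrative assumptions, and each returned string could be passed as the `Range` argument of an S3 GET request.

```python
from typing import List


def byte_ranges(object_size: int, part_size: int) -> List[str]:
    """Split [0, object_size) into inclusive HTTP Range header values."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, object_size, part_size):
        # HTTP byte ranges are inclusive, so the last byte is end - 1.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges
```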
![Page 24: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/24.jpg)
Why Amazon S3 for Data Lake?
Natively supported by frameworks like Spark, Hive, and Presto
Can run transient Hadoop clusters
Multiple clusters can use the same data
Highly durable, available, and scalable
Low Cost: S3 Standard starts at $0.0275 per GB per month
![Page 25: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/25.jpg)
AWS Direct Connect
AWS Snowball
ISV Connectors
Amazon Kinesis Firehose
S3 Transfer Acceleration
AWS Storage Gateway
Data Ingestion into Amazon S3
![Page 26: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/26.jpg)
Choice of storage classes on S3
Standard: Active data
Standard - Infrequent Access: Infrequently accessed data
Amazon Glacier: Archive data
![Page 27: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/27.jpg)
Security: Identity and Access Management (IAM) policies; Bucket policies; Access Control Lists (ACLs); Query string authentication
Encryption: SSL endpoints; Server Side Encryption (SSE-S3); Server Side Encryption with provided keys (SSE-C, SSE-KMS); Client-side Encryption
Compliance: Bucket access logs; Lifecycle Management Policies; Access Control Lists (ACLs); Versioning & MFA deletes; Certifications (HIPAA, PCI, SOC 1/2/3, etc.)
Implement the right controls
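One way to combine two of the controls listed above (bucket policies and SSL endpoints) is a bucket policy that denies any request not made over TLS. The sketch below only builds the policy document as JSON; the function name and the statement ID are illustrative, and the bucket name is supplied by the caller.

```python
import json


def deny_insecure_transport_policy(bucket: str) -> str:
    """Return a bucket policy (JSON) that rejects requests made without TLS."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # illustrative statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

The resulting string could then be attached to the bucket through the console or an SDK call that sets the bucket policy.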
![Page 28: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/28.jpg)
Use Case
“We use S3 as the “source of truth” for our cloud-based data warehouse. Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from (Netflix-enabled) televisions, laptops, and mobile devices every hour captured by our log data pipeline (called Ursula), plus dimension data from Cassandra supplied by our Aegisthus pipeline.”
Eva Tse, Director, Big Data Platform
Source: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
![Page 29: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/29.jpg)
Tip #1: Use versioning
Protects from accidental overwrites and deletes
New version with every upload
Easy retrieval of deleted objects and roll back to previous versions
Versioning
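Versioning is switched on per bucket. A minimal sketch, assuming `s3_client` is a boto3 S3 client (for example `boto3.client("s3")`) created by the caller and that the bucket already exists:

```python
def enable_versioning(s3_client, bucket: str) -> None:
    """Turn on versioning so each upload creates a new object version."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
```

Passing the client in keeps the helper easy to exercise with a stub object before it is pointed at a real bucket.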
![Page 30: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/30.jpg)
Tip #2: Use lifecycle policies
Automatic tiering and cost controls
Includes two possible actions:
Transition: archives to Standard - IA or Amazon Glacier based on the object age you specify
Expiration: deletes objects after a specified time
Actions can be combined
Set policies at the bucket or prefix level
Set policies for current version or non-current versions
Lifecycle policies
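The two actions above can be combined in a single lifecycle rule. The sketch below builds one rule as a plain dictionary in the shape the S3 lifecycle configuration API accepts; the day counts and the rule ID format are illustrative assumptions, not recommendations from the talk.

```python
def tiering_rule(prefix: str, ia_days: int = 30, glacier_days: int = 90,
                 expire_days: int = 365, noncurrent_days: int = 30) -> dict:
    """Build a rule that tiers objects under `prefix`, then expires them."""
    # Each step must happen strictly later than the previous one.
    # S3 also enforces its own minimum ages for some transitions;
    # check the current documentation before choosing values.
    if not 0 < ia_days < glacier_days < expire_days:
        raise ValueError("expected 0 < ia_days < glacier_days < expire_days")
    return {
        "ID": f"tier-and-expire-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Transition: move current versions to cheaper storage classes.
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
        # Expiration: delete current versions after expire_days.
        "Expiration": {"Days": expire_days},
        # Separate clock for non-current versions in versioned buckets.
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
    }
```

A configuration such as `{"Rules": [tiering_rule("logs/")]}` could then be applied to a bucket with an SDK call that sets the bucket lifecycle configuration.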
![Page 31: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/31.jpg)
Versioning
Lifecycle policies
Recycle bin
Automatic cleaning
Versioning + lifecycle policies
![Page 32: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/32.jpg)
Expired object delete marker policy
Deleting a versioned object makes a delete marker the current version of the object
Removing expired object delete marker can improve list performance
Lifecycle policy automatically removes the current version delete marker when previous versions of the object no longer exist
Expired object delete marker
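As a sketch of the policy described above, the rule below expires non-current versions and asks S3 to remove a delete marker once no older versions remain behind it. The rule ID and the 30-day default are hypothetical; the dictionary follows the lifecycle rule shape used by the S3 API.

```python
def delete_marker_cleanup_rule(prefix: str = "", noncurrent_days: int = 30) -> dict:
    """Build a rule that cleans up old versions and orphaned delete markers."""
    if noncurrent_days <= 0:
        raise ValueError("noncurrent_days must be positive")
    return {
        "ID": "remove-expired-delete-markers",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # First, previous versions age out...
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
        # ...then the delete marker left as the only version is removed.
        "Expiration": {"ExpiredObjectDeleteMarker": True},
    }
```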
![Page 33: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/33.jpg)
Enable policy with the console
![Page 34: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/34.jpg)
Incomplete multipart upload expiration policy
Partial upload does incur storage charges
Set a lifecycle policy to automatically make incomplete multipart uploads expire after a predefined number of days
Incomplete multipart upload expiration
Best Practice
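The best practice above maps to a single lifecycle action. A minimal sketch of such a rule, with a hypothetical rule ID and an assumed default of 7 days:

```python
def abort_incomplete_uploads_rule(days: int = 7, prefix: str = "") -> dict:
    """Build a rule that aborts multipart uploads left unfinished for `days`."""
    if days <= 0:
        raise ValueError("days must be positive")
    return {
        "ID": f"abort-incomplete-mpu-after-{days}d",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Parts of uploads that were never completed are removed,
        # so they stop accruing storage charges.
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }
```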
![Page 35: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/35.jpg)
Enable policy with the Management Console
![Page 36: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/36.jpg)
Considerations for organizing your Data Lake
Amazon S3 storage uses a flat keyspace
Separate data by business unit, application, type, and time
Natural data partitioning is very useful
Paths should be self documenting and intuitive
Changing prefix structure in future is hard/costly
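To make the considerations above concrete, here is one possible key layout that separates data by business unit, application, type, and time. The segment order, the `year=/month=/day=` naming, and the function name are assumptions for illustration; the talk does not prescribe a specific scheme.

```python
from datetime import date


def build_key(business_unit: str, application: str, data_type: str,
              day: date, filename: str) -> str:
    """Compose a self-documenting S3 key partitioned by date."""
    parts = [
        business_unit,
        application,
        data_type,
        f"year={day:%Y}",
        f"month={day:%m}",
        f"day={day:%d}",
        filename,
    ]
    for part in parts:
        # S3 keys are flat strings; "/" is only a naming convention,
        # so reject segments that would silently add extra levels.
        if not part or "/" in part:
            raise ValueError(f"invalid key segment: {part!r}")
    return "/".join(parts)
```

Because changing the prefix structure later is costly, a helper like this is best agreed on before the first objects are written.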
![Page 37: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/37.jpg)
Best Practices for your Data Lake
Always store a copy of raw input as the first rule of thumb
Use automation with S3 Events to enable trigger based workflows
Use a format that supports your data, rather than force your data into a format
Apply compression everywhere to reduce the network load
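A trigger-based workflow driven by S3 events could start from an AWS Lambda handler like the sketch below. It only extracts the bucket and key of newly created objects from the notification payload; what happens next (cataloguing, conversion, compression) is left open, and the function names are illustrative.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus


def new_objects(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for object-created records in an S3 event."""
    found = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # ignore deletes and other notifications
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications.
        found.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return found


def handler(event, context):
    """Lambda entry point: log each new object and report the count."""
    objects = new_objects(event)
    for bucket, key in objects:
        print(f"new object: s3://{bucket}/{key}")
    return {"processed": len(objects)}
```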
![Page 38: Building a Data Lake on AWS](https://reader035.vdocuments.net/reader035/viewer/2022081520/586fb39b1a28abe57d8b6d47/html5/thumbnails/38.jpg)
Thank you!