Working With the Data Lake
© 2020, Amazon Web Services, Inc. or its Affiliates. Team or presenters name Date Working With the Data Lake Subtitle


Post on 27-May-2020


TRANSCRIPT

Page 1: Working With the Data Lake · Session’s Focus –Query the The Data Lake Catalog & Search Access & User Interfaces Data Ingestion Analytics & Serving S3 Amazon DynamoDB Amazon Elasticsearch


Page 2

Session’s Focus – Query the Data Lake

Central Storage (scalable, secure, cost-effective): S3

Catalog & Search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service

Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito

Manage & Secure: AWS KMS, AWS IAM, AWS CloudTrail, Amazon CloudWatch

Data Ingestion: AWS Snowball, AWS Storage Gateway, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service, AWS DataSync, AWS Transfer for SFTP, Amazon S3 Transfer Acceleration

Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS

Page 3

Table of contents

1. Querying the Data Lake with Amazon Athena

2. Visualizing the Data Lake with Amazon QuickSight

Page 4

Querying the Data Lake: Working with Amazon Athena

Page 5

When would you query the Data Lake?

Examine New Data Sources

Interrogate Logs

Perform Analytics

Explore Relationships Across Sets

Check Data Quality


Page 6

An interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.

Page 7

Amazon Athena

• Query data in your Amazon S3 based data lake

• Analyze infrastructure, operation, and application logs

• Interactive analytics using popular BI tools

• Self-service data exploration for data scientists

• Embed analytics capabilities into your applications

Page 8

What does it look like?

Page 9

Athena is Serverless

• No Infrastructure or administration

• Zero Spin up time

• Transparent upgrades

Page 10

Use ANSI SQL

• Support for complex joins, nested queries & window functions

• Support for complex data types (arrays, structs)

• Support for partitioning of data by any key


Page 11

Familiar Technologies Under the Covers

Presto, used for SQL queries

• In-memory distributed query engine

• ANSI-SQL compatible with extensions

Apache Hive, used for DDL functionality

• Complex data types

• Multitude of formats

• Supports data partitioning

Page 12

Amazon Athena is Cost Effective

• Pay per query

• $5 per TB scanned from S3

• DDL Queries and failed queries are free

• Save by using compression, columnar formats, partitions
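The pricing bullets above reduce to simple arithmetic, so a short Python sketch can make the savings concrete. The $5 per TB figure comes from the slide; treating 1 TB as 2**40 bytes, the function name, and the 4:1 compression ratio are assumptions made for illustration only.

```python
# Price taken from the slide: $5 per TB scanned from S3.
PRICE_PER_TB_USD = 5.0
# Assumption for this sketch: 1 TB is treated as 2**40 bytes.
BYTES_PER_TB = 2 ** 40


def estimate_query_cost(bytes_scanned: int) -> float:
    """Return the approximate cost in USD of a query scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical data set: 1 TB of raw data versus the same data
# compressed to a quarter of its size (an assumed ratio).
raw_cost = estimate_query_cost(BYTES_PER_TB)
compressed_cost = estimate_query_cost(BYTES_PER_TB // 4)
print(f"raw: ${raw_cost:.2f}, compressed: ${compressed_cost:.2f}")
```

Because the charge follows bytes scanned, anything that reduces the scan (compression, columnar formats that read only the needed columns, partitions that skip whole prefixes) lowers the bill proportionally.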

Page 13

Athena Workgroups

Athena Workgroups are used to isolate queries between different teams, workloads, or applications, and to set limits on the amount of data each query or the entire workgroup can process.

• Workload Isolation

• Query Metrics

• Cost Controls

Page 14

Workgroups – Workload Isolation

• Unique query output location per Workgroup

• Encrypt results with a unique AWS KMS key per Workgroup

• Collect and publish aggregated metrics per Workgroup to Amazon CloudWatch

• Use Workgroup settings, eliminating the need to configure individual users
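As a rough illustration of these isolation settings, the following Python sketch builds a request shaped like the Athena CreateWorkGroup API call (as exposed by boto3's `create_work_group`). It only constructs and prints the request; the workgroup name, bucket, account ID, and key ARN are hypothetical placeholders, not values from the session.

```python
import json


def build_workgroup_request(name: str, output_location: str, kms_key_arn: str) -> dict:
    """Build keyword arguments shaped like Athena's CreateWorkGroup request."""
    return {
        "Name": name,
        "Description": f"Isolated workgroup for {name}",
        "Configuration": {
            "ResultConfiguration": {
                # Unique query output location for this workgroup.
                "OutputLocation": output_location,
                # Results encrypted with a workgroup-specific KMS key.
                "EncryptionConfiguration": {
                    "EncryptionOption": "SSE_KMS",
                    "KmsKey": kms_key_arn,
                },
            },
            # Workgroup settings override per-user client settings.
            "EnforceWorkGroupConfiguration": True,
            # Publish aggregated query metrics to CloudWatch.
            "PublishCloudWatchMetricsEnabled": True,
        },
    }


# All values below are hypothetical placeholders.
request = build_workgroup_request(
    name="marketing-analytics",
    output_location="s3://example-athena-results/marketing/",
    kms_key_arn="arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
)
print(json.dumps(request, indent=2))
# With AWS credentials configured, the request could be sent with:
#   boto3.client("athena").create_work_group(**request)
```

Setting `EnforceWorkGroupConfiguration` is what removes the need to configure individual users: every query in the workgroup inherits the output location and encryption settings.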

Page 15

Workgroups – Metric Reporting

Total bytes scanned per Workgroup

Total failed queries per Workgroup

Total successful queries per Workgroup

Total query execution time per Workgroup

Page 16

Workgroups – Cost Controls

• Per-query data scanned threshold; a query that exceeds it is cancelled

• Trigger alarms to notify of increasing usage and cost

• Disable the Workgroup when the data scanned by all queries exceeds a maximum threshold

Alarms can be based on any Athena metric: successful/failed and total queries, query run time, etc.
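The two kinds of limit can be pictured with a small toy model in Python. This is a local simulation for intuition only, not Athena's implementation: the class, the function, the byte figures, and the choice to count a cancelled query's scan up to its limit are all assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorkgroupLimits:
    per_query_bytes: int   # per-query data scanned threshold
    workgroup_bytes: int   # threshold for all queries in the workgroup combined


def evaluate(limits: WorkgroupLimits, query_scans: List[int]) -> Tuple[List[int], bool]:
    """Return (indexes of cancelled queries, whether the workgroup-wide limit is exceeded)."""
    cancelled = []
    total = 0
    for index, scanned in enumerate(query_scans):
        if scanned > limits.per_query_bytes:
            cancelled.append(index)
            # Toy assumption: a cancelled query still counts up to its limit.
            scanned = limits.per_query_bytes
        total += scanned
    return cancelled, total > limits.workgroup_bytes


# Hypothetical limits and scan sizes, in arbitrary byte units.
limits = WorkgroupLimits(per_query_bytes=100, workgroup_bytes=250)
print(evaluate(limits, [50, 120, 90]))
print(evaluate(limits, [50, 120, 90, 30]))
```

In the first call only the second query crosses the per-query threshold; in the second call the combined usage also crosses the workgroup-wide threshold, which is the point at which an alarm or a disable action would apply.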

Page 17

Workgroups – Usage Notifications

Define a hierarchy of alarms to be alerted as usage increases
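A hierarchy of alarms amounts to a set of ascending thresholds, each tied to a more urgent notification. The Python sketch below illustrates the idea locally; the tier names and thresholds are hypothetical, and in practice the thresholds would be configured as CloudWatch alarms on the workgroup's metrics rather than evaluated in application code.

```python
GIB = 2 ** 30

# Hypothetical alarm tiers: (name, threshold in bytes scanned).
ALARM_TIERS = [
    ("notify-team", 500 * GIB),
    ("notify-manager", 1024 * GIB),
    ("page-on-call", 2048 * GIB),
]


def triggered_alarms(bytes_scanned: int, tiers=ALARM_TIERS) -> list:
    """Return the names of all tiers whose threshold has been reached, lowest first."""
    ordered = sorted(tiers, key=lambda tier: tier[1])
    return [name for name, threshold in ordered if bytes_scanned >= threshold]


print(triggered_alarms(600 * GIB))
print(triggered_alarms(1500 * GIB))
```

Each higher tier fires in addition to the ones below it, so notifications escalate as usage grows.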

Page 18

Athena support for interface endpoints (PrivateLink)

Submit queries securely

• No internet gateway required in your VPC

• Secure communication between your VPC and Athena APIs

• Set VPC endpoint policies

• Example endpoint policy
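The example policy itself is not reproduced in this transcript. The following Python sketch shows what a VPC endpoint policy for Athena could look like, built as a dictionary and printed as JSON. It is an illustration, not the slide's original: the Region, account ID, and workgroup name are hypothetical, and the set of allowed actions is an assumed minimal selection.

```python
import json

# Hypothetical identifiers used only for illustration.
REGION = "us-east-1"
ACCOUNT_ID = "123456789012"
WORKGROUP = "example-workgroup"

endpoint_policy = {
    "Statement": [
        {
            "Principal": "*",
            "Effect": "Allow",
            # Assumed minimal set of actions for running and reading queries.
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
                "athena:StopQueryExecution",
            ],
            # Restrict calls through this endpoint to a single workgroup.
            "Resource": [
                f"arn:aws:athena:{REGION}:{ACCOUNT_ID}:workgroup/{WORKGROUP}"
            ],
        }
    ]
}

print(json.dumps(endpoint_policy, indent=2))
```

A policy of this shape limits which Athena API calls can pass through the interface endpoint, independently of the IAM permissions attached to the caller.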

Page 19

Athena support for INSERT INTO

Inserts new rows into a destination table based on a SELECT query statement that runs on a source table, or based on a set of VALUES provided as part of the statement.

• Supported formats: Avro, JSON, ORC, Parquet, text file

• Example
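The slide's example is not reproduced in this transcript. As a stand-in, the Python sketch below assembles the two forms of the statement described above, one driven by a SELECT and one by a set of VALUES. The table names, columns, and helper functions are hypothetical, and the plain string formatting is only suitable for trusted, hard-coded input.

```python
def _literal(value) -> str:
    """Render a Python value as a simple SQL literal (strings are single-quoted)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def insert_from_select(destination: str, source: str, where: str) -> str:
    """INSERT INTO ... SELECT form: copy matching rows from a source table."""
    return f"INSERT INTO {destination} SELECT * FROM {source} WHERE {where}"


def insert_values(destination: str, rows) -> str:
    """INSERT INTO ... VALUES form: insert rows given in the statement itself."""
    rendered = ", ".join(
        "(" + ", ".join(_literal(value) for value in row) + ")" for row in rows
    )
    return f"INSERT INTO {destination} VALUES {rendered}"


# Hypothetical tables and values.
print(insert_from_select("orders_parquet", "orders_raw", "order_year = 2020"))
print(insert_values("cities", [(1, "Lansing"), (2, "O'Hare")]))
```

The resulting strings could then be submitted like any other Athena query, for example through the console or the StartQueryExecution API.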

Page 20

Visualizing the Data Lake: Working with Amazon QuickSight

Page 21

Why Amazon QuickSight

• Cloud native = No servers = Auto-Scale: No servers or software to manage, maintain, or deploy. Start with 10s of users and scale to 10s of 1000s.

• Fully integrated with AWS: Build end-to-end analytics in AWS. Secure private VPC access, fine-grained access control, ML integrations.

• Secure and global: End-to-end encryption. Native high availability. 10 global Regions. HIPAA, PCI, ISO, SOC, and FedRAMP eligibility.

• Customize and embed: Embed in applications and enable analytics in hours, not months or years. Use themes to match application/corporate branding.

• Easy to develop and maintain: Design with Amazon QuickSight, integrate with APIs. Secure data with row-level security and authenticate seamlessly via single sign-on.

• Fast, consistent performance: Fast, predictable performance every time. Concurrent users or increased interactions do not slow down the system.

• ML insights: Contextual, relevant insights with ML-powered anomaly detection, forecasting, alerts, and customizable narratives.

• Insights for everyone: Provide access to all users, pay only for usage. No upfront costs, no charges for inactive users.

Page 22

Connect to your data, wherever it is

QuickSight is natively integrated with AWS data sources, as well as on-premises and hosted databases and third-party business applications.

On-premises: Securely connect to on-premises databases and flat files like Excel and CSV

• Excel • CSV • Teradata • MySQL • SQL Server • PostgreSQL

In the cloud: Connect to hosted databases, big data formats, and secure VPCs

• Redshift • RDS • S3 • Athena • Aurora • Teradata • MySQL • Presto • Spark • SQL Server • PostgreSQL • MariaDB • Snowflake • IoT Analytics

Applications: Connect directly to third-party business applications

• Salesforce • Square • Adobe Analytics • Jira • ServiceNow • Twitter • GitHub

Page 23

AWS + Amazon QuickSight

Diagram: Amazon S3, Amazon Athena, Amazon Redshift, Amazon RDS, and Amazon EMR connected to Amazon QuickSight

• Create dashboards in minutes

• Native fine-grained permissions for users with AWS Identity and Access Management (IAM)

• Cost allocation by business unit or team

• Private VPC connectivity = no public routing of data

• Direct query to data sources or SPICE for fast access

• Pay for what you use

Page 24

Data Prep

• Preview data

• Rename, remove fields, change data types

• Create new calculated fields

• Filter rows

• Issue direct query or ingest to SPICE

• Visually join tables from the same relational DB

• Push down custom SQL queries

Page 25

SPICE

QuickSight is powered by SPICE, a super-fast calculation engine that delivers performance and scale, regardless of how many users are active.

Diagram: Your Data Source, SPICE

Page 26

SPICE

SPICE not only provides your users with instant response times, but also automatically scales with user activity, protecting your underlying data sources and saving you time and money.

• Up to 10X faster (millisecond latency)

• Support for high concurrency

• Fault-tolerant, self-healing

• Instant failover with zero impact

• Backed up in S3 (Write Ahead Log)

Diagram: QuickSight Query Layer over SPICE tables in Availability Zones 1, 2, and 3, each backed by S3

Page 27

User Types / User Roles

Admin (QS Admin): Sometimes separate from Business Users, sometimes the same. Usually has AWS Console access.

• Manage Users • Manage SPICE Capacity • Manage VPC Connections • Manage Account Settings

Author (Analyst): Sometimes in IT, sometimes Business Users. ‘Data Analyst’, ‘Data Engineer’, ‘BI Engineer’.

• Create Data Sets • Create Analyses • Create Dashboards

Reader (Business User): Anyone. Can be internal or external users (customers/partners/3rd parties).

• Consume Dashboards

Page 28

Data governance

Create managed datasets that give power users and authors the flexibility to perform self-serve analytics on data that you control.

Create datasets that:

• Can be shared with any user

• Automatically refresh

• Have row-level security

• Users cannot modify

• Dynamically update with changes
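Row-level security can be pictured as a permissions table that restricts which rows each user sees. The Python sketch below is a conceptual toy model, not QuickSight's implementation: the rows, user names, and rule format are hypothetical, and it assumes that a user with no rule sees nothing while a user with an empty rule sees everything.

```python
# Hypothetical dataset rows.
ROWS = [
    {"region": "EMEA", "product": "A", "revenue": 120},
    {"region": "AMER", "product": "A", "revenue": 300},
    {"region": "APAC", "product": "B", "revenue": 80},
]

# Hypothetical permission rules: user -> {field: allowed values}.
RULES = {
    "alice": {"region": {"EMEA"}},
    "bob": {"region": {"AMER", "APAC"}},
    "carol": {},  # no field restrictions: sees every row
}


def visible_rows(rows, rules, user):
    """Return only the rows the given user is allowed to see."""
    restrictions = rules.get(user)
    if restrictions is None:
        # Assumption of this sketch: users without a rule see no data.
        return []
    return [
        row
        for row in rows
        if all(row.get(field) in allowed for field, allowed in restrictions.items())
    ]


print([row["revenue"] for row in visible_rows(ROWS, RULES, "alice")])
print([row["revenue"] for row in visible_rows(ROWS, RULES, "bob")])
print(len(visible_rows(ROWS, RULES, "dave")))
```

Because the filter is attached to the dataset rather than to a dashboard, every analysis built on the dataset inherits the same restrictions.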

Page 29

Differentiate with natural language and ML

• Anomaly detection: Discover unexpected trends and outliers against millions of business metrics

• Auto narratives: Summarize your business metrics in plain language

• Forecasting: Machine learning forecasting with point and click simplicity

• ML predictions: Visualize and build predictive dashboards with Amazon SageMaker models

Page 30

Anomaly Detection

Page 31

Forecasting

Page 32

Natural Language Narratives

Page 33

ML insights

Page 34

Embed Amazon QuickSight Dashboards

• Fully interactive with drill down, filtering, & external links

• Personalized views with row-level security

• No servers to manage, no long-term commitments

• Pay for usage with pay-per-session reader pricing

• Seamless authentication

Page 35

Why Amazon QuickSight for embedded dashboards

Diagram: Users; contents secured by AD

Page 36

What's new? Theming

You can now create a collection of themes, and apply a theme to an analysis and all its dashboards.

Page 37

New APIs

Data source APIs

• Create/manage data source connections without sharing credentials

• Audit data sources, manage ownership and access

Dataset APIs

• Create customized datasets for users or groups

• Allows isolation of data, additional row-level security possible

SPICE ingestion APIs

• Trigger SPICE ingestion of data when ETL/data load is complete

• Easily review history and trace SPICE ingestion details

Fine-grained access control APIs

• Manage user/group access to Amazon S3/Athena data via IAM policies

Template APIs

• Create templates from analysis

• Metadata for dashboard with placeholder for datasets

• Templates accessible via API only

• Can be copied/referenced across accounts

Dashboard APIs

• Easily create dashboards per user/group with different datasets

• Move dashboards across dev environments

• Version dashboards for easy rollbacks

• Audit dashboards, list by users, manage ownership/sharing
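To illustrate the SPICE ingestion use case, the Python sketch below builds the parameters for a QuickSight CreateIngestion request (exposed in boto3 as `create_ingestion`) at the point where an ETL job reports completion. It only constructs the request; the account ID, dataset ID, job name, and the naming scheme for the ingestion ID are hypothetical.

```python
from datetime import datetime


def build_ingestion_request(account_id: str, dataset_id: str,
                            etl_job_name: str, finished_at: datetime) -> dict:
    """Build keyword arguments shaped like QuickSight's CreateIngestion request."""
    # Assumed naming scheme: job name plus completion time, so each ingestion
    # can be traced back to the load that triggered it.
    ingestion_id = f"{etl_job_name}-{finished_at:%Y%m%dT%H%M%S}"
    return {
        "AwsAccountId": account_id,
        "DataSetId": dataset_id,
        "IngestionId": ingestion_id,
    }


# Hypothetical identifiers and completion time.
request = build_ingestion_request(
    account_id="123456789012",
    dataset_id="sales-dataset",
    etl_job_name="nightly-etl",
    finished_at=datetime(2020, 5, 27, 1, 30, 0),
)
print(request)
# With AWS credentials configured, the refresh could be started with:
#   boto3.client("quicksight").create_ingestion(**request)
```

Triggering the refresh from the ETL pipeline keeps SPICE in step with the data load instead of relying on a fixed schedule.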

Page 38

Using APIs for your embedded deployments

Diagram labels: Data Source; Customer 1, Customer 2, Customer 3; Application; Identity Store; Your application; Data APIs; User APIs; Amazon QuickSight; personalized, embedded Amazon QuickSight dashboards for customers/users

Page 39

QuickSight Support for Cross Data Source Join

• Join across all data sources supported by QuickSight, including file-to-file, file-to-database, and database-to-database joins

Page 40

Amazon QuickSight: Examples – AWS Cost & Usage Reporting

Page 41

Amazon QuickSight: Examples – Salesforce Analytics

Page 42

Q & A