Amazon Athena Capabilities and Use Cases Overview

Prajakta Damle, Roy Hasson and Abhishek Sinha (Amazon Web Services)

TRANSCRIPT

Page 1: Amazon Athena Capabilities and Use Cases Overview

Amazon Athena

Prajakta Damle, Roy Hasson and Abhishek Sinha

Page 2: Amazon Athena Capabilities and Use Cases Overview

(Title slide repeated.)

Page 3: Amazon Athena Capabilities and Use Cases Overview

What to Expect from the Session

1. Product walk-through of Amazon Athena and AWS Glue

2. Top-3 use-cases

3. Demos

4. Hands-on Labs: http://amazonathenahandson.s3-website-us-east-1.amazonaws.com/

5. Q&A

Page 4: Amazon Athena Capabilities and Use Cases Overview

Emerging Analytics Architecture

Page 5: Amazon Athena Capabilities and Use Cases Overview

Emerging Analytics Architecture

(Architecture diagram: serverless services across ingestion, storage, compute, and visualization.)

Visualization: Amazon QuickSight (fast, easy-to-use cloud BI); analytic notebooks (Jupyter, Zeppelin, Hue)

Data processing: Amazon AI (ML/DL services); Amazon Athena (interactive query); Amazon EMR (managed Hadoop & Spark); Amazon Redshift + Spectrum (petabyte-scale data warehousing); Amazon Redshift Spectrum (fast at exabyte scale); Amazon Elasticsearch (real-time log analytics & search); AWS Glue (ETL & data catalog); Amazon Kinesis Firehose (real-time data streaming); AWS Lambda (trigger-based code execution)

Storage: Amazon S3 (exabyte-scale object storage); AWS Glue Data Catalog (Hive-compatible metastore)

Ingestion: AWS Snowball (petabyte-scale data import); AWS Snowmobile (exabyte-scale data import); Amazon Kinesis (streaming ingestion)

Page 6: Amazon Athena Capabilities and Use Cases Overview

Building a Data Lake on AWS

(Diagram: Amazon Kinesis Firehose for ingestion, AWS Glue for cataloging and ETL, and Athena as the query service.)

Page 7: Amazon Athena Capabilities and Use Cases Overview

Emerging Analytics Architecture

(Repeat of the serverless architecture diagram from Page 5.)

Page 8: Amazon Athena Capabilities and Use Cases Overview

Amazon Athena (launched at re:Invent 2016)

• Serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage

• No data loading required; query directly from Amazon S3

• Use standard ANSI SQL queries with support for joins, JSON, and window functions

• Support for multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, and Parquet; can query data encrypted with KMS

• $5/TB scanned. You pay per query, only for the data scanned. If you compress your data, you pay less and your queries run faster

Page 9: Amazon Athena Capabilities and Use Cases Overview

Athena is Serverless

• No infrastructure or administration

• Zero spin-up time

• Transparent upgrades

Page 10: Amazon Athena Capabilities and Use Cases Overview

Amazon Athena is Easy To Use

• Console

• JDBC

• BI Tools

• Query Builders

• API

Page 11: Amazon Athena Capabilities and Use Cases Overview

Query Data Directly from Amazon S3

• No loading of data

• Query data in its raw format

• Text, CSV, JSON, weblogs, AWS service logs

• Convert to an optimized format like ORC or Parquet for the best performance and lowest cost

• Stream data directly from Amazon S3

• Take advantage of Amazon S3 durability and availability

(A simple example query follows below.)
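As a sketch of querying raw data in place (the table is the access_logs table defined later in this deck), a query can run the moment the table definition exists, with no load step:

SELECT request_path, response_code
FROM access_logs
LIMIT 10;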

Page 12: Amazon Athena Capabilities and Use Cases Overview

Use ANSI SQL

• Start writing ANSI SQL

• Support for complex joins, nested queries & window functions

• Support for complex data types (arrays, structs)

• Support for partitioning of data by any key (date, time, or custom keys), e.g., year/month/day/hour or customer key and date

(A window-function example follows below.)
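As a sketch of a window function (column names come from the access_logs table defined later in the deck), this ranks request paths by traffic within each response code:

SELECT request_path,
       response_code,
       count(*) AS hits,
       rank() OVER (PARTITION BY response_code ORDER BY count(*) DESC) AS path_rank
FROM access_logs
GROUP BY request_path, response_code;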

Page 13: Amazon Athena Capabilities and Use Cases Overview

Familiar Technologies Under the Covers

Used for SQL queries: Presto

• In-memory distributed query engine

• ANSI-SQL compatible with extensions

Used for DDL functionality: Hive

• Complex data types

• Multitude of formats

• Supports data partitioning

Page 14: Amazon Athena Capabilities and Use Cases Overview

Amazon Athena Supports Multiple Data Formats

• Text files, e.g., CSV, raw logs

• Apache Web Logs, TSV files

• JSON (simple, nested)

• Compressed files

• Columnar formats such as Apache Parquet & Apache ORC

• AVRO

Page 15: Amazon Athena Capabilities and Use Cases Overview
Page 16: Amazon Athena Capabilities and Use Cases Overview

Emerging Analytics Architecture

(Repeat of the serverless architecture diagram from Page 5.)

Page 17: Amazon Athena Capabilities and Use Cases Overview

AWS Glue automates the undifferentiated heavy lifting of ETL

Discover: Automatically discover and categorize your data, making it immediately searchable and queryable across data sources

Develop: Generate code to clean, enrich, and reliably move data between various data sources; you can also use your favorite tools to build ETL jobs

Deploy: Run your jobs on a serverless, fully managed, scale-out environment, with no compute resources to provision or manage

Page 18: Amazon Athena Capabilities and Use Cases Overview

AWS Glue: Components

Data Catalog

• Hive Metastore compatible, with enhanced functionality

• Crawlers automatically extract metadata and create tables

• Integrated with Amazon Athena and Amazon Redshift Spectrum

Job Execution

• Run jobs on a serverless Spark platform

• Provides flexible scheduling

• Handles dependency resolution, monitoring and alerting

Job Authoring

• Auto-generates ETL code

• Built on open frameworks – Python and Spark

• Developer-centric – editing, debugging, sharing

Page 19: Amazon Athena Capabilities and Use Cases Overview

Use Case 1: Ad-hoc analysis

Page 20: Amazon Athena Capabilities and Use Cases Overview

Ad-hoc querying

(Diagram: Athena runs ad-hoc queries directly against data in S3.)

Page 21: Amazon Athena Capabilities and Use Cases Overview

Ad-hoc queries

(Diagram: Amazon EMR handles ETL and analytics over data in S3; Athena serves ad-hoc queries, with Amazon QuickSight connecting via JDBC.)

Page 22: Amazon Athena Capabilities and Use Cases Overview

Ad-hoc analysis on historical data

(Diagram: Amazon Kinesis streams data through Firehose into S3; an EMR cluster runs ETL and data is copied into Redshift; Athena runs ad-hoc queries over the historical data in S3.)

Page 23: Amazon Athena Capabilities and Use Cases Overview

Creating Databases and Tables

1. Use Hive DDL statements directly from the console or JDBC (example below)

2. Use the Crawlers

3. Use the API
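As a minimal sketch of the DDL route (the database name is illustrative), creating a database is a single statement you can run from the Athena console:

CREATE DATABASE IF NOT EXISTS weblogs;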

Page 24: Amazon Athena Capabilities and Use Cases Overview

Creating Tables - Concepts

• Create Table statements (DDL) are written in Hive DDL

• High degree of flexibility

• Schema on read

• Hive is SQL-like but adds concepts such as "external tables" and partitioning of data

• Data formats supported: JSON, TXT, CSV, TSV, Parquet and ORC (via SerDes)

• Data is stored in Amazon S3

• Metadata is stored in the Athena data catalog or the AWS Glue Data Catalog

Page 25: Amazon Athena Capabilities and Use Cases Overview

Schema on Read Versus Schema on Write

Schema on write

• Create a schema

• Transform your data into the schema

• Load the data and query the data

• Good for repetition and performance

Schema on read

• Store raw data

• Create the schema on the fly

• Good for exploration and experimentation

Page 26: Amazon Athena Capabilities and Use Cases Overview

Example

CREATE EXTERNAL TABLE access_logs
(
  ip_address String,
  request_time Timestamp,
  request_method String,
  request_path String,
  request_protocol String,
  response_code String,
  response_size String,
  referrer_host String,
  user_agent String
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/'

EXTERNAL = the table is only a schema over the data; when you delete the table, the underlying data is not deleted.
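To make that last point concrete (a sketch): dropping the table removes only the table definition from the catalog.

DROP TABLE access_logs;   -- the files under s3://YOUR-S3-BUCKET-NAME/access-log-processed/ remain in S3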

Page 27: Amazon Athena Capabilities and Use Cases Overview

Example

(Same CREATE EXTERNAL TABLE access_logs statement as on Page 26.)

LOCATION = where the data is stored. In Athena this must be a location in Amazon S3.

Page 28: Amazon Athena Capabilities and Use Cases Overview

Creating Tables – Parquet

CREATE EXTERNAL TABLE db_name.taxi_rides_parquet (
  vendorid STRING,
  pickup_datetime TIMESTAMP,
  dropoff_datetime TIMESTAMP,
  ratecode INT,
  passenger_count INT,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  total_amount DOUBLE,
  payment_type INT
)
PARTITIONED BY (year INT, month INT, type STRING)
STORED AS PARQUET
LOCATION 's3://serverless-analytics/canonical/NY-Pub'
TBLPROPERTIES ('has_encrypted_data'='true');
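A sketch of querying this table (the year and month values are illustrative): filtering on the partition columns means only the matching partitions are read.

SELECT type, count(*) AS trips, avg(trip_distance) AS avg_distance
FROM db_name.taxi_rides_parquet
WHERE year = 2016 AND month = 1
GROUP BY type;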

Page 29: Amazon Athena Capabilities and Use Cases Overview

Creating Tables – Nested JSON

CREATE EXTERNAL TABLE IF NOT EXISTS fix_messages (
  `bodyLength` int,
  `defaultAppVerID` string,
  `encryptMethod` int,
  `msgSeqNum` int,
  `msgType` string,
  `resetSeqNumFlag` string,
  `securityRequestID` string,
  `securityRequestResult` int,
  `securityXML` struct<version:int, header:struct<assetClass:string, tierLevelISIN:int, useCase:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
LOCATION 's3://my_bucket/fix/'
TBLPROPERTIES ('has_encrypted_data'='false');
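A sketch of addressing the nested struct fields with dot notation (column names come from the table above):

SELECT msgtype,
       securityxml.version AS version,
       securityxml.header.assetclass AS asset_class
FROM fix_messages
LIMIT 10;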

Page 30: Amazon Athena Capabilities and Use Cases Overview

CSV SerDes

LazySimpleSerDe – does not support removing quote characters from fields, but supports different primitive types:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'colelction.delim' = '|',
  'mapkey.delim' = ':',
  'escape.delim' = '\\' )

OpenCSVSerde – handles quoted fields, but all fields must be defined as type STRING:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "`",
  "escapeChar" = "\\" )

Page 31: Amazon Athena Capabilities and Use Cases Overview

Grok Serde

CREATE EXTERNAL TABLE `mygroktable` (
  `syslogbase` string,
  `queue_id` string,
  `syslog_message` string
)
ROW FORMAT SERDE
  'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.grokCustomPatterns' = 'POSTFIX_QUEUEID [0-9A-F]{10,11}',
  'input.format' = '%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}'
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mybucket/groksample';
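A sketch of querying the parsed fields once the Grok table exists:

SELECT queue_id, syslog_message
FROM mygroktable
LIMIT 10;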

Page 32: Amazon Athena Capabilities and Use Cases Overview

Athena Cookbook

Page 33: Amazon Athena Capabilities and Use Cases Overview

AWS Big Data Blog

Page 34: Amazon Athena Capabilities and Use Cases Overview

Demo: 1

Page 35: Amazon Athena Capabilities and Use Cases Overview

Data Catalog & Crawlers

Page 36: Amazon Athena Capabilities and Use Cases Overview

Glue Data Catalog

Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable.

Page 37: Amazon Athena Capabilities and Use Cases Overview

Glue Data Catalog

Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools like Hive, Presto, Spark, etc.

We added a few extensions:

• Search over metadata for data discovery

• Connection info – JDBC URLs, credentials

• Classification for identifying and parsing files

• Versioning of table metadata as schemas evolve and other metadata are updated

Populate using Hive DDL, bulk import, or automatically through Crawlers.

Page 38: Amazon Athena Capabilities and Use Cases Overview

Glue Data Catalog: Crawlers

Automatically discover new data and extract schema definitions

• Detect schema changes and version tables

• Detect Hive-style partitions on Amazon S3

Built-in classifiers for popular types; custom classifiers using Grok expressions

Run ad hoc or on a schedule; serverless – you only pay when the crawler runs

Crawlers automatically build your Data Catalog and keep it in sync

Page 39: Amazon Athena Capabilities and Use Cases Overview

Data Catalog: Table details

(Screenshot: table schema, table properties, data statistics, and nested fields.)

Page 40: Amazon Athena Capabilities and Use Cases Overview

Data Catalog: Version control

(Screenshots: list of table versions; compare schema versions.)

Page 41: Amazon Athena Capabilities and Use Cases Overview

Data Catalog: Detecting partitions

(Diagram: an S3 bucket hierarchy such as month=Nov/date=10/…/file N is mapped to a table definition with partition columns month and date plus data columns (e.g., col 1: str, col 2: int/float). The crawler estimates schema similarity among files at each level (e.g., sim = .99, .95, .93) to handle semi-structured logs, schema evolution, and so on.)

Page 42: Amazon Athena Capabilities and Use Cases Overview

Data Catalog: Automatic partition detection

Automatically register available partitions.

(Screenshot: table partitions.)

Page 43: Amazon Athena Capabilities and Use Cases Overview

Running Queries is Simple

(Screenshot: each query reports its run time and the amount of data scanned.)

Page 44: Amazon Athena Capabilities and Use Cases Overview

Demo 2

Page 45: Amazon Athena Capabilities and Use Cases Overview

Use Case 2: Data Lake Analytics

Page 46: Amazon Athena Capabilities and Use Cases Overview

Data Lake Architecture

Page 47: Amazon Athena Capabilities and Use Cases Overview

How is it different?

Lay out data to allow for better performance and lower cost:

1. Conversion of data to columnar formats

2. Data partitioning

3. Optimizing for file sizes

4. Customers may also do ETL to clean up the data or create views

Page 48: Amazon Athena Capabilities and Use Cases Overview

Benefits of conversion

• Pay by the amount of data scanned per query

• Ways to save costs: compress, convert to a columnar format, use partitioning

• Free: DDL queries, failed queries

Dataset                                 Size on Amazon S3   Query run time   Data scanned            Cost
Logs stored as text files               1 TB                237 seconds      1.15 TB                 $5.75
Logs stored in Apache Parquet format*   130 GB              5.13 seconds     2.69 GB                 $0.013
Savings                                 87% less with Parquet   34x faster   99% less data scanned   99.7% cheaper

Page 49: Amazon Athena Capabilities and Use Cases Overview

Job authoring in AWS Glue

You have choices on how to get started:

• Python code generated by AWS Glue

• Connect a notebook or IDE to AWS Glue

• Existing code brought into AWS Glue

You can use Glue for data conversion and ETL.

Page 50: Amazon Athena Capabilities and Use Cases Overview

Job authoring: Automatic code generation

1. Customize the mappings

2. Glue generates the transformation graph and Python code

3. Connect your notebook to development endpoints to customize your code

Page 51: Amazon Athena Capabilities and Use Cases Overview

Job authoring: ETL code

• Human-readable, editable, and portable PySpark code

• Flexible: Glue's ETL library simplifies manipulating complex, semi-structured data

• Customizable: use native PySpark, import custom libraries, and/or leverage Glue's libraries

• Collaborative: share code snippets via GitHub, reuse code across jobs

Page 52: Amazon Athena Capabilities and Use Cases Overview

Job Authoring: Glue Dynamic Frames

(Diagram: a dynamic frame schema with nested and choice fields.)

Like Spark's DataFrames, but better for:

• Cleaning and (re)structuring semi-structured data sets, e.g. JSON, Avro, Apache logs ...

No upfront schema needed:

• Infers the schema on the fly, enabling transformations in a single pass

Easy to handle the unexpected:

• Tracks new fields and inconsistent, changing data types with choices, e.g. integer or string

• Automatically marks and separates error records

Page 53: Amazon Athena Capabilities and Use Cases Overview

Job Authoring: Glue transforms

(Diagram: ResolveChoice() resolves a choice-typed column by projecting to one type, casting, or separating it into distinct columns; ApplyMapping() remaps source columns to target columns. Adaptive and flexible.)

Page 54: Amazon Athena Capabilities and Use Cases Overview

Job authoring: Relationalize() transform

(Diagram: a semi-structured schema with nested structs and arrays is flattened into a relational schema of linked tables with primary/foreign keys plus value and offset columns.)

• Transforms and adds new columns, types, and tables on-the-fly

• Tracks keys and foreign keys across runs

• SQL on the relational schema is orders of magnitude faster than JSON processing

Page 55: Amazon Athena Capabilities and Use Cases Overview

Job authoring: Glue transformations

• Prebuilt transformations: click and add to your job with simple configuration

• Spigot writes sample data from a DynamicFrame to S3 in JSON format

• Expanding… more transformations to come

Page 56: Amazon Athena Capabilities and Use Cases Overview

Job authoring: Write your own scripts

• Import custom libraries required by your code

• Convert to a Spark DataFrame for complex SQL-based ETL

• Convert back to a Glue DynamicFrame for semi-structured processing and AWS Glue connectors

Page 57: Amazon Athena Capabilities and Use Cases Overview

Job Authoring: Leveraging the community

No need to start from scratch. Use the Glue samples stored in GitHub to share, reuse, and contribute: https://github.com/awslabs/aws-glue-samples

• Migration scripts to import existing Hive Metastore data into the AWS Glue Data Catalog

• Examples of how to use Dynamic Frames and the Relationalize() transform

• Examples of how to use arbitrary PySpark code with Glue's Python ETL library

Download Glue's Python ETL library to start developing code in your IDE: https://github.com/awslabs/aws-glue-libs

Page 58: Amazon Athena Capabilities and Use Cases Overview

Data Partitioning – Benefits

• Separates data files by any column

• Read only the files the query needs

• Reduce the amount of data scanned

• Reduce query completion time

• Reduce query cost

(See the example query below.)
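For instance (a sketch, using the partitioned access_logs table from the earlier example), a query that filters on the partition columns reads only the matching S3 prefixes:

SELECT count(*) AS requests
FROM access_logs
WHERE year = '2017' AND month = '05' AND day = '01';   -- only the year=2017/month=05/day=01 files are scanned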

Page 59: Amazon Athena Capabilities and Use Cases Overview
Page 60: Amazon Athena Capabilities and Use Cases Overview

Data Partitioning – S3

• Prefer Hive-compatible partition naming: [column_name=column_value], e.g. s3://athena-examples/logs/year=2017/month=5/ (partitions can then be loaded in one statement; see below)

• Simple partition naming is also supported, e.g. s3://athena-examples/logs/2017/5/ (partitions must be added explicitly, as on the next slide)
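A minimal sketch for the Hive-compatible layout (table name taken from the earlier example; with the simple naming style you use ALTER TABLE ADD PARTITION instead, as shown on the next slide):

MSCK REPAIR TABLE access_logs;   -- scans the table's S3 location and registers all year=/month=/day= partitions it finds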

Page 61: Amazon Athena Capabilities and Use Cases Overview

Data Partitioning – Data Catalog

ALTER TABLE app_logs ADD PARTITION (year='2015', month='01', day='01')
  location 's3://athena-examples/app/plaintext/year=2015/month=01/day=01/'

ALTER TABLE elb_logs ADD PARTITION (year='2015', month='01', day='01')
  location 's3://athena-examples/elb/plaintext/2015/01/01/'

ALTER TABLE orders DROP PARTITION (dt='2014-05-14', country='IN'),
  PARTITION (dt='2014-05-15', country='IN')

ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION
  's3://athena-examples/new_customers/zip=98040/state=WA'

Page 62: Amazon Athena Capabilities and Use Cases Overview

File sizes matter

Page 63: Amazon Athena Capabilities and Use Cases Overview

Data Lake Architecture

Page 64: Amazon Athena Capabilities and Use Cases Overview

Use Case 3: Embedding Athena

Page 65: Amazon Athena Capabilities and Use Cases Overview

Athena API

• Asynchronous interaction model: initiate a query, get a query ID, retrieve results

• Named queries: save queries and reuse them

• Paginated result set: max page size currently 1000

• Column data and metadata: name, type, precision, nullable

• Query status: state, start and end times

• Query statistics: data scanned and execution time

Page 66: Amazon Athena Capabilities and Use Cases Overview

Hands On Labs

• http://amazonathenahandson.s3-website-us-east-1.amazonaws.com/

Page 67: Amazon Athena Capabilities and Use Cases Overview

Thank You!