Amazon Athena: Capabilities and Use Cases Overview
TRANSCRIPT
Amazon Athena
Prajakta Damle, Roy Hasson and Abhishek Sinha
What to Expect from the Session
1. Product walk-through of Amazon Athena and AWS Glue
2. Top 3 use cases
3. Demos
4. Hands-on Labs: http://amazonathenahandson.s3-website-us-east-1.amazonaws.com/
5. Q&A
Emerging Analytics Architecture
Serverless services, organized by layer:
Visualization
• Amazon QuickSight: fast, easy-to-use cloud BI
• Analytic Notebooks: Jupyter, Zeppelin, HUE
Data Processing
• Amazon AI: ML/DL services
• Amazon Athena: interactive query
• Amazon EMR: managed Hadoop & Spark
• Amazon Redshift + Spectrum: petabyte-scale data warehousing
• Amazon Elasticsearch: real-time log analytics & search
• AWS Glue: ETL & data catalog
• Amazon Kinesis Firehose: real-time data streaming
• AWS Lambda: trigger-based code execution
• Amazon Redshift Spectrum: fast at exabyte scale
Storage
• Amazon S3: exabyte-scale object storage
• AWS Glue Data Catalog: Hive-compatible metastore
Ingestion
• AWS Snowball: petabyte-scale data import
• AWS Snowmobile: exabyte-scale data import
• Amazon Kinesis: streaming ingestion
Building a Data Lake on AWS
Kinesis Firehose (ingestion), Athena (query service), AWS Glue (data catalog), Amazon S3 (storage)
Amazon Athena (launched at re:Invent 2016)
• Serverless query service for querying data in S3 using standard SQL,
with no infrastructure to manage
• No data loading required; query directly from Amazon S3
• Use standard ANSI SQL queries with support for joins, JSON, and
window functions
• Support for multiple data formats, including text, CSV, TSV, JSON, Avro,
ORC, and Parquet; can query data encrypted with KMS
• $5/TB scanned. Pay per query only when you’re running queries based
on data scanned. If you compress your data, you pay less and your
queries run faster
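The pricing above is simple enough to sketch as arithmetic. A minimal cost model, assuming the $5/TB rate quoted on this slide and ignoring any per-query minimum rounding:

```python
# Back-of-the-envelope Athena cost model: $5 per TB of data scanned.
# Illustrative only; it ignores per-query minimum rounding.

PRICE_PER_TB = 5.0

def query_cost_usd(bytes_scanned: int) -> float:
    """Cost of one query, given the bytes Athena actually scanned."""
    return bytes_scanned / 1024**4 * PRICE_PER_TB

# Scanning 1 TB of raw text costs $5.00.
raw_cost = query_cost_usd(1024**4)

# The same data compressed 3:1 scans a third of the bytes, so it
# costs a third as much (and typically runs faster, too).
compressed_cost = query_cost_usd(1024**4 // 3)

print(raw_cost, round(compressed_cost, 2))  # -> 5.0 1.67
```

This is why the slide notes that compressing your data both lowers cost and speeds up queries: both scale with bytes scanned.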
Athena is Serverless
• No infrastructure or administration
• Zero spin-up time
• Transparent upgrades
Amazon Athena is Easy To Use
• Console
• JDBC
• BI Tools
• Query Builders
• API
Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized format such as ORC or Parquet for the best performance and lowest cost
• Stream data directly from Amazon S3
• Take advantage of Amazon S3 durability and availability
Use ANSI SQL
• Start writing ANSI SQL
• Support for complex joins, nested queries & window functions
• Support for complex data types (arrays, structs)
• Support for partitioning of data by any key (date, time, or custom keys),
e.g., Year/Month/Day/Hour or Customer Key/Date
Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible, with extensions
Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
Amazon Athena Supports Multiple Data Formats
• Text files, e.g., CSV, raw logs
• Apache Web Logs, TSV files
• JSON (simple, nested)
• Compressed files
• Columnar formats such as Apache Parquet & Apache ORC
• Avro
AWS Glue automates
the undifferentiated heavy lifting of ETL
Automatically discover and categorize your data, making it immediately
searchable and queryable across data sources
Generate code to clean, enrich, and reliably move data between various data
sources; you can also use your favorite tools to build ETL jobs
Run your jobs on a serverless, fully managed, scale-out environment, with no
compute resources to provision or manage
Discover
Develop
Deploy
AWS Glue: Components
Data Catalog
Hive Metastore compatible with enhanced functionality
Crawlers automatically extract metadata and create tables
Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
Run jobs on a serverless Spark platform
Provides flexible scheduling
Handles dependency resolution, monitoring and alerting
Job Authoring
Auto-generates ETL code
Build on open frameworks – Python and Spark
Developer-centric – editing, debugging, sharing
Use Case 1: Ad Hoc Analysis
Ad hoc querying: data lands in S3; Athena runs ad hoc queries directly against it.
Ad hoc queries alongside ETL: EMR handles ETL/analytics over S3 while Athena serves ad hoc queries, with QuickSight connecting over JDBC.
Ad hoc analysis on historical data: Kinesis Streams and Firehose deliver data to S3; an EMR cluster runs ETL; Athena answers ad hoc queries; data can also be copied into Redshift.
Creating Databases and Tables
1. Use the Hive DDL statement directly from the console
or JDBC
2. Use the Crawlers
3. Use the API
Creating Tables - Concepts
• Create Table Statements (or DDL) are written in Hive
• High degree of flexibility
• Schema on Read
• Hive is SQL-like but adds concepts such as "external
tables" and partitioning of data
• Data formats supported: JSON, TXT, CSV, TSV, Parquet,
and ORC (via SerDes)
• Data is stored in Amazon S3
• Metadata is stored in the Athena Data Catalog or Glue Data
Catalog
Schema on Read Versus Schema on Write
Schema on write
• Create a schema
• Transform your data into the schema
• Load the data and query the data
• Good for repetition and performance
Schema on read
• Store raw data
• Create the schema on the fly
• Good for exploration and experimentation
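The contrast can be sketched in a few lines of Python: with schema-on-read, the bytes in storage never change; only the column list applied at query time does. The field names below are hypothetical.

```python
import csv
import io

# Schema-on-read sketch: the raw data stays exactly as stored; a
# schema is applied only when reading. Field names are hypothetical.

RAW = "10.0.0.1\t2017-04-01T10:00:00\tGET\t/index.html\t200\n"

def read_with_schema(raw: str, columns: list) -> list:
    """Project raw tab-delimited lines onto a column list on the fly."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t")
    return [dict(zip(columns, row)) for row in reader]

# Two different "tables" over the same bytes, with no reload step:
narrow = read_with_schema(RAW, ["ip_address", "request_time"])
wide = read_with_schema(RAW, ["ip_address", "request_time",
                              "request_method", "request_path",
                              "response_code"])
print(narrow[0]["ip_address"], wide[0]["response_code"])  # -> 10.0.0.1 200
```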
Example
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp,
request_method String,
request_path String,
request_protocol String,
response_code String,
response_size String,
referrer_host String,
user_agent String
)
PARTITIONED BY (year STRING,month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/'
EXTERNAL = creates a view over the data.
When you drop the table, the underlying
data is not deleted
LOCATION = where the data is stored.
In Athena, this must be a location
in Amazon S3
Creating Tables – Parquet
CREATE EXTERNAL TABLE db_name.taxi_rides_parquet (
vendorid STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
ratecode INT,
passenger_count INT,
trip_distance DOUBLE,
fare_amount DOUBLE,
total_amount DOUBLE,
payment_type INT
)
PARTITIONED BY (YEAR INT, MONTH INT, TYPE string)
STORED AS PARQUET
LOCATION 's3://serverless-analytics/canonical/NY-Pub'
TBLPROPERTIES ('has_encrypted_data'='true');
Creating Tables – Nested JSON
CREATE EXTERNAL TABLE IF NOT EXISTS fix_messages (
`bodyLength` int,
`defaultAppVerID` string,
`encryptMethod` int,
`msgSeqNum` int,
`msgType` string,
`resetSeqNumFlag` string,
`securityRequestID` string,
`securityRequestResult` int,
`securityXML` struct <version:int, header:struct<assetClass:string, tierLevelISIN:int, useCase:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://my_bucket/fix/'
TBLPROPERTIES ('has_encrypted_data'='false');
CSV SerDe
LazySimpleSerDe: does not support removing quote characters from fields, but supports different primitive types
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'colelction.delim' = '|',
'mapkey.delim' = ':',
'escape.delim' = '\\' )
OpenCSVSerde: removes quote characters, but all fields must be defined as type STRING
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "`",
"escapeChar" = "\\" )
Grok Serde
CREATE EXTERNAL TABLE `mygroktable`(
`syslogbase` string,
`queue_id` string,
`syslog_message` string
)
ROW FORMAT SERDE
'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.grokCustomPatterns' = 'POSTFIX_QUEUEID [0-9A-F]{10,11}',
'input.format'='%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/groksample';
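In plain-regex terms, the Grok patterns above amount to the following sketch: the custom POSTFIX_QUEUEID pattern is 10-11 hex characters, and GREEDYDATA is "everything else". This is only an illustration of the matching, not the GrokSerDe itself; the sample log line is made up.

```python
import re

# Regex equivalent of:
# %{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}
# where POSTFIX_QUEUEID = [0-9A-F]{10,11}.
LINE_RE = re.compile(
    r"^(?P<syslogbase>.+?) (?P<queue_id>[0-9A-F]{10,11}): "
    r"(?P<syslog_message>.*)$"
)

line = "Mar  1 10:15:01 mail postfix/smtpd[1234] 03A1F2B4C5D: connect from unknown"
m = LINE_RE.match(line)
print(m.group("queue_id"), "|", m.group("syslog_message"))
```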
Athena Cookbook
AWS Big Data Blog
Demo 1
Data Catalog & Crawlers
Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable
Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL.
Supported by tools like Hive, Presto, Spark etc.
We added a few extensions:
Search over metadata for data discovery
Connection info – JDBC URLs, credentials
Classification for identifying and parsing files
Versioning of table metadata as schemas evolve and other
metadata are updated
Populate using Hive DDL, bulk import, or automatically through Crawlers.
Glue Data Catalog: Crawlers
Automatically discovers new data and extracts schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
Data Catalog: Table details
Table schema
Table properties
Data statistics
Nested fields
Data Catalog: Version control
List of table versions; compare schema versions
Data Catalog: Detecting partitions
(Diagram: an S3 bucket hierarchy such as month=Nov/date=10/…/file N maps to a
table definition with partition columns month and date plus the data columns.)
Glue estimates schema similarity among files at each level to
handle semi-structured logs, schema evolution, and more.
Data Catalog: Automatic partition detection
Automatically registers available table partitions
Running Queries is Simple
The console shows run time and data scanned for each query
Demo 2
Use Case 2: Data Lake Analytics
Data Lake Architecture
How is it different?
Lay out data to allow for better performance and lower cost:
1. Conversion of data to columnar formats
2. Data partitioning
3. Optimizing for file sizes
4. Customers may also do ETL to clean up the data or
create views
Benefits of conversion
• Pay by the amount of data scanned per query
• Ways to save costs
• Compress
• Convert to Columnar format
• Use partitioning
• Free: DDL Queries, Failed Queries
Dataset                               | Size on Amazon S3     | Query run time | Data scanned          | Cost
Logs stored as text files             | 1 TB                  | 237 seconds    | 1.15 TB               | $5.75
Logs stored in Apache Parquet format* | 130 GB                | 5.13 seconds   | 2.69 GB               | $0.013
Savings                               | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% cheaper
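The cost column follows directly from the $5-per-TB-scanned price; a quick sanity check of both rows:

```python
# Verify the cost figures in the comparison above at $5/TB scanned.
PRICE_PER_TB = 5.0

text_cost = 1.15 * PRICE_PER_TB               # 1.15 TB scanned
parquet_cost = (2.69 / 1024) * PRICE_PER_TB   # 2.69 GB scanned

print(round(text_cost, 2), round(parquet_cost, 3))  # -> 5.75 0.013

# Parquet also cuts the bytes scanned by roughly 99.8%:
print(round(1 - 2.69 / (1.15 * 1024), 3))  # -> 0.998
```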
Job authoring in AWS Glue
You have choices on how to get started:
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Bring existing code into AWS Glue
You can use Glue for data conversion and ETL
1. Customize the mappings
2. Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
Job authoring: Automatic code generation
Human-readable, editable, and portable PySpark code
Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
Collaborative: share code snippets via GitHub, reuse code across jobs
Job authoring: ETL code
Job Authoring: Glue Dynamic Frames
(Diagram: a dynamic frame schema with nested and choice fields.)
Like Spark's Data Frames, but better for:
• Cleaning and (re)structuring semi-structured
data sets, e.g. JSON, Avro, Apache logs ...
No upfront schema needed:
• Infers schema on the fly, enabling transformations
in a single pass
Easy to handle the unexpected:
• Tracks new fields and inconsistent, changing data
types with choices, e.g. integer or string
• Automatically marks and separates error records
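The "choice" idea can be sketched in plain Python: a field that arrives as both int and string is tracked as a choice and later resolved by casting, with unresolvable rows set aside as error records. This stands in for DynamicFrame behavior and is not the Glue API; the data is made up.

```python
# Sketch of choice-type resolution: cast a choice(int, string) field
# to int and separate error records. Plain Python, not Glue's API.

records = [
    {"user_id": 42, "name": "ann"},
    {"user_id": "43", "name": "bob"},     # same field, different type
    {"user_id": "oops", "name": "cat"},   # unresolvable -> error record
]

def resolve_choice_cast_int(rows):
    """Resolve user_id to int; collect rows that cannot be cast."""
    resolved, errors = [], []
    for row in rows:
        try:
            resolved.append({**row, "user_id": int(row["user_id"])})
        except (TypeError, ValueError):
            errors.append(row)
    return resolved, errors

ok, bad = resolve_choice_cast_int(records)
print(len(ok), len(bad))  # -> 2 1
```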
Job Authoring: Glue transforms
• ResolveChoice(): resolves a choice type by projecting to one type,
casting, or separating into columns
• ApplyMapping(): maps source columns and types to target columns and types
Adaptive and flexible
Job authoring: Relationalize() transform
Converts a semi-structured schema (nested structs, arrays) into a
relational schema (flat tables linked by primary and foreign keys)
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
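A minimal sketch of what Relationalize() does: flatten nested structs into dotted columns and spill arrays into a child table linked back by a synthetic key. The column names here are illustrative, not Glue's actual output.

```python
# Toy relationalize: structs become dotted columns; arrays become a
# child table with (fk, offset, value) rows. Not the Glue transform.

def relationalize(rows):
    main, child = [], []
    for pk, row in enumerate(rows):
        flat = {"pk": pk}
        for key, value in row.items():
            if isinstance(value, dict):       # struct -> dotted columns
                for k, v in value.items():
                    flat[f"{key}.{k}"] = v
            elif isinstance(value, list):     # array -> child table
                for offset, item in enumerate(value):
                    child.append({"fk": pk, "offset": offset, "value": item})
            else:
                flat[key] = value
        main.append(flat)
    return main, child

rows = [{"a": 1, "c": {"x": 10, "y": 20}, "d": ["p", "q"]}]
main, child = relationalize(rows)
print(main[0]["c.x"], child[1]["value"])  # -> 10 q
```

Once flattened like this, the tables can be queried with plain SQL joins, which is where the speedup over repeated JSON parsing comes from.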
Job authoring: Glue transformations
• Prebuilt transformations: click and add to your job with simple
configuration
• Spigot writes sample data from a DynamicFrame to S3 in JSON format
• Expanding: more transformations to come
Job authoring: Write your own scripts
• Import custom libraries required by your code
• Convert to a Spark Data Frame for complex SQL-based ETL
• Convert back to a Glue Dynamic Frame for semi-structured processing and
AWS Glue connectors
Job Authoring: Leveraging the community
No need to start from scratch.
Use Glue samples stored in GitHub to share, reuse, and
contribute: https://github.com/awslabs/aws-glue-samples
• Migration scripts to import existing Hive Metastore data
into AWS Glue Data Catalog
• Examples of how to use Dynamic Frames and
Relationalize() transform
• Examples of how to use arbitrary PySpark code with
Glue’s Python ETL library
Download Glue’s Python ETL library to start developing
code in your IDE: https://github.com/awslabs/aws-glue-libs
Data Partitioning – Benefits
• Separates data files by any column
• Read only the files the query needs
• Reduce the amount of data scanned
• Reduce query completion time
• Reduce query cost
Data Partitioning – S3
• Prefer Hive-compatible partition naming
• [column_name=column_value]
• e.g., s3://athena-examples/logs/year=2017/month=5/
• Simple partition naming is also supported
• e.g., s3://athena-examples/logs/2017/5/
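The two layouts above differ in whether partitions can be discovered automatically. A sketch of building both prefixes for the example log bucket:

```python
from datetime import date

# Hive-compatible (column=value) prefixes let Athena and the Glue
# crawler discover partitions automatically; the plain layout works
# too, but partitions must be registered explicitly (e.g. via
# ALTER TABLE ... ADD PARTITION).

def hive_partition_prefix(base: str, d: date) -> str:
    """base + year=YYYY/month=M/ style prefix."""
    return f"{base}year={d.year}/month={d.month}/"

def simple_partition_prefix(base: str, d: date) -> str:
    """base + YYYY/M/ layout with no column names in the path."""
    return f"{base}{d.year}/{d.month}/"

base = "s3://athena-examples/logs/"
print(hive_partition_prefix(base, date(2017, 5, 1)))
print(simple_partition_prefix(base, date(2017, 5, 1)))
```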
Data Partitioning – Data Catalog
ALTER TABLE app_logs ADD PARTITION (year='2015',month='01',day='01')
location 's3://athena-examples/app/plaintext/year=2015/month=01/day=01/'
ALTER TABLE elb_logs ADD PARTITION (year='2015',month='01',day='01')
location 's3://athena-examples/elb/plaintext/2015/01/01/'
ALTER TABLE orders DROP PARTITION (dt='2014-05-14',country='IN'),
PARTITION (dt='2014-05-15',country='IN')
ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION
's3://athena-examples/new_customers/zip=98040/state=WA'
File sizes matter
Data Lake Architecture
Use Case 3: Embedding Athena
Athena API
• Asynchronous interaction model
• Initiate a query, get query ID, retrieve results
• Named queries
• Save queries and reuse
• Paginated result set
• Max page size is currently 1,000
• Column data and metadata
• Name, type, precision, nullable
• Query status
• State, start and end times
• Query statistics
• Data scanned and execution time
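The asynchronous model above boils down to: start a query, keep its ID, poll until a terminal state, then fetch results. This sketch uses a hypothetical client object so the control flow stands alone; with the real API the corresponding calls would be StartQueryExecution, GetQueryExecution, and GetQueryResults.

```python
import time

# Control-flow sketch of Athena's asynchronous interaction model,
# against a stand-in client interface (start_query / get_state /
# get_results are hypothetical names, not the real API).

TERMINAL = {"SUCCEEDED", "FAILED", "CANCELLED"}

def run_query(client, sql, poll_seconds=0.0):
    """Initiate a query, poll its state, return results on success."""
    query_id = client.start_query(sql)
    while True:
        state = client.get_state(query_id)
        if state in TERMINAL:
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")
    return client.get_results(query_id)

class FakeClient:
    """Stub that succeeds after two non-terminal polls."""
    def __init__(self):
        self.states = iter(["QUEUED", "RUNNING", "SUCCEEDED"])
    def start_query(self, sql):
        return "query-1"
    def get_state(self, query_id):
        return next(self.states)
    def get_results(self, query_id):
        return [("row", 1)]

print(run_query(FakeClient(), "SELECT 1"))  # -> [('row', 1)]
```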
Hands On Labs
• http://amazonathenahandson.s3-website-us-east-1.amazonaws.com/
Thank You!