Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013

Post on 11-May-2015


DESCRIPTION

Amazon Redshift is a fast, fully-managed, petabyte-scale data warehouse service that costs less than $1,000 per terabyte per year—less than a tenth the price of most traditional data warehousing solutions. In this session, you get an overview of Amazon Redshift, including how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. Finally, we announce new features that we've been working on over the past few months.

TRANSCRIPT

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Redshift Overview & What’s Next

Rahul Pathak, Redshift PM (rapathak@amazon.com)

Anurag Gupta, Redshift GM (awgupta@amazon.com)

November 13, 2013

Amazon Redshift

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Amazon Redshift dramatically reduces I/O

• Data compression

• Zone maps

• Direct-attached storage

• With row storage you do unnecessary I/O

• To get the total amount, you have to read everything

• With column storage, you only read the data you need

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
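The difference between the two layouts can be sketched with the four-row table from the slide. This is only a toy in-memory model: the function names and the "values read" counter are assumptions made for illustration, not a description of how Amazon Redshift stores or reads blocks.

```python
# Toy model of the four-row table shown on the slide.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]

def total_amount_row_store(rows):
    """Row layout: every field of every row is read to sum one column."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # the whole row is read
        total += row[3]          # but only Amount is needed
    return total, values_read

def to_column_store(rows, names=("id", "age", "state", "amount")):
    """Pivot the rows into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def total_amount_column_store(columns):
    """Column layout: only the Amount column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)

row_total, row_reads = total_amount_row_store(rows)
col_total, col_reads = total_amount_column_store(to_column_store(rows))
print(row_total, row_reads)  # 1250 16
print(col_total, col_reads)  # 1250 4
```

Both layouts return the same total, but the column layout touches 4 values instead of 16 in this example.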


analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw

Slides not intended for redistribution.


• COPY compresses automatically on load

• You can analyze and override

• More performance, less cost
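As an illustration of why an encoding such as delta can shrink a column like listid, here is a minimal sketch of the general delta-encoding idea: keep the first value, then store only the difference from the previous value. The sample IDs are hypothetical, and this is not Amazon Redshift's on-disk format.

```python
def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    if not values:
        return []
    encoded = [values[0]]
    for prev, cur in zip(values, values[1:]):
        encoded.append(cur - prev)
    return encoded

def delta_decode(encoded):
    """Rebuild the original values by accumulating the differences."""
    values = []
    running = 0
    for i, delta in enumerate(encoded):
        running = delta if i == 0 else running + delta
        values.append(running)
    return values

# Hypothetical, mostly sequential IDs (not taken from the slides).
list_ids = [100001, 100002, 100003, 100007, 100008]
encoded = delta_encode(list_ids)
print(encoded)                # [100001, 1, 1, 4, 1]
print(delta_decode(encoded))  # [100001, 100002, 100003, 100007, 100008]
```

After the first value, the differences are small numbers that fit in far fewer bytes than the full IDs, which is where the space saving comes from.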


Block values                               Min   Max
10 | 13 | 14 | 26 | … | 100 | 245 | 324    10    324
375 | 393 | 417 | … | 512 | 549 | 623      375   623
637 | 712 | 809 | … | 834 | 921 | 959      637   959

• Track the minimum and maximum value for each block

• Skip over blocks that don’t contain relevant data
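The block-skipping idea can be sketched as follows. The blocks hold only the values visible on the slide (the elided values are left out), and `scan_range` is a hypothetical helper written for this example, not a Redshift API.

```python
# Three sorted blocks, using only the values visible on the slide.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# Zone map: the (min, max) pair tracked for each block.
zone_map = [(min(block), max(block)) for block in blocks]

def scan_range(blocks, zone_map, low, high):
    """Return values in [low, high], reading only blocks whose range overlaps."""
    matches = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < low or block_min > high:
            continue  # no value in this block can match, so skip it
        blocks_read += 1
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read

matches, blocks_read = scan_range(blocks, zone_map, 400, 600)
print(zone_map)              # [(10, 324), (375, 623), (637, 959)]
print(matches, blocks_read)  # [417, 512, 549] 1
```

For a filter between 400 and 600, only the middle block is read; the other two are skipped using their min and max alone.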


DW.HS1.8XL:

• > 2 GB/s scan rate

• Optimized for data processing

• High disk density

DW.HS1.XL:

Amazon Redshift architecture

• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution

• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB

• Single node version available

[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]

Amazon Redshift parallelizes and distributes everything

• Load

• Backup/Restore

• Resize

• Load in parallel from Amazon S3 or Amazon DynamoDB

• Data automatically distributed and sorted according to DDL

• Scales linearly with number of nodes
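A rough sketch of what "distributed and sorted according to DDL" could look like: rows are assigned to a node by hashing a distribution key, and each node then sorts its share by a sort key. The CRC32-modulo scheme, the `distribute` helper, and the sample rows are all assumptions for illustration; the slides do not describe the actual distribution algorithm.

```python
import zlib

def distribute(rows, dist_key, sort_key, num_nodes):
    """Hash each row's distribution key to pick a node, then sort per node."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        # Assumed scheme: CRC32 of the key value, modulo the node count.
        digest = zlib.crc32(str(row[dist_key]).encode("utf-8"))
        nodes[digest % num_nodes].append(row)
    for node_rows in nodes:
        node_rows.sort(key=lambda r: r[sort_key])
    return nodes

# Hypothetical rows; the column names are borrowed from the listing table.
rows = [
    {"sellerid": 7, "dateid": 1830},
    {"sellerid": 3, "dateid": 1827},
    {"sellerid": 7, "dateid": 1828},
    {"sellerid": 9, "dateid": 1829},
    {"sellerid": 3, "dateid": 1831},
]

nodes = distribute(rows, dist_key="sellerid", sort_key="dateid", num_nodes=2)
for i, node_rows in enumerate(nodes):
    print(i, node_rows)
```

Rows with the same distribution key always land on the same node, so each node can work on its share independently.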


• Backups to Amazon S3 are automatic, continuous and incremental

• Configurable system snapshot retention period

• Take user snapshots on-demand

• Streaming restores enable you to resume querying faster


• Resize while remaining online

• Provision a new cluster in the background

• Copy data in parallel from node to node

• Only charged for source cluster


• Automatic SQL endpoint switchover via DNS

• Decommission the source cluster

• Simple operation via Console or API


Amazon Redshift lets you start small and grow big

Extra Large Node (DW.HS1.XL)
• 3 spindles, 2 TB, 16 GB RAM, 2 cores
• Single Node (2 TB)
• Cluster 2-32 Nodes (4 TB – 64 TB)

Eight Extra Large Node (DW.HS1.8XL)
• 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE
• Cluster 2-100 Nodes (32 TB – 1.6 PB)

Note: Nodes not to scale

Amazon Redshift is priced to let you analyze all your data

                     Price Per Hour for       Effective Hourly   Effective Annual
                     HS1.XL Single Node       Price per TB       Price per TB
On-Demand            $ 0.850                  $ 0.425            $ 3,723
1 Year Reservation   $ 0.500                  $ 0.250            $ 2,190
3 Year Reservation   $ 0.228                  $ 0.114            $ 999

Simple Pricing
• Number of Nodes x Cost per Hour
• No charge for Leader Node
• No upfront costs
• Pay as you go
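The per-TB figures in the table follow from the hourly node price, the 2 TB capacity of a DW.HS1.XL node, and 8,760 hours in a year. A small sketch of that arithmetic (the helper names are made up for this example):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760
TB_PER_XL_NODE = 2         # DW.HS1.XL capacity, from the node slide

# Hourly price of one HS1.XL node, as listed in the table.
node_price_per_hour = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}

def effective_prices(price_per_hour, tb_per_node=TB_PER_XL_NODE):
    """Return (hourly price per TB, annual price per TB) for one node."""
    per_tb_hour = price_per_hour / tb_per_node
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return per_tb_hour, per_tb_year

def cluster_cost_per_hour(num_nodes, price_per_hour):
    """Simple pricing: number of nodes x cost per hour (leader node is free)."""
    return num_nodes * price_per_hour

for plan, price in node_price_per_hour.items():
    per_tb_hour, per_tb_year = effective_prices(price)
    print(f"{plan}: ${per_tb_hour:.3f}/TB/hour, ${per_tb_year:,.0f}/TB/year")
```

The 3 Year Reservation row works out to about $999 per TB per year, which is where the "less than $1,000/TB/Year" headline comes from.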

Amazon Redshift has security built in

• SSL to secure data in transit

• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disk and in Amazon S3 encrypted

• No direct access to compute nodes

• Amazon VPC support

[Diagram labels: JDBC/ODBC; Customer VPC; Internal VPC; 10 GigE (HPC); Ingestion, Backup, Restore]

Amazon Redshift automatically manages data replication and hardware failures

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times

• Backups to Amazon S3 are continuous, automatic, and incremental
– Designed for eleven nines of durability

• Continuous monitoring and automated recovery from failures of drives and nodes

• Able to restore snapshots to any Availability Zone within a region

Growing ecosystem

AWS Marketplace

• Find software to use with Amazon Redshift

• One-click deployments

• Flexible pricing options

http://aws.amazon.com/marketplace

Over 40 new features since launch on Feb 14

• Regions
– N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney

• Certifications
– PCI, SOC 1/2/3

• Security
– Load/unload encrypted files, Resource-level IAM, Temporary credentials

• Manageability
– Snapshot sharing, backup/restore progress indicators

• Query
– Regex, Cursors, MD5, SHA1, Time zone, workload queue timeout

• Ingestion
– S3 Manifest, LZOP/LZO, JSON built-ins, UTF-8 4byte, invalid character substitution, CSV, auto datetime format detection, epoch

Amazon Redshift – What’s Next

Security, visibility and control

• Audit logging

• SNS Alerts

Audit logging
• Database Activity: Logins, Login failures, Queries, Loads
• System Activity: Creates, Changes, Deletes, Resizes

[Diagram labels: Amazon Redshift, Amazon S3, AWS CloudTrail]

SNS Alerts
• Monitoring, Security, Maintenance, Errors

[Diagram labels: Amazon Redshift, SNS Topic]

Batch operations

• Cluster Creation

• Faster Resize

[Diagram labels: Amazon Redshift, Amazon S3, Amazon EMR, Amazon EC2, Corporate Data Center]

[Chart values: 15-20 min, 3 min]

[Chart values: 29 hours, 7 hours]

Performance & Concurrency

[Chart values: 692.8s, 34.9s, < 2%]

[Chart values: 5,951.7s, 2,151.9s]

[Chart values: 15, 50]

Service Launch (2/14)

PDX (4/2)

Temp Credentials (4/11)

Unload Encrypted Files

DUB (4/25)

NRT (6/5)

JDBC Fetch Size (6/27)

Unload logs (7/5)

4 byte UTF-8 (7/18)

Statement Timeout (7/22)

SHA1 Builtin (7/15)

Timezone, Epoch, Autoformat (7/25)

WLM Timeout/Wildcards (8/1)

CRC32 Builtin, CSV, Restore Progress (8/9)

UTF-8 Substitution (8/29)

JSON, Regex, Cursors (9/10)

Split_part, Audit tables (10/3)

SIN/SYD (10/8)

HSM Support (11/11)

EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, WLM Memory Management (11/13)

SOC1/2/3 (5/8)

Sharing snapshots (7/18)

Resource Level IAM (8/9)

PCI (8/22)

Feature Delivery

6 weeks left

Redshift Customers at re:Invent

• BDT 101: Big Data ‘State of the Union’ (earlier today)

• DAT 305: Getting Maximum Performance from Amazon Redshift (Wednesday 11/13: 3pm in Murano 3303)

• DAT 306: How Amazon.com is Leveraging Amazon Redshift (Thursday 11/14: 3pm in Murano 3303)

• DAT 205: Amazon Redshift in Action: Enterprise, Big Data, SaaS (Friday 11/15: 9am in Lido 3006)

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

DAT 103
