Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Tina Adams, Amazon Redshift
Brandon Davis, Cervello
Maneesh Joshi, SnapLogic
May 2014

Upload: snaplogic-inc

Post on 15-Jan-2015

DESCRIPTION

In this webinar, we discuss how the secret sauce of your business analytics strategy remains rooted in your approach, your methodologies, and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, along with tips and tricks for designing the right information architecture, data models, and other tactical optimizations. To learn more, visit: http://www.snaplogic.com/redshift-trial

TRANSCRIPT

Page 1: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Tina Adams, Amazon Redshift
Brandon Davis, Cervello
Maneesh Joshi, SnapLogic

May 2014

Page 2: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Featured Speakers

Page 3: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Agenda

• Amazon Redshift Feature and Market Update

• SnapLogic Case Studies with Amazon Redshift

• Demo: SnapLogic Free Trial for Amazon Redshift and RDS

• Cervello: Implementation Best Practices

Page 4: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Amazon Redshift

Page 5: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Amazon Redshift Architecture

• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution

• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH

• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB

[Diagram: SQL Clients/BI Tools connect over JDBC/ODBC to the Leader Node; the Leader Node and Compute Nodes are linked by 10 GigE (HPC); nodes are shown with 128GB RAM, 16TB disk, and 16 cores each; ingestion, backup, and restore go through Amazon S3 / DynamoDB / SSH]

Page 6: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Amazon Redshift is priced to let you analyze all your data

• Number of nodes x cost per hr
• No charge for leader node
• No upfront costs
• Pay as you go

DW1 (HDD)          | Price Per Hour for DW1.XL Single Node | Effective Annual Price per TB
-------------------+---------------------------------------+------------------------------
On-Demand          | $0.850                                | $3,723
1 Year Reservation | $0.500                                | $2,190
3 Year Reservation | $0.228                                | $999

DW2 (SSD)          | Price Per Hour for DW2.L Single Node  | Effective Annual Price per TB
-------------------+---------------------------------------+------------------------------
On-Demand          | $0.250                                | $13,688
1 Year Reservation | $0.161                                | $8,794
3 Year Reservation | $0.100                                | $5,498
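
The on-demand "Effective Annual Price per TB" figures appear to follow from the hourly price: hourly price x 8,760 hours per year / storage per node. A minimal sketch of that arithmetic, assuming 2 TB of storage per DW1.XL node and 0.16 TB per DW2.L node (these per-node sizes are inferred from the minimum cluster sizes on the architecture slide and are not stated on this slide):

```sql
-- Effective annual price per TB = hourly price * 8760 hours / TB per node.
-- The tb_per_node values are assumptions: 2 TB for DW1.XL, 0.16 TB for DW2.L.
SELECT node_type,
       hourly_price,
       tb_per_node,
       hourly_price * 8760 / tb_per_node AS annual_price_per_tb
FROM (
    SELECT 'DW1.XL on-demand'           AS node_type,
           CAST(0.850 AS DECIMAL(10,3)) AS hourly_price,
           CAST(2.00  AS DECIMAL(10,2)) AS tb_per_node
    UNION ALL
    SELECT 'DW2.L on-demand',
           CAST(0.250 AS DECIMAL(10,3)),
           CAST(0.16  AS DECIMAL(10,2))
) AS prices
ORDER BY node_type;
```

For the on-demand rows this works out to 0.850 x 8,760 / 2 = 3,723 and 0.250 x 8,760 / 0.16 = 13,687.5, in line with the $3,723 and $13,688 shown on the slide.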

Page 7: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Amazon Redshift Feature Delivery

Page 8: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Improved Concurrency

Before: 15

After: 50

Page 9: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

COPY from JSON

{
    "jsonpaths": [
        "$['id']",
        "$['name']",
        "$['location'][0]",
        "$['location'][1]",
        "$['seats']"
    ]
}

COPY venue
FROM 's3://mybucket/venue.json'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
JSON AS 's3://mybucket/venue_jsonpaths.json';
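
COPY assigns the jsonpaths expressions to the target table's columns in order, so the five expressions above need a five-column table. A sketch of a matching table and source record follows; the column names, types, and sample record are hypothetical, chosen only to line up with the paths:

```sql
-- Hypothetical target table: one column per jsonpaths entry, in the same order.
CREATE TABLE venue (
    venueid    INTEGER,       -- $['id']
    venuename  VARCHAR(100),  -- $['name']
    venuecity  VARCHAR(30),   -- $['location'][0]
    venuestate CHAR(2),       -- $['location'][1]
    venueseats INTEGER        -- $['seats']
);

-- A record in venue.json shaped like this (hypothetical values) would load as one row:
-- {"id": 1, "name": "Example Arena", "location": ["Springfield", "IL"], "seats": 12000}
```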

Page 10: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

COPY from Amazon Elastic MapReduce

COPY sales
FROM 'emr://j-1H7OUO3B52HI5/myoutput/part*'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';

[Diagram: Amazon EMR → Amazon Redshift]

Page 11: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

REGEXP_SUBSTR()

select email, regexp_substr(email,'@[^.]*') from users limit 5;

email                                       | regexp_substr
--------------------------------------------+---------------------
[email protected]                           | @nonnisiAenean
[email protected]                           | @lacusUtnec
[email protected]                           | @semperpretiumneque
[email protected]                           | @tristiquealiquet
[email protected]                           | @sodalesat

Page 12: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Resize Progress

• Progress indicator in console

• New API call

Page 13: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

ECDHE cipher suites for perfect forward secrecy over SSL

ECDHE-RSA & ECDHE-ECDSA cipher suites supported

Page 14: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Amazon Redshift integrates with multiple data sources

Amazon S3 Amazon EMR

Amazon Redshift

DynamoDB

Amazon RDS

Corporate Datacenter

Page 15: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Agenda

• Amazon Redshift Feature and Market Update

• SnapLogic Case Studies with Amazon Redshift

• Demo: SnapLogic Free Trial for Amazon Redshift and RDS

• Cervello: Implementation Best Practices

Page 16: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

The SnapLogic Platform for Elastic Integration Powering Analytics, Apps and APIs

Data Applications APIs

Page 17: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Why SnapLogic?

Multi-Point Orchestration
• SnapStore: 160+ Prebuilt Snaps
• Orchestration & Workflow

Modern Platform
• Elastic, Scale-out Architecture
• Hybrid: Cloud to Cloud and Cloud to Ground Use Cases

Faster Integration
• Easily Design, Monitor, Manage
• Deploy in Days not Months

Page 18: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Multi-Point: Comprehensive Connectivity

Snap your Apps: 160+ pre-built integrations

Page 19: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Software-defined Integration

Metadata

Data

• Streams: No data is stored/cached

• Secure: 100% standards-based

• Elastic: Scales out & handles data, app, API integration use cases

Hybrid Scale-out Architecture Respects Data Gravity

Page 20: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

International Hotel Chain Reservation Data Mgmt.

PAST
• 126 TB of hotel reservation data
• Prohibitive cost-per-query for analytics
• Unacceptable performance

PRESENT
• FedEx'ed 126 TB of data to load into Amazon Redshift
• Now run a daily sync of data changes (100-150GB) between on-premise and cloud with SnapLogic
• Enrich analytics with Twitter and Travelocity data
• Improved cost-per-query and performance

Page 21: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Mid-sized Pharma Creates Cloud Data Mart

[Diagram labels: Cloud to On-prem Snaplex; REST; Cloud to Cloud Snaplex; Metadata; Data]

• Consolidate DBs (Customer, Address, and Order) and SFDC (Contact and Account) into Redshift

• MicroStrategy is the visualization layer

Page 22: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Agenda

• Amazon Redshift Feature and Market Update

• SnapLogic Case Studies with Amazon Redshift

• Demo: SnapLogic Free Trial for Amazon Redshift and RDS

• Cervello: Implementation Best Practices

Page 23: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

DEMO

Page 24: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Agenda

• Amazon Redshift Feature and Market Update

• SnapLogic Case Studies with Amazon Redshift

• Demo: SnapLogic Free Trial for Amazon Redshift and RDS

• Cervello: Implementation Best Practices

Page 25: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Cervello Helps Clients Win With Data

Advise, Implement, Support

Enterprise Performance Management (Finance)
Customer Relationship Management (Sales & Marketing)
Data Management
Custom Development
Business Intelligence & Analytics (IT)

• We have offices in Boston, New York, Dallas and the UK
• Offshore development and support teams in Russia and India
• We partner with the leading on-premise and cloud technology companies

Page 26: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Implementation Case Study

• Hospitality industry analytics
– Detailed transactional data
– Weekly / monthly / yearly trend analysis
– Began with single-node cluster, adding nodes as data volumes grow

[Diagram: Source Data → ETL → Redshift → Analytics]

Page 27: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Best Practice #1: Choose The Right Pattern

Requirements
• Collect external data loads before merging with existing data
• Maintain history of cleansed and standardized source data
• Use data structures optimized for analytics
– Dimension and fact tables for analytics
– Aggregate tables

Design
• Staging tables
• History tables
• Star schema data warehouse
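
One way the staging, history, and star schema layers can fit together is sketched below. Table and column names are hypothetical, and the merge uses a delete-then-insert inside one transaction, which is a common approach and not necessarily the one used in this project.

```sql
-- Staging: holds one external load before it is merged.
CREATE TABLE stage_reservation (
    reservation_id BIGINT,
    hotel_key      INTEGER,
    stay_date      DATE,
    revenue        DECIMAL(12,2)
);

-- History: cleansed source rows, kept with the time they were loaded.
CREATE TABLE hist_reservation (
    reservation_id BIGINT,
    hotel_key      INTEGER,
    stay_date      DATE,
    revenue        DECIMAL(12,2),
    loaded_at      TIMESTAMP
);

-- Star schema fact table used by analytic queries.
CREATE TABLE fact_reservation (
    reservation_id BIGINT,
    hotel_key      INTEGER,
    stay_date      DATE,
    revenue        DECIMAL(12,2)
);

BEGIN;

-- Append the staged rows to history.
INSERT INTO hist_reservation
SELECT reservation_id, hotel_key, stay_date, revenue, GETDATE()
FROM stage_reservation;

-- Remove fact rows that the staged load supersedes.
DELETE FROM fact_reservation
USING stage_reservation
WHERE fact_reservation.reservation_id = stage_reservation.reservation_id;

-- Insert the staged rows as the current version.
INSERT INTO fact_reservation
SELECT reservation_id, hotel_key, stay_date, revenue
FROM stage_reservation;

COMMIT;
```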

Page 28: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Best Practice #2: Select the Right Node Type

Early Stages
• Performance was good with initial volumes and small data sets on single node
• Evaluated dense storage (dw1) and dense compute (dw2) nodes
• More opportunity to optimize design as volumes grew

Mature Stage
• Increased nodes to handle larger volumes
– Solution leverages dense storage (dw1) nodes
– Expected to stabilize between 10-20TB
• Have also seen smaller volumes that work really well in dense compute (dw2) nodes

Page 29: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Best Practice #3: Leverage MPP

Goals
• Spread data evenly across nodes while also optimizing join performance
• Distribution key and sort keys are primary considerations

[Diagram: Leader Node; Compute Node 1, Compute Node 2, Compute Node 3, ... Compute Node n]

Approach
• Initial fact table distribution key caused skewed data
• Changed to dimension foreign key with better distribution for 40%+ improvement in query times
• Surrogate keys on dimension tables
– Primary key
– Sort key and distribution key OR distribute to all nodes
– Sort on foreign keys in fact tables
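
The distribution and sort key choices above could look roughly like this in table DDL. This is a sketch with hypothetical table and column names; which key distributes evenly depends entirely on the data.

```sql
-- Small dimension: a full copy on every node, so joins to it need no redistribution.
CREATE TABLE dim_date (
    date_key      INTEGER NOT NULL,
    calendar_date DATE,
    PRIMARY KEY (date_key)
)
DISTSTYLE ALL
SORTKEY (date_key);

-- Larger dimension: the surrogate key is primary key, distribution key, and sort key.
CREATE TABLE dim_hotel (
    hotel_key  INTEGER NOT NULL,
    hotel_name VARCHAR(100),
    PRIMARY KEY (hotel_key)
)
DISTKEY (hotel_key)
SORTKEY (hotel_key);

-- Fact table: distributed on a dimension foreign key, sorted on foreign keys.
CREATE TABLE fact_reservation (
    reservation_id BIGINT  NOT NULL,
    hotel_key      INTEGER NOT NULL,
    date_key       INTEGER NOT NULL,
    revenue        DECIMAL(12,2)
)
DISTKEY (hotel_key)
SORTKEY (date_key, hotel_key);

-- Rough skew check: a few key values holding most of the rows
-- suggests the column is a poor distribution key.
SELECT hotel_key, COUNT(*) AS row_count
FROM fact_reservation
GROUP BY hotel_key
ORDER BY row_count DESC
LIMIT 10;
```

Distributing the fact table and dim_hotel on the same key keeps matching rows on the same node, which is the join optimization the goals refer to.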

Page 30: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Best Practice #4: Use Columnar Compression

Goals
• Reduce I/O workload by minimizing size of data stored on disk

Approach
• Started with compression settings based on general data types
– VARCHAR to TEXT255, INTEGER to MOSTLY16, etc.
– Iterate using ANALYZE COMPRESSION
• Redshift applies automatic compression during COPY
– Staging tables
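
A sketch of the "start from data types, then iterate" approach, using a hypothetical table. The encodings shown are only the starting points named on the slide, not tuned recommendations.

```sql
-- Starting encodings chosen by general data type (hypothetical table and columns).
CREATE TABLE fact_guest_stay (
    stay_id       BIGINT        ENCODE RAW,
    guest_segment VARCHAR(64)   ENCODE TEXT255,   -- VARCHAR to TEXT255
    room_nights   INTEGER       ENCODE MOSTLY16,  -- INTEGER to MOSTLY16
    revenue       DECIMAL(12,2) ENCODE RAW
);

-- After loading representative data, ask Redshift to suggest an encoding
-- per column, then compare the suggestions with the current settings.
ANALYZE COMPRESSION fact_guest_stay;
```

ANALYZE COMPRESSION only reports suggestions based on a sample of the loaded rows; applying them means recreating the table with the new encodings and reloading it.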

Page 31: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Best Practice #5: Load and Manage Data

• ETL and ELT
– ETL: First set of processes prepares data for analytics (business logic, standardization, validation)
– ELT: Second set of processes loads data into Redshift and transforms it into analytical structures

• Data management
– Enforce constraints within ETL processes
– Analyze after loads to update statistics
– Vacuum after large loads to existing tables, updates and deletes
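
Redshift treats primary and foreign key constraints as informational and does not enforce them, which is why the checks sit in the ETL. A minimal sketch of a post-load routine, with hypothetical table names:

```sql
-- Hypothetical staging and fact tables.
CREATE TABLE stage_reservation (
    reservation_id BIGINT,
    hotel_key      INTEGER,
    revenue        DECIMAL(12,2)
);

CREATE TABLE fact_reservation (
    reservation_id BIGINT,
    hotel_key      INTEGER,
    revenue        DECIMAL(12,2)
);

-- Constraint check done by the ETL: list keys that appear more than once
-- in the staged load. An empty result means the load is unique on the key.
SELECT reservation_id, COUNT(*) AS copies
FROM stage_reservation
GROUP BY reservation_id
HAVING COUNT(*) > 1;

-- Load the checked rows into the fact table.
INSERT INTO fact_reservation
SELECT reservation_id, hotel_key, revenue
FROM stage_reservation;

-- Update table statistics for the query planner.
ANALYZE fact_reservation;

-- Re-sort rows and reclaim space left by updates and deletes.
VACUUM fact_reservation;
```

VACUUM cannot run inside a transaction block, so it is issued as its own statement after the load commits.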

Page 32: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Bringing it All Together

• Analytic queries
– Minimize number of query columns to improve performance
– Most queries use SUM or COUNT
– Leveraging aggregate tables for monthly dashboards

• EXPLAIN long-running queries to help optimize design
– Sorting / merging within nodes and merging at leader node
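
A sketch of an aggregate table for a monthly dashboard and of inspecting a query plan. Table and column names are hypothetical.

```sql
-- Hypothetical detail-level fact table.
CREATE TABLE fact_reservation (
    reservation_id BIGINT,
    hotel_key      INTEGER,
    stay_date      DATE,
    revenue        DECIMAL(12,2)
);

-- Aggregate table: pre-computed COUNT and SUM per hotel and month,
-- so dashboards read far fewer rows than the detail table holds.
CREATE TABLE agg_monthly_revenue AS
SELECT hotel_key,
       DATE_TRUNC('month', stay_date) AS stay_month,
       COUNT(*)                       AS reservation_count,
       SUM(revenue)                   AS total_revenue
FROM fact_reservation
GROUP BY hotel_key, DATE_TRUNC('month', stay_date);

-- Show the plan for a slow query without running it. Selecting only the
-- needed columns keeps the columnar scan small.
EXPLAIN
SELECT hotel_key, SUM(revenue) AS total_revenue
FROM fact_reservation
WHERE stay_date >= '2014-01-01'
GROUP BY hotel_key;
```

In the plan output, redistribution or broadcast steps on large tables are a hint to revisit the distribution key, and sort or merge steps a hint to revisit the sort key.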

Page 33: Best Practices for Supercharging Cloud Analytics on Amazon Redshift

Learn more…

1. Try out the SnapLogic Free Trial for Amazon Redshift: http://snaplogic.com/redshift-trial

2. Learn more about Amazon Redshift at: http://aws.amazon.com/redshift

3. Learn more about Cervello at: http://mycervello.com/