Amazon Redshift SSD - Queries on TBs of data can run in a few seconds
FlyData: Amazon Redshift BENCHMARK Series 03

Upload: flydata-inc

Post on 05-Dec-2014


DESCRIPTION

We have run benchmarks to compare Redshift SSD instances to Redshift HDD instances. See our blog at https://flydata.com/blog/posts/with-amazon-redshift-ssd-querying-a-tb-of-data-took-less-than-10-seconds

TRANSCRIPT

Page 1: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

FlyData: Amazon Redshift BENCHMARK Series 03

Page 2: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Amazon Redshift HDD took 33.32 seconds to run our queries for 300GB data
Amazon Redshift SSD took 4.32 seconds to run our queries for 300GB data

Amazon Redshift SSD performed 8X faster

Takeaways:

• 1.2 TB can now be handled in under 10 seconds.
• Use cases could spread to ad-delivery optimization and financial trading systems.
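The "8X faster" headline is the ratio of the two 300GB timings quoted on this slide, rounded up. A minimal sketch of that arithmetic (the variable names are illustrative, not from the benchmark code):

```python
# Average query times for the 300GB data set, as quoted on this slide.
hdd_seconds = 33.32  # Amazon Redshift HDD
ssd_seconds = 4.32   # Amazon Redshift SSD

# Speedup = how many times longer the HDD cluster took.
speedup = hdd_seconds / ssd_seconds
print(f"SSD speedup: {speedup:.1f}x")  # prints "SSD speedup: 7.7x"
```

The exact ratio is about 7.7, which the slide rounds to 8X.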

Page 3: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Amazon Redshift is a popular data warehouse for big data in the cloud. AWS added the SSD instance type on January 24, 2014.

We have run benchmarks to compare Redshift SSD instances to Redshift HDD instances using the following parameters:
• Data size: 1.2TB and 300GB
• Query performance when querying against all records in the cluster
• Loading speed
• Cost comparison

Page 4: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

1. Query Speed for similar cluster sizes

• SSD version is faster.
• Query against 1.2TB (entire data set) took less than 10 seconds!
• For 1.2TB of data, comparing similar node sizes, query time: 9.22s (SSD, 12x dw2.large) vs 28.48s (HDD, 2x dw1.8xlarge)

* See Appendix for queries being used.

Comparison of query speed against dw1.xlarge (HDD) and dw2.large (SSD) for 1.2TB of data (clusters shown in order of cost).

Page 5: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

1. Query Speed at similar pricing points

• Query performance comparison based on a similar price point.
• 4 nodes of dw2.large cost: $0.25/hour * 4 nodes = $1.00/hour
• 1 node of dw1.xlarge costs: $0.85/hour
• Direct comparison is difficult, but we can see much better query performance for the dw2 (SSD) Redshift.

* See Appendix for queries being used.

Comparison of query speed for cluster configurations with similar pricing for 300GB of data.
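One way to make the price/performance comparison concrete is to multiply each cluster's hourly price by the time one query occupies it. The sketch below uses the prices on this slide and the 300GB averages from the appendix (4.315s and 33.3175s). "Cost per query" is an illustrative metric introduced here, not one the benchmark reports, and it assumes you pay only for the seconds a query runs, which is a simplification of hourly billing.

```python
# cluster name: (node count, USD per node-hour, average query seconds at 300GB)
clusters = {
    "4x dw2.large (SSD)": (4, 0.25, 4.315),
    "1x dw1.xlarge (HDD)": (1, 0.85, 33.3175),
}

def cost_per_query(nodes, price_per_node_hour, seconds):
    """USD of cluster time consumed while a single query runs."""
    return nodes * price_per_node_hour * seconds / 3600

for name, (nodes, price, seconds) in clusters.items():
    hourly = nodes * price
    per_query = cost_per_query(nodes, price, seconds)
    print(f"{name}: ${hourly:.2f}/hour, ${per_query:.5f} per query")
```

Under these assumptions the SSD cluster costs slightly more per hour but between six and seven times less per query.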

Page 6: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

2. Loading Time

• For similar cost (DW2: $1.00/hour vs DW1: $0.85/hour), loading time was 4.6x faster on SSD.
• For similar node sizes (DW2: 12 nodes vs DW1: 16 nodes), loading time was 1.65x faster on SSD.

* See Appendix for queries being used.

[Charts: loading time at similar cost; loading time at similar node count]

Page 7: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

3. Cost

DW2 is cheaper when data < 0.48TB

Pricing
        On-demand   RI1                  RI3
        Hourly      Upfront   Hourly     Upfront   Hourly
dw1     $0.85       $2500     $0.215     $3000     $0.114
dw2     $0.25       $750      $0.075     $1325     $0.05
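The 0.48TB break-even point can be reproduced from the on-demand prices. The sketch below assumes 2TB of storage per dw1 node and 0.16TB per dw2 node, which are the capacities implied by the cost-comparison slide in the appendix (1 DW1 node = 2TB, 4 DW2 nodes = 0.64TB). The function names are illustrative, and sizes are kept in whole GB to avoid floating-point rounding in the node count.

```python
# On-demand hourly prices per node, from the pricing table.
DW1_PRICE, DW2_PRICE = 0.85, 0.25
# Assumed storage per node in GB (2TB for dw1, 0.16TB for dw2).
DW1_NODE_GB, DW2_NODE_GB = 2000, 160

def nodes_needed(data_gb, node_gb):
    """Smallest whole number of nodes whose combined storage holds data_gb."""
    return max(1, -(-data_gb // node_gb))  # integer ceiling division

def hourly_cost(data_gb, node_gb, price):
    """On-demand USD per hour for a cluster just large enough for data_gb."""
    return nodes_needed(data_gb, node_gb) * price

for data_gb in (160, 320, 480, 640, 1920):
    dw1 = hourly_cost(data_gb, DW1_NODE_GB, DW1_PRICE)
    dw2 = hourly_cost(data_gb, DW2_NODE_GB, DW2_PRICE)
    cheaper = "DW2" if dw2 < dw1 else "DW1"
    print(f"{data_gb / 1000:.2f} TB: DW1 ${dw1:.2f}/h, DW2 ${dw2:.2f}/h -> {cheaper}")
```

Up to three dw2 nodes (0.48TB, $0.75/hour) undercut a single dw1 node at $0.85/hour; the fourth dw2 node brings the cluster to $1.00/hour and DW1 becomes the cheaper option.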

Page 8: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Summary

• Consider DW2 SSD Redshift
  – If query and loading performance is primary and cost considerations are secondary
  – If your data is smaller than 0.48TB
• Consider DW1 HDD Redshift
  – If current DW1 Redshift performance is sufficient
  – If DW2 costs are too expensive for your use case

Page 9: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

About Us - FlyData
• FlyData Enterprise
  – Enables continuous loading to Amazon Redshift, with real-time data loading
  – Automated ETL process with multiple supported data formats
  – Auto scaling, data integrity and high durability
  – FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift

Contact us at: [email protected]
We are an official data integration partner of Amazon Redshift

Page 10: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

APPENDIX

Page 11: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Appendix: Data Loaded for Testing

TSV files, gzip compressed

imp_log
  Size: 1) 300GB / 300M records  2) 1.2TB / 1.2B records
  Columns:
    date            datetime
    publisher_id    integer
    ad_campaign_id  integer
    bid_price       real
    country         varchar(30)
    attr1-4         varchar(255)

click_log
  Size: 1) 1.4GB / 1.5M records  2) 5.6GB / 6M records
  Columns:
    date            datetime
    publisher_id    integer
    ad_campaign_id  integer
    country         varchar(30)
    attr1-4         varchar(255)

Data set 1) covers 1 month; data set 2) covers 4 months.

ad_campaign
  Size: 100MB / 100k records

publisher
  Size: 10MB / 10k records

advertiser
  Size: 10MB / 10k records

We used 5 tables to run a query which joins tables and creates a report.

Page 12: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Appendix: Sample Query

select ac.ad_campaign_id as ad_campaign_id,
       adv.advertiser_id as advertiser_id,
       cs.spending as spending,
       ims.imp_total as imp_total,
       cs.click_total as click_total,
       click_total/imp_total as CTR,
       spending/click_total as CPC,
       spending/(imp_total/1000) as CPM
from ad_campaigns ac
join advertisers adv on (ac.advertiser_id = adv.advertiser_id)
join (select il.ad_campaign_id, count(*) as imp_total
      from imp_logs il
      group by il.ad_campaign_id) ims
  on (ims.ad_campaign_id = ac.ad_campaign_id)
join (select cl.ad_campaign_id, sum(cl.bid_price) as spending, count(*) as click_total
      from click_logs cl
      group by cl.ad_campaign_id) cs
  on (cs.ad_campaign_id = ac.ad_campaign_id);

The query generates a basic report of ad campaign performance: impression and click counts, advertiser spending, CTR, CPC, and CPM. The query runs against all data in the cluster.
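To see what the report contains, the same join can be tried on a handful of rows. The sketch below is a toy stand-in, not the benchmark setup: it uses Python's built-in sqlite3 instead of Redshift, the tables hold only the columns the query touches (including a bid_price column on click_logs, as the query requires), and the rows are invented for illustration. It also multiplies by 1.0 and divides by 1000.0 so that CTR and CPM are computed as fractions, because SQLite truncates integer/integer division (and Redshift likely does too, given its PostgreSQL heritage).

```python
import sqlite3

# In-memory toy database; table and column names follow the sample query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE advertisers  (advertiser_id INTEGER);
    CREATE TABLE ad_campaigns (ad_campaign_id INTEGER, advertiser_id INTEGER);
    CREATE TABLE imp_logs     (ad_campaign_id INTEGER);
    CREATE TABLE click_logs   (ad_campaign_id INTEGER, bid_price REAL);
""")
# Invented rows: campaign 10 has 8 impressions and 2 clicks,
# campaign 20 has 4 impressions and 1 click.
conn.execute("INSERT INTO advertisers VALUES (1)")
conn.executemany("INSERT INTO ad_campaigns VALUES (?, ?)", [(10, 1), (20, 1)])
conn.executemany("INSERT INTO imp_logs VALUES (?)", [(10,)] * 8 + [(20,)] * 4)
conn.executemany("INSERT INTO click_logs VALUES (?, ?)",
                 [(10, 0.5), (10, 1.5), (20, 0.25)])

REPORT_SQL = """
SELECT ac.ad_campaign_id,
       adv.advertiser_id,
       cs.spending,
       ims.imp_total,
       cs.click_total,
       1.0 * cs.click_total / ims.imp_total   AS ctr,  -- float division (adaptation)
       cs.spending / cs.click_total           AS cpc,
       cs.spending / (ims.imp_total / 1000.0) AS cpm   -- float division (adaptation)
FROM ad_campaigns ac
JOIN advertisers adv ON ac.advertiser_id = adv.advertiser_id
JOIN (SELECT ad_campaign_id, COUNT(*) AS imp_total
      FROM imp_logs GROUP BY ad_campaign_id) ims
  ON ims.ad_campaign_id = ac.ad_campaign_id
JOIN (SELECT ad_campaign_id, SUM(bid_price) AS spending, COUNT(*) AS click_total
      FROM click_logs GROUP BY ad_campaign_id) cs
  ON cs.ad_campaign_id = ac.ad_campaign_id
ORDER BY ac.ad_campaign_id
"""

report = conn.execute(REPORT_SQL).fetchall()
for row in report:
    print(row)
```

Each output row is one campaign: its advertiser, total spending, impression and click counts, then CTR (clicks per impression), CPC (spending per click), and CPM (spending per thousand impressions).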

Page 13: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Query Performance: Data Size = 1.2 TB

Query process time (1.2TB), Sample Query, in seconds

trial     12x DW2.large   1x DW1.xlarge   2x DW1.xlarge   2x DW1.8xlarge
1         15.3            163.85          61.44           39.11            (ignore)
2         8.8             148.65          52.89           26.77
3         9.71            157.65          53.76           29.9
4         9.12            155.91          53.52           27.51
5         9.24            149.04          52.22           29.75
average   9.2175          155.02          53.0975         28.4825

Page 14: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Query Performance: Data Size = 300GB

Query process time (300GB), Sample Query, in seconds

trial     4x DW2.large   1x DW1.xlarge
1         9.05           58              (ignore)
2         4.31           42.69
3         4.65           30.84
4         4.13           30.14
5         4.17           29.6
average   4.315          33.3175
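The averages in the two query-performance tables appear to be the mean of trials 2 through 5, with trial 1 (marked "ignore") left out as a warm-up run. A minimal sketch that recomputes them from the published trial times; the helper name warm_average is illustrative:

```python
from statistics import mean

# Trial timings in seconds, copied from the two query-performance tables.
TRIALS = {
    ("1.2TB", "12x DW2.large"): [15.3, 8.8, 9.71, 9.12, 9.24],
    ("1.2TB", "1x DW1.xlarge"): [163.85, 148.65, 157.65, 155.91, 149.04],
    ("1.2TB", "2x DW1.xlarge"): [61.44, 52.89, 53.76, 53.52, 52.22],
    ("1.2TB", "2x DW1.8xlarge"): [39.11, 26.77, 29.9, 27.51, 29.75],
    ("300GB", "4x DW2.large"): [9.05, 4.31, 4.65, 4.13, 4.17],
    ("300GB", "1x DW1.xlarge"): [58, 42.69, 30.84, 30.14, 29.6],
}

def warm_average(times):
    """Mean of every trial except the first, which the tables mark "ignore"."""
    return mean(times[1:])

averages = {key: round(warm_average(times), 4) for key, times in TRIALS.items()}
for (size, cluster), avg in averages.items():
    print(f"{size:>5}  {cluster:<15} {avg}")
```

Five of the six published averages match this rule. For 1x DW1.xlarge at 1.2TB the rule gives 152.8125, while the published 155.02 equals the mean of all five trials, so that cell may have included the first run.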

Page 15: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Appendix: Additional Information

• All resources for our benchmark are on our GitHub repository
  – https://github.com/hapyrus/redshift-benchmark
  – The dataset we use is open on S3, so you can reproduce the benchmark

Page 16: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Summary: Amazon Redshift Pricing

• DW1: Amazon Redshift (HDD)
• DW2: Amazon Redshift (SSD)
  – Cost is around 4x higher
  – If storage need is less than 0.48TB, then DW2 is cheaper

Page 17: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Cost comparison: 1x DW1.xlarge (2TB), 4x DW2.large (0.64TB), and 12x DW2.large (1.92TB)

Page 18: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

For the same storage space, the cost of DW2 SSD can be 5.2 times higher

Page 21: Amazon Redshift SSD - Queries on TBs of data can run in a few seconds

Additional Comments

• SSD could be 3.5x ~ 5x more expensive than HDD for the same amount of storage space (SSD is really optimized for performance)

• DW1.8xlarge is exactly 8 times a DW1.xlarge, but DW2.8xlarge is actually 16 times a DW2.large. This is because DW2.large nodes are not “xlarge”; a bit confusing… ;)

(as of Jan. 27, 2014)
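The per-TB price gap can be estimated from the pricing table. The sketch below reads RI1 and RI3 as 1-year and 3-year reserved instances, spreads each upfront fee over the full term, and assumes 2TB per dw1 node and 0.16TB per dw2 node (the capacities implied by the cost-comparison slide). The 8760 hours per year and the always-on cluster are assumptions made here for illustration.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year, cluster running continuously

# plan -> node type -> (upfront USD, hourly USD, term in years).
# On-demand has no upfront fee; its term only needs to be non-zero.
PRICING = {
    "on-demand": {"dw1": (0, 0.85, 1), "dw2": (0, 0.25, 1)},
    "RI1": {"dw1": (2500, 0.215, 1), "dw2": (750, 0.075, 1)},
    "RI3": {"dw1": (3000, 0.114, 3), "dw2": (1325, 0.05, 3)},
}
# Assumed storage per node in TB.
NODE_TB = {"dw1": 2.0, "dw2": 0.16}

def hourly_per_tb(upfront, hourly, years, node_tb):
    """Effective USD per TB-hour with the upfront fee spread over the term."""
    effective_hourly = upfront / (years * HOURS_PER_YEAR) + hourly
    return effective_hourly / node_tb

ratios = {}
for plan, nodes in PRICING.items():
    dw1 = hourly_per_tb(*nodes["dw1"], NODE_TB["dw1"])
    dw2 = hourly_per_tb(*nodes["dw2"], NODE_TB["dw2"])
    ratios[plan] = dw2 / dw1
    print(f"{plan}: DW2 costs {ratios[plan]:.1f}x DW1 per TB")
```

Under these assumptions the ratio runs from about 3.7x (on-demand) to about 5.5x (3-year reserved), roughly the same range as the 3.5x to 5x estimate above.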