Amazon Redshift SSD - Queries on TBs of data can run in a few seconds
DESCRIPTION
We have run benchmarks to compare Redshift SSD instances to Redshift HDD instances. See our blog at https://flydata.com/blog/posts/with-amazon-redshift-ssd-querying-a-tb-of-data-took-less-than-10-seconds
TRANSCRIPT
Amazon Redshift SSD - Queries on TBs of data can run in a few seconds
FlyData: Amazon Redshift BENCHMARK Series 03
Amazon Redshift HDD took 33.32 seconds to run our queries for 300GB data
Amazon Redshift SSD took 4.32 seconds to run our queries for 300GB data
Amazon Redshift SSD performed 8X faster
Takeaways:
• 1.2 TB can now be handled in under 10 seconds.
• Use cases could spread to ad-delivery optimization and financial trading systems.
Amazon Redshift is a popular data warehouse for big data on the cloud. AWS added the SSD instance type on January 24, 2014.
We have run benchmarks to compare Redshift SSD instances to Redshift HDD instances using the following parameters:
• Data Size: 1.2TB and 300GB
• Query performance when querying against all records in the cluster
• Loading speed
• Cost comparison
1. Query Speed for similar cluster sizes
• SSD version is faster.
• Query against 1.2TB (entire data set) took less than 10 seconds!
• For 1.2TB of data, comparing similar node sizes: query time was 9.22s (SSD, 12x dw2.large) vs 28.48s (HDD, 2x dw1.8xlarge)
* See Appendix for queries being used.
Chart: Comparison of query speed against dw1.xlarge (HDD) and dw2.large (SSD) for 1.2TB of data, shown in order of cost.
1. Query Speed at similar pricing points
• Query performance comparison based on similar pricing point.
• 4 nodes of dw2.large cost: $0.25/hour * 4 nodes = $1.00/hour
• 1 node of dw1.xlarge costs: $0.85/hour
• Direct comparison is difficult, but we can see much better query performance for the dw2 (SSD) Redshift.
* See Appendix for queries being used.
Comparison of query speed for cluster configurations with similar pricing for 300GB of data.
2. Loading Time
• For similar cost (DW2: $1.00/hour vs DW1: $0.85/hour), loading time was 4.6x faster on SSD.
• For similar node sizes (DW2: 12 nodes vs DW1: 16 nodes), loading time was 1.65x faster on SSD.
* See Appendix for queries being used.
Chart: Loading time at similar cost and at similar node count.
3. Cost
Pricing    On-demand   RI1                  RI3
           Hourly      Upfront   Hourly     Upfront   Hourly
dw1        $0.85       $2500     $0.215     $3000     $0.114
dw2        $0.25       $750      $0.075     $1325     $0.05
Chart: DW2 is cheaper when data < 0.48TB.
Summary
• Consider DW2 SSD Redshift
  – If query and loading performance is primary and cost considerations are secondary
  – If your data is smaller than 0.48TB
• Consider DW1 HDD Redshift
  – If current DW1 Redshift performance is sufficient
  – If DW2 costs are too expensive for your use case
About Us - FlyData
• FlyData Enterprise
  – Enables continuous loading to Amazon Redshift, with real-time data loading
  – Automated ETL process with multiple supported data formats
  – Auto scaling, data integrity and high durability
  – FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift
Contact us at: [email protected]
We are an official data integration partner of Amazon Redshift
APPENDIX
Appendix: Data Loaded for Testing
TSV files, gzip compressed
Imp_log
  1) 300GB / 300M records
  2) 1.2TB / 1.2B records
    date            datetime
    publisher_id    integer
    ad_campaign_id  integer
    bid_price       real
    country         varchar(30)
    attr1-4         varchar(255)
click_log
  1) 1.4GB / 1.5M records
  2) 5.6GB / 6M records
    date            datetime
    publisher_id    integer
    ad_campaign_id  integer
    country         varchar(30)
    attr1-4         varchar(255)
Log data set 1) is for 1 month; 2) is for 4 months.
ad_campaign
  100MB / 100k records
publisher
  10MB / 10k records
advertiser
  10MB / 10k records
We used 5 tables to run a query which joins tables and creates a report.
Appendix: Sample Query
  select
    ac.ad_campaign_id as ad_campaign_id,
    adv.advertiser_id as advertiser_id,
    cs.spending as spending,
    ims.imp_total as imp_total,
    cs.click_total as click_total,
    click_total/imp_total as CTR,
    spending/click_total as CPC,
    spending/(imp_total/1000) as CPM
  from ad_campaigns ac
  join advertisers adv on (ac.advertiser_id = adv.advertiser_id)
  join (
    select il.ad_campaign_id, count(*) as imp_total
    from imp_logs il
    group by il.ad_campaign_id
  ) ims on (ims.ad_campaign_id = ac.ad_campaign_id)
  join (
    select cl.ad_campaign_id, sum(cl.bid_price) as spending, count(*) as click_total
    from click_logs cl
    group by cl.ad_campaign_id
  ) cs on (cs.ad_campaign_id = ac.ad_campaign_id);
The query generates a basic report on ad campaign performance: impression and click counts, advertiser spending, CTR, CPC and CPM. The query runs against all data in the cluster.
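The three derived columns of the report can be sketched outside SQL as follows. This is a minimal illustration, not part of the benchmark code: the function name `campaign_metrics` and the sample numbers are hypothetical. The sketch uses floating-point division throughout; in SQL dialects where dividing two integer counts truncates, an expression such as click_total/imp_total may need a cast to return a fractional CTR.

```python
def campaign_metrics(spending, imp_total, click_total):
    """Return CTR, CPC and CPM for one campaign (hypothetical helper).

    Mirrors the report columns: CTR = clicks / impressions,
    CPC = spending / clicks, CPM = spending per 1000 impressions.
    A metric is None when its denominator is zero.
    """
    ctr = click_total / imp_total if imp_total else None
    cpc = spending / click_total if click_total else None
    cpm = spending / (imp_total / 1000) if imp_total else None
    return {"CTR": ctr, "CPC": cpc, "CPM": cpm}


# Made-up sample values, for illustration only.
metrics = campaign_metrics(spending=50.0, imp_total=100_000, click_total=500)
print(metrics)
```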
Query Performance: Data Size = 1.2 TB
Query process time (1.2TB), Sample Query
trial     12x DW2.large   1x DW1.xlarge   2x DW1.xlarge   2x DW1.8xlarge
1         15.3            163.85          61.44           39.11            (ignore)
2         8.8             148.65          52.89           26.77
3         9.71            157.65          53.76           29.9
4         9.12            155.91          53.52           27.51
5         9.24            149.04          52.22           29.75
average   9.2175          155.02          53.0975         28.4825
(In seconds)
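The averages in the 1.2TB table can be rechecked with a short sketch that skips the first trial, which is marked "ignore". The helper name `warm_average` is hypothetical. For three of the four clusters the mean of trials 2 to 5 reproduces the listed average; for 1x DW1.xlarge it gives 152.8125, while the listed 155.02 equals the mean of all five trials, which suggests the first run was included for that column.

```python
from statistics import mean

# Per-trial query times in seconds, copied from the 1.2TB table.
TRIALS_1_2TB = {
    "12x DW2.large": [15.3, 8.8, 9.71, 9.12, 9.24],
    "1x DW1.xlarge": [163.85, 148.65, 157.65, 155.91, 149.04],
    "2x DW1.xlarge": [61.44, 52.89, 53.76, 53.52, 52.22],
    "2x DW1.8xlarge": [39.11, 26.77, 29.9, 27.51, 29.75],
}


def warm_average(times):
    """Mean of trials 2-5, skipping the first run marked "ignore"."""
    return mean(times[1:])


averages = {cluster: warm_average(t) for cluster, t in TRIALS_1_2TB.items()}
for cluster, avg in averages.items():
    print(f"{cluster:15s} {avg:9.4f} s")

# SSD cluster vs. the larger HDD cluster, the comparison quoted in the deck.
speedup = averages["2x DW1.8xlarge"] / averages["12x DW2.large"]
print(f"12x DW2.large vs 2x DW1.8xlarge: {speedup:.2f}x faster")
```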
Query Performance: Data Size = 300GB
Query process time (300GB), Sample Query
trial     4x DW2.large   1x DW1.xlarge
1         9.05           58              (ignore)
2         4.31           42.69
3         4.65           30.84
4         4.13           30.14
5         4.17           29.6
average   4.315          33.3175
(In seconds)
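As a rough check of the "8X faster" headline, the sketch below divides the two 300GB averages and also weighs them by the on-demand hourly prices quoted earlier. The "cost per query" figure is only an illustrative metric introduced here: it assumes the cluster is charged for exactly the query's run time, which is not how a running cluster is billed.

```python
# Averages (seconds) from the 300GB table and on-demand prices (USD/hour).
SSD = {"name": "4x DW2.large", "avg_seconds": 4.315, "usd_per_hour": 4 * 0.25}
HDD = {"name": "1x DW1.xlarge", "avg_seconds": 33.3175, "usd_per_hour": 0.85}


def cost_per_query(cluster):
    """Illustrative cost of one query, billing only its run time."""
    return cluster["usd_per_hour"] * cluster["avg_seconds"] / 3600


speedup = HDD["avg_seconds"] / SSD["avg_seconds"]
cost_ratio = cost_per_query(HDD) / cost_per_query(SSD)

print(f"{SSD['name']} is {speedup:.2f}x faster than {HDD['name']}")
print(f"{HDD['name']} costs {cost_ratio:.2f}x more per query")
```

The speedup comes out a little under 8, which the deck rounds to 8X.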
Appendix: Additional Information
• All resources for our benchmark are on our GitHub repository
  – https://github.com/hapyrus/redshift-benchmark
  – The dataset we use is open on S3, so you can reproduce the benchmark
Summary: Amazon Redshift Pricing
• DW1: Amazon Redshift (HDD)
• DW2: Amazon Redshift (SSD)
  – Around 4x more expensive for the same amount of storage
  – If storage need is less than 0.48TB, then DW2 is cheaper
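The 0.48TB threshold can be reproduced with a small sketch under a few assumptions: on-demand prices only, one dw1.xlarge node holding 2TB and one dw2.large node holding 0.16TB (0.64TB / 4 nodes, as in the cost comparison), 1TB counted as 1000GB, and no headroom reserved for working space. The helper names are hypothetical.

```python
# Per-node storage (GB) and on-demand price (cents/hour) from the deck.
DW2_LARGE = {"gb": 160, "cents_per_hour": 25}
DW1_XLARGE = {"gb": 2000, "cents_per_hour": 85}


def nodes_needed(data_gb, node):
    """Smallest node count whose raw storage covers data_gb (at least 1)."""
    return max(1, -(-data_gb // node["gb"]))  # integer ceiling division


def hourly_cents(data_gb, node):
    """On-demand hourly cost, in cents, of a cluster sized for data_gb."""
    return nodes_needed(data_gb, node) * node["cents_per_hour"]


def largest_size_where_dw2_is_cheaper(limit_gb=2000):
    """Scan whole-GB sizes; return the largest one where DW2 costs less."""
    best = 0
    for data_gb in range(1, limit_gb + 1):
        if hourly_cents(data_gb, DW2_LARGE) < hourly_cents(data_gb, DW1_XLARGE):
            best = data_gb
    return best


print(largest_size_where_dw2_is_cheaper())  # 480 (GB), i.e. 0.48TB
```

Up to 480GB, three dw2.large nodes ($0.75/hour) undercut one dw1.xlarge ($0.85/hour); a fourth node brings DW2 to $1.00/hour.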
Cost comparison: 1x dw1.xlarge (2TB), 4x dw2.large (0.64TB) and 12x dw2.large (1.92TB)
For the same storage space, the cost of DW2 SSD can be 5.2 times higher
Additional Comments
• SSD could be 3.5x ~ 5x more expensive than HDD for the same amount of storage space (SSD is really optimized for performance)
• DW1.8xlarge is exactly 8 times a DW1.xlarge, but DW2.8xlarge is actually 16 times a DW2.large. This is because DW2.large nodes are not “xlarge”; a bit confusing… ;)
(as of Jan. 27, 2014)
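The 3.5x ~ 5x range can be approximated from the pricing table under "3. Cost". The sketch below compares 12x dw2.large (1.92TB) with 1x dw1.xlarge (2TB) and spreads each upfront fee evenly over the term. Two things are assumptions rather than statements from the deck: that RI1 and RI3 stand for 1-year and 3-year reserved terms, and that the cluster runs continuously for the whole term. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24

# Per-node prices from the pricing table: (upfront USD, hourly USD, term years).
# Reading RI1 / RI3 as 1-year / 3-year reserved terms is an assumption.
PRICING = {
    "On-demand": {"dw1": (0, 0.85, 1), "dw2": (0, 0.25, 1)},
    "RI1": {"dw1": (2500, 0.215, 1), "dw2": (750, 0.075, 1)},
    "RI3": {"dw1": (3000, 0.114, 3), "dw2": (1325, 0.05, 3)},
}


def effective_hourly(upfront, hourly, years):
    """Hourly rate with the upfront fee spread evenly over the term."""
    return hourly + upfront / (years * HOURS_PER_YEAR)


def dw2_to_dw1_ratio(plan, dw2_nodes=12, dw1_nodes=1):
    """Cost of the DW2 cluster relative to the DW1 cluster for one plan."""
    prices = PRICING[plan]
    dw2 = dw2_nodes * effective_hourly(*prices["dw2"])
    dw1 = dw1_nodes * effective_hourly(*prices["dw1"])
    return dw2 / dw1


ratios = {plan: dw2_to_dw1_ratio(plan) for plan in PRICING}
for plan, ratio in ratios.items():
    print(f"{plan:10s} {ratio:.2f}x")
```

Under these assumptions the ratio is roughly 3.5x on demand and a little over 5x on the longest reserved term, in line with the range quoted above.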