Evaluation of Cloudera Impala 1.1

Post on 15-Jan-2015

Category:

Technology

DESCRIPTION

I evaluated Impala 1.1 on our cluster environment.

TRANSCRIPT

  • 1. Evaluation of Cloudera Impala 1.1
      Aug 7, 2013
      CELLANT Corp. R&D Strategy Division
      Yukinori SUDA (@sudabon)
      Copyright CELLANT Corp. All Rights Reserved. http://www.cellant.jp/

  • 2. Cloudera Impala 1.1 was released!!
      - Sentry support:
          - Fine-grained authorization
          - Role-based authorization
      - Support for views
      - Performance improvements:
          - Parquet columnar performance
          - More efficient metadata refresh for larger installations
      - Additional SQL:
          - SQL-89 joins (in addition to existing SQL-92)
          - LOAD function
          - REFRESH command for JDBC/ODBC
      - Improved HBase support:
          - Binary types
          - Caching configuration
      - Fixed many bugs

  • 3. Check compatibility of VIEW
      - Hive to Impala:
          - In the Impala shell, can we read data from a VIEW that was created with a Hive command?
      - Impala to Hive:
          - In the Hive shell, can we read data from a VIEW that was created with an Impala command?
      - Result: VIEWs are compatible in both directions

  • 4. Check performance (Hive on Cluster1)
      Storage format   Compression   Avg. job latency [sec]
      TextFile         No Comp.      222.039
      SequenceFile     Gzip          244.67
      SequenceFile     Snappy        239.182
      RCFile           Gzip          228.801
      RCFile           Snappy        230.327
      This result is not valid as a performance evaluation because some data may have been read remotely. See the slide "Check performance (Hive on Cluster2)".

  • 5. Check performance (Impala on Cluster1)
      Storage format   Compression   Avg. job latency [sec]
      TextFile         No Comp.      23.518
      SequenceFile     Gzip          32.155
      SequenceFile     Snappy        28.617
      RCFile           Gzip          20.774
      RCFile           Snappy        12.654
      ParquetFile      Snappy        13.146
      This result is not valid as a performance evaluation because some data may have been read remotely. See the slide "Check performance (Impala on Cluster2)".

  • 6. Check performance (Hive on Cluster2)
      Storage format   Compression   Avg. job latency [sec]
      TextFile         No Comp.      272.176
      SequenceFile     Gzip          249.531
      SequenceFile     Snappy        245.009
      RCFile           Gzip          230.034
      RCFile           Snappy        216.802

  • 7. Check performance (Impala on Cluster2)
      Storage format   Compression   Avg. job latency [sec]
      TextFile         No Comp.      32.528
      SequenceFile     Gzip          28.73
      SequenceFile     Snappy        21.173
      RCFile           Gzip          24.794
      RCFile           Snappy        14.308
      ParquetFile      Snappy        19.814

  • 8. Check fixed bug
      - IMPALA-357:
          - Insert into Parquet exceeds mem-limit
      - Problem:
          - Even with the mem_limit setting in place, memory consumption is not limited when creating a ParquetFile table with partitions.
          - Eventually, Impalad crashes due to memory shortage.
      - Result: the CREATE command failed due to the memory limit

  • 9. Summary
      - Thanks to the dev. team, Impala is also going from Good to Great
      - Both VIEW and Parquet are ready to use
      - Performance:
          - RCFile+Snappy is the fastest on both Cluster1 and Cluster2
          - With larger tables, Parquet+Snappy may be the fastest
      - Hopes for future extensions:
          - Support for structure types
          - Support for UDF/UDTF, etc.

  • 10. Appendix. Benchmark Details

  • 11. Our System Environment (Cluster1)
      - Installed using Cloudera Manager Free Edition 4.6.0
      - All servers are connected with 1Gbps Ethernet through an L2 switch
      - Master (3 servers): Active NameNode, Stand-by NameNode, JobTracker, statestored
      - Slave (14 servers):
          - 10 servers running DataNode, TaskTracker and Impalad
          - 4 servers running DataNode only

  • 12. Our System Environment (Cluster2)
      - Installed using Cloudera Manager Free Edition 4.6.0
      - All servers are connected with 1Gbps Ethernet through an L2 switch
      - Master (3 servers): Active NameNode, Stand-by NameNode, JobTracker, statestored
      - Slave (10 servers):
          - 10 servers running DataNode, TaskTracker and Impalad
          - The 4 DataNode-only servers are decommissioned

  • 13. Our Server Specification
      - CPU: Intel Core 2 Duo 2.13 GHz with Hyper Threading
      - Memory:
          - 8GB: NameNodes only
          - 4GB: others
      - Disk: 7,200 rpm SATA mechanical hard disk drive * 1
      - OS: CentOS 6.3

  • 14. Benchmark
      - Used CDH4.3.0 + Impala 1.1
      - Used hivebench from the open-source benchmark tool HiBench
          - https://github.com/hibench
      - Modified the datasets to 1/10 scale
          - The default configuration generates a table with 1 billion rows
      - Modified the query
          - Deleted "INSERT INTO TABLE" to evaluate read-only performance
      - Combined several storage formats with several compression methods
          - TextFile, SequenceFile, RCFile, ParquetFile
          - No compression, Gzip, Snappy
      - Compared by query job latency
      - Average job latency over 5 measurements
      - Benchmarked on both Cluster1 and Cluster2

  • 15. Modified Datasets
      Uservisits table: 100 million rows, 16,895 MB as TextFile
      Table definitions:
          sourceIP       string
          destURL        string
          visitDate      string
          adRevenue      double
          userAgent      string
          countryCode    string
          languageCode   string
          searchWord     string
          duration       int
      Rankings table: 12 million rows, 744 MB as TextFile
      Table definitions:
          pageURL        string
          pageRank       int
          avgDuration    int

  • 16. SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank) FROM rankings_t R JOIN [BROADCAST] ( SELECT sourceIP, destURL, adRevenue FROM uservisits_t UV WHERE (datediff(UV.visitDate, '1999-01-01')>=0 AND datediff(UV.visitDate, '2000-01-01')
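
The latency figures on slides 4 to 7 can be compared directly by dividing each Hive latency by the Impala latency for the same storage format and compression. The following is a minimal sketch of that arithmetic, not part of the original deck. It assumes the chart values appear in the same order as the chart labels; the names `HIVE`, `IMPALA`, `speedups` and `fastest` are illustrative. Note that the slides themselves flag the Cluster1 figures as unreliable because some data may have been read remotely.

```python
# Average job latency [sec] copied from slides 4 to 7 of the transcript.
# Keys are (storage format, compression). Assumption: values are listed in
# the same order as the chart labels. All names here are illustrative.
HIVE = {
    "Cluster1": {
        ("TextFile", "No Comp."): 222.039,
        ("SequenceFile", "Gzip"): 244.67,
        ("SequenceFile", "Snappy"): 239.182,
        ("RCFile", "Gzip"): 228.801,
        ("RCFile", "Snappy"): 230.327,
    },
    "Cluster2": {
        ("TextFile", "No Comp."): 272.176,
        ("SequenceFile", "Gzip"): 249.531,
        ("SequenceFile", "Snappy"): 245.009,
        ("RCFile", "Gzip"): 230.034,
        ("RCFile", "Snappy"): 216.802,
    },
}

IMPALA = {
    "Cluster1": {
        ("TextFile", "No Comp."): 23.518,
        ("SequenceFile", "Gzip"): 32.155,
        ("SequenceFile", "Snappy"): 28.617,
        ("RCFile", "Gzip"): 20.774,
        ("RCFile", "Snappy"): 12.654,
        ("ParquetFile", "Snappy"): 13.146,
    },
    "Cluster2": {
        ("TextFile", "No Comp."): 32.528,
        ("SequenceFile", "Gzip"): 28.73,
        ("SequenceFile", "Snappy"): 21.173,
        ("RCFile", "Gzip"): 24.794,
        ("RCFile", "Snappy"): 14.308,
        ("ParquetFile", "Snappy"): 19.814,
    },
}


def speedups(hive, impala):
    """Return the Hive/Impala latency ratio for every configuration
    that was measured on both engines."""
    return {key: hive[key] / impala[key] for key in hive if key in impala}


def fastest(latencies):
    """Return the (format, compression) pair with the lowest latency."""
    return min(latencies, key=latencies.get)


if __name__ == "__main__":
    for cluster in ("Cluster1", "Cluster2"):
        print(cluster, "fastest on Impala:", fastest(IMPALA[cluster]))
        for key, ratio in speedups(HIVE[cluster], IMPALA[cluster]).items():
            print(f"  {key[0]} / {key[1]}: Hive is {ratio:.1f}x slower")
```

ParquetFile has no Hive measurement in the deck, so it is left out of the ratios; `fastest` still includes it when ranking the Impala results.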
