Evaluation of Cloudera Impala 1.1

Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/ Evaluation of Cloudera Impala 1.1, Aug 7, 2013, CELLANT Corp. R&D Strategy Division, Yukinori SUDA (@sudabon)

Upload: yukinori-suda

Post on 15-Jan-2015


DESCRIPTION

I evaluated Impala 1.1 on our cluster environment.

TRANSCRIPT

Page 1: Evaluation of cloudera impala 1.1


Evaluation of Cloudera Impala 1.1

Aug 7, 2013
CELLANT Corp. R&D Strategy Division

Yukinori SUDA
@sudabon

Page 2: Evaluation of cloudera impala 1.1


Cloudera Impala 1.1 was released!

• Sentry support:
– Fine-grained authorization
– Role-based authorization
• Support for views
• Performance improvements
– Parquet columnar performance
– More efficient metadata refresh for larger installations
• Additional SQL
– SQL-89 joins (in addition to existing SQL-92)
– LOAD function
– REFRESH command for JDBC/ODBC
• Improved HBase support:
– Binary types
– Caching configuration
• Fixed many bugs

Page 3: Evaluation of cloudera impala 1.1


Check compatibility of “VIEW”

• Hive ⇒ Impala
– On the Impala shell, can data be read from a “VIEW” that was created via a Hive command?
• Impala ⇒ Hive
– On the Hive shell, can data be read from a “VIEW” that was created via an Impala command?
• Result
– The two kinds of “VIEW” are compatible

Page 4: Evaluation of cloudera impala 1.1


Check performance (Hive on Cluster1)

Avg. job latency [sec] by storage format and compression (bar chart):

• TextFile, no compression: 222.039
• SequenceFile, Gzip: 244.67
• SequenceFile, Snappy: 239.182
• RCFile, Gzip: 228.801
• RCFile, Snappy: 230.327

Note: this result is not valid as a performance evaluation because some data may have been read remotely. See the slide “Check performance (Hive on Cluster2)”.

Page 5: Evaluation of cloudera impala 1.1


Check performance (Impala on Cluster1)

Avg. job latency [sec] by storage format and compression (bar chart):

• TextFile, no compression: 23.518
• SequenceFile, Gzip: 32.155
• SequenceFile, Snappy: 28.617
• RCFile, Gzip: 20.774
• RCFile, Snappy: 12.654
• ParquetFile, Snappy: 13.146

Note: this result is not valid as a performance evaluation because some data may have been read remotely. See the slide “Check performance (Impala on Cluster2)”.

Page 6: Evaluation of cloudera impala 1.1


Check performance (Hive on Cluster2)

Avg. job latency [sec] by storage format and compression (bar chart):

• TextFile, no compression: 272.176
• SequenceFile, Gzip: 249.531
• SequenceFile, Snappy: 245.009
• RCFile, Gzip: 230.034
• RCFile, Snappy: 216.802

Page 7: Evaluation of cloudera impala 1.1


Check performance (Impala on Cluster2)

Avg. job latency [sec] by storage format and compression (bar chart):

• TextFile, no compression: 32.528
• SequenceFile, Gzip: 28.73
• SequenceFile, Snappy: 21.173
• RCFile, Gzip: 24.794
• RCFile, Snappy: 14.308
• ParquetFile, Snappy: 19.814
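The Cluster2 measurements from the Hive and Impala slides can be compared directly. The sketch below is not part of the original deck: it assumes the bar values are listed in the same order as the chart labels, and the names `HIVE`, `IMPALA`, `speedups` and `fastest` are illustrative. Under that assumption, Impala comes out roughly 8x to 15x faster than Hive for the same format and compression.

```python
# Avg. job latency [sec] on Cluster2, copied from the Hive and Impala slides.
# Assumption: values appear in the same order as the chart labels.
HIVE = {
    ("TextFile", "No Comp."): 272.176,
    ("SequenceFile", "Gzip"): 249.531,
    ("SequenceFile", "Snappy"): 245.009,
    ("RCFile", "Gzip"): 230.034,
    ("RCFile", "Snappy"): 216.802,
}
IMPALA = {
    ("TextFile", "No Comp."): 32.528,
    ("SequenceFile", "Gzip"): 28.73,
    ("SequenceFile", "Snappy"): 21.173,
    ("RCFile", "Gzip"): 24.794,
    ("RCFile", "Snappy"): 14.308,
    ("ParquetFile", "Snappy"): 19.814,  # Impala only; Hive was not measured
}


def speedups(hive, impala):
    """Hive latency divided by Impala latency for every combination both ran."""
    return {key: hive[key] / impala[key] for key in hive if key in impala}


def fastest(latencies):
    """Return the (format, compression) pair with the lowest latency."""
    return min(latencies, key=latencies.get)


# Print the combinations from largest to smallest speedup.
for (fmt, comp), ratio in sorted(
    speedups(HIVE, IMPALA).items(), key=lambda item: item[1], reverse=True
):
    print(f"{fmt:<13} {comp:<9} {ratio:5.1f}x")

print("Fastest on Impala:", fastest(IMPALA))
```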

Page 8: Evaluation of cloudera impala 1.1


Check fixed bug

• IMPALA-357
– Insert into Parquet exceeds mem-limit
• Problem
– Even when mem_limit is set, memory consumption is not limited when creating a ParquetFile table with partitions.
– Eventually, impalad crashes due to memory shortage
• Result
– The CREATE command failed due to the memory limit (the limit is now enforced)

Page 9: Evaluation of cloudera impala 1.1


Summary

• Thanks to the dev. team, Impala is also going from “Good to Great”
• Both “VIEW” and “Parquet” are ready for use
• Performance
– RCFile+Snappy is the fastest on both Cluster1 and Cluster2
– With a larger table, Parquet+Snappy may be the fastest
• Hope for future extensions
– Support structure types
– Support UDF/UDTF, etc.

Page 10: Evaluation of cloudera impala 1.1


Appendix. Benchmark Details

Page 11: Evaluation of cloudera impala 1.1


Our System Environment (Cluster1)

• Installed using Cloudera Manager Free Edition 4.6.0
• All servers are connected with 1 Gbps Ethernet through an L2 switch
• Master: 3 servers
– Active NameNode
– Stand-by NameNode
– JobTracker, statestored
• Slave: 14 servers
– 10 servers: DataNode, TaskTracker, Impalad
– 4 servers: DataNode only

Page 12: Evaluation of cloudera impala 1.1


Our System Environment (Cluster2)

• Installed using Cloudera Manager Free Edition 4.6.0
• All servers are connected with 1 Gbps Ethernet through an L2 switch
• Master: 3 servers
– Active NameNode
– Stand-by NameNode
– JobTracker, statestored
• Slave: 10 servers
– 10 servers: DataNode, TaskTracker, Impalad
– 4 DataNode-only servers: decommissioned

Page 13: Evaluation of cloudera impala 1.1


Our Server Specification

• CPU
– Intel Core 2 Duo 2.13 GHz with Hyper Threading
• Memory
– 8 GB: NameNodes only
– 4 GB: Others
• Disk
– 7,200 rpm SATA mechanical hard disk drive × 1
• OS
– CentOS 6.3

Page 14: Evaluation of cloudera impala 1.1


Benchmark

• Use CDH 4.3.0 + Impala 1.1
• Use hivebench in the open-source benchmark tool “HiBench”
– https://github.com/hibench
• Modified datasets to 1/10 scale
– Default configuration generates a table with 1 billion rows
• Modified the query
– Deleted “INSERT INTO TABLE …” to evaluate read-only performance
• Combined several storage formats with several compression methods
– TextFile, SequenceFile, RCFile, ParquetFile
– No compression, Gzip, Snappy
• Compared by job (query) latency
• Averaged job latency over 5 measurements
• Ran the benchmark on both Cluster1 and Cluster2
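The deck reports the average over 5 measurements but does not say how each latency was taken. Below is a minimal sketch of one way to do it, assuming wall-clock time measured around a command-line invocation. `average_latency` is a hypothetical helper, and the `impala-shell -q` call mentioned in the comment is only an example of what one might pass on a cluster.

```python
import statistics
import subprocess
import sys
import time


def average_latency(command, runs=5):
    """Run `command` `runs` times and return the mean wall-clock latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so failed runs are never averaged in.
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)


# Stand-in command so the sketch runs anywhere. On a cluster one might pass
# something like ["impala-shell", "-q", "<benchmark query>"] instead (assumption).
print(f"{average_latency([sys.executable, '-c', 'pass']):.3f} sec")
```

Timing the whole client invocation includes client start-up and connection time, which matters more for Impala's short queries than for Hive's multi-minute jobs.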

Page 15: Evaluation of cloudera impala 1.1


Modified Datasets

• Uservisits table
– 100 million rows
– 16,895 MB as TextFile
– Table definition
  • sourceIP string
  • destURL string
  • visitDate string
  • adRevenue double
  • userAgent string
  • countryCode string
  • languageCode string
  • searchWord string
  • duration int
• Rankings table
– 12 million rows
– 744 MB as TextFile
– Table definition
  • pageURL string
  • pageRank int
  • avgDuration int

Page 16: Evaluation of cloudera impala 1.1


Modified Query

SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank)
FROM rankings_t R
JOIN [BROADCAST] (
  SELECT sourceIP, destURL, adRevenue
  FROM uservisits_t UV
  WHERE (datediff(UV.visitDate, '1999-01-01')>=0
    AND datediff(UV.visitDate, '2000-01-01')<=0)
) NUV
ON (R.pageURL = NUV.destURL)
group by sourceIP
order by totalRevenue DESC
limit 1;
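To make clear what the modified query computes, here is a small sketch that runs an equivalent query on toy data. It is an illustration only: Python's built-in `sqlite3` stands in for Impala, the `[BROADCAST]` hint is dropped, the two `datediff` conditions are replaced by an equivalent string comparison on ISO dates, and `uservisits_t` is cut down to the four columns the query uses. The helper name `top_revenue_source` and all sample rows are made up.

```python
import sqlite3


def top_revenue_source(rankings, uservisits):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top-earning source IP."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE rankings_t (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)"
    )
    # Simplified: only the columns referenced by the query.
    con.execute(
        "CREATE TABLE uservisits_t "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    con.executemany("INSERT INTO rankings_t VALUES (?, ?, ?)", rankings)
    con.executemany("INSERT INTO uservisits_t VALUES (?, ?, ?, ?)", uservisits)
    row = con.execute(
        """
        SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
        FROM rankings_t R
        JOIN (
            SELECT sourceIP, destURL, adRevenue
            FROM uservisits_t UV
            WHERE UV.visitDate >= '1999-01-01' AND UV.visitDate <= '2000-01-01'
        ) NUV
        ON (R.pageURL = NUV.destURL)
        GROUP BY sourceIP
        ORDER BY totalRevenue DESC
        LIMIT 1
        """
    ).fetchone()
    con.close()
    return row


# Made-up sample rows.
rankings = [("a.example", 10, 5), ("b.example", 30, 7)]
uservisits = [
    ("1.1.1.1", "a.example", "1999-03-01", 1.5),
    ("1.1.1.1", "b.example", "1999-06-01", 2.0),
    ("2.2.2.2", "a.example", "1999-07-01", 3.0),
    ("2.2.2.2", "b.example", "2001-01-01", 9.0),  # outside the date window
]
print(top_revenue_source(rankings, uservisits))
```

The last sample row has the largest revenue but falls outside the date window, so the inner filter drops it before the join and the other source IP ranks first.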

Page 17: Evaluation of cloudera impala 1.1


Thanks!
I want to use TPC in the next evaluation…