impala 2.0 update #impalajp

Impala 2.0 Update, Sho Shimauchi, Cloudera, 2014/10/31

Upload: cloudera-japan

Post on 28-May-2015


DESCRIPTION

http://connpass.com/event/9031/

TRANSCRIPT

Page 1: Impala 2.0 Update #impalajp

Impala 2.0 Update
Sho Shimauchi, Cloudera
2014/10/31

Today's Topic

• What is Cloudera Impala?
• Impala 1.4 / 2.0 update
  • Performance Improvement
  • Query Language
  • Resource Management and Security
  • Others

Who am I?

• Pre-sales Solutions Architect
• joined Cloudera in 2011, the first Japanese employee at Cloudera
• email: [email protected]
• twitter: @shiumachi

Cloudera Impala

What is Impala?

• MPP SQL query engine for Hadoop environment
  • written in native code for maximum hardware efficiency
• open-source!
  • http://impala.io/
• Supported by Cloudera, Amazon, and MapR
• History
  • 2012/10 Public Beta released
  • 2013/04 Impala 1.0 released
  • current version: Impala 2.0

Impala is easy to use

• create tables as virtual views over data stored in HDFS / HBase
• schema metadata is stored in Metastore
  • shared with Hive, Pig, etc.
• connect via ODBC / JDBC
• authenticate via Kerberos / LDAP
• run standard SQL
  • ANSI SQL-92 based
  • limited to SELECT and bulk INSERT
  • no correlated subqueries before 2.0 (available in 2.0)
  • UDF / UDAF

Impala 1.4 (2014/07)

• DECIMAL(<precision>, <scale>)
• HDFS caching DDL
• column definition based on Parquet file (CREATE TABLE … LIKE PARQUET)
• ORDER BY without LIMIT
• LDAP connections through TLS
• SHOW PARTITIONS
• YARN integrated resource manager will be production ready
• Llama HA support
• CREATE TABLE … STORED AS AVRO
• SUMMARY command in impala-shell (provides high-level summary of query plan)
• faster COMPUTE STATS
• Performance improvements for partition pruning
• impala-shell supports UTF-8 characters
• additional built-ins from EDW systems

Impala 2.0 (2014/10)

• hash table can spill to disk
  • join and aggregate tables of arbitrary size
• Subquery enhancements
  • allowed in WHERE clauses
  • EXISTS / NOT EXISTS
  • IN / NOT IN can operate on the result set from a subquery
  • correlated / uncorrelated subqueries
  • scalar subqueries
• SQL 2003 compliant analytic window functions
  • LEAD(), LAG(), RANK(), FIRST_VALUE(), etc.
• New Data Type: VARCHAR, CHAR
• Security Enhancements
  • multiple authentication methods
  • GRANT / REVOKE / CREATE ROLE / DROP ROLE / SHOW ROLES / etc.
• text + gzip / bzip2 / Snappy
• Hint inside views
• QUERY_TIMEOUT_S
• DATE_PART() / EXTRACT()
• Parquet default block size is changed to 256MB (was: 1GB)
• LEFT ANTI JOIN / RIGHT ANTI JOIN
• impala-shell can read settings from $HOME/.impalarc

Performance Improvement

HDFS caching

• When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory
• avoids checksumming and data copies
• new HDFS API is available in CDH 5.0
• configure cache with Impala DDL
  • CREATE TABLE tbl_name CACHED IN '<pool>'
  • ALTER TABLE tbl_name ADD PARTITION … CACHED IN '<pool>'

Partition Pruning improvement

• Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, Impala can now comfortably handle tables with tens of thousands of partitions.

Spilling to Disk SQL Operation

• writes temporary data to disk when Impala is close to exceeding its memory limit
• In PROFILE, the BlockMgr.BytesWritten counter reports how much data was written to disk during the query

Query Language

Subquery

Scalar subquery: produces a result set with a single row containing a single column

    SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2);

Uncorrelated subquery: does not refer to any tables from the outer block of the query

    SELECT x FROM t1 WHERE x IN (SELECT y FROM t2);

Correlated subquery: compares one or more values from the outer query block to values referenced in the WHERE clause of the subquery

    SELECT employee_name, employee_id FROM employees one WHERE
      salary > (SELECT avg(salary) FROM employees two WHERE one.dept_id = two.dept_id);
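The three subquery forms above can be tried without a cluster. The following is a minimal sketch that runs the same query shapes through Python's built-in sqlite3 module as a stand-in for Impala. The sample rows are invented for illustration, and SQLite is only assumed to behave like Impala for these simple cases.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Impala (assumption:
# these basic subquery forms behave the same way in both engines).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INT);
    CREATE TABLE t2 (y INT);
    INSERT INTO t1 VALUES (1), (2), (3), (4), (5), (6);
    INSERT INTO t2 VALUES (2), (4);
    CREATE TABLE employees (employee_name TEXT, employee_id INT,
                            dept_id INT, salary REAL);
    INSERT INTO employees VALUES
        ('alice', 1, 10, 100.0), ('bob', 2, 10, 200.0),
        ('carol', 3, 20, 300.0), ('dave', 4, 20, 500.0);
""")

# Scalar subquery: the inner query yields one row with one column.
scalar = [r[0] for r in conn.execute(
    "SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2) ORDER BY x")]

# Uncorrelated subquery: the inner query never mentions the outer table.
uncorrelated = [r[0] for r in conn.execute(
    "SELECT x FROM t1 WHERE x IN (SELECT y FROM t2) ORDER BY x")]

# Correlated subquery: the inner WHERE refers to the outer row's dept_id.
correlated = [r[0] for r in conn.execute("""
    SELECT employee_name FROM employees one
    WHERE salary > (SELECT avg(salary) FROM employees two
                    WHERE one.dept_id = two.dept_id)
    ORDER BY employee_name""")]

print(scalar, uncorrelated, correlated)
```

With this sample data, the correlated query keeps only the employees paid above their own department's average.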

Analytic Functions (a.k.a. Window Functions)

• supported in 2.0 and later
• supported functions
  • RANK() / DENSE_RANK()
  • FIRST_VALUE() / LAST_VALUE()
  • LAG() / LEAD()
  • ROW_NUMBER()
• Aggregate functions are already implemented
  • MAX(), MIN(), AVG(), SUM(), etc.

Analytic Functions Example

For each day, the query prints the closing price alongside the previous day's closing price:

    select stock_symbol, closing_date, closing_price,
      lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing"
      from stock_ticker
      order by closing_date;
    +--------------+---------------------+---------------+-------------------+
    | stock_symbol | closing_date        | closing_price | yesterday closing |
    +--------------+---------------------+---------------+-------------------+
    | JDR          | 2014-09-13 00:00:00 | 12.86         | NULL              |
    | JDR          | 2014-09-14 00:00:00 | 12.89         | 12.86             |
    | JDR          | 2014-09-15 00:00:00 | 12.94         | 12.89             |
    | JDR          | 2014-09-16 00:00:00 | 12.55         | 12.94             |
    | JDR          | 2014-09-17 00:00:00 | 14.03         | 12.55             |
    | JDR          | 2014-09-18 00:00:00 | 14.75         | 14.03             |
    | JDR          | 2014-09-19 00:00:00 | 13.98         | 14.75             |
    +--------------+---------------------+---------------+-------------------+
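To make the LAG() semantics concrete, here is a small pure-Python sketch that reproduces the "yesterday closing" column from the rows shown on this slide. The helper name lag_by_partition is hypothetical. It only illustrates what the window function computes and says nothing about how Impala evaluates it.

```python
from itertools import groupby
from operator import itemgetter

# Rows copied from the slide's output: (stock_symbol, closing_date, closing_price).
rows = [
    ("JDR", "2014-09-13", 12.86),
    ("JDR", "2014-09-14", 12.89),
    ("JDR", "2014-09-15", 12.94),
    ("JDR", "2014-09-16", 12.55),
    ("JDR", "2014-09-17", 14.03),
    ("JDR", "2014-09-18", 14.75),
    ("JDR", "2014-09-19", 13.98),
]

def lag_by_partition(rows):
    """Emulate LAG(price, 1) OVER (PARTITION BY symbol ORDER BY date)."""
    out = []
    # Sorting by (symbol, date) groups each partition and orders it by date.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        previous = None  # the first row of a partition has no predecessor (NULL)
        for symbol, date, price in group:
            out.append((symbol, date, price, previous))
            previous = price
    return out

result = lag_by_partition(rows)
for row in result:
    print(row)
```

The first row of each partition gets None, which corresponds to the NULL in the first line of the slide's result table.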

Approximation features

• APPX_COUNT_DISTINCT query option
  • rewrite COUNT(DISTINCT) calls to use NDV()
  • speeds up the operation
  • allows multiple COUNT(DISTINCT) in a single query
• APPX_MEDIAN()
  • returns a value that is approximately the median (midpoint) of values in the set of input values

Approx. functions example

    [localhost:21000] > select min(x), max(x), avg(x) from million_numbers;
    +-------------------+-------------------+-------------------+
    | min(x)            | max(x)            | avg(x)            |
    +-------------------+-------------------+-------------------+
    | 4.725693727250069 | 49994.56852674231 | 24945.38563793553 |
    +-------------------+-------------------+-------------------+
    [localhost:21000] > select appx_median(x) from million_numbers;
    +----------------+
    | appx_median(x) |
    +----------------+
    | 24721.6        |
    +----------------+

CREATE TABLE … LIKE PARQUET

• CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file'
• The column names and data types are automatically configured based on the Parquet data file

ORDER BY without LIMIT

• LIMIT clause is now optional for queries that use the ORDER BY clause
• Impala automatically uses a temporary disk work area to perform the sort if the sort operation would otherwise exceed the Impala memory limit for a particular data node.

DECODE()

    SELECT event, DECODE(day_of_week, 1, "Monday", 2, "Tuesday", 3, "Wednesday", 4, "Thursday", 5, "Friday", 6, "Saturday", 7, "Sunday", "Unknown day")
      FROM calendar;
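The DECODE() call above pairs each search value with a result and ends with a default. A rough Python sketch of that lookup logic follows. The decode helper is hypothetical, and returning None when there is no match and no default is an assumption made for this illustration.

```python
def decode(expr, *args):
    """Sketch of DECODE(expr, search1, result1, ..., [default])."""
    # An odd number of trailing arguments means the last one is the default.
    pairs, default = (args[:-1], args[-1]) if len(args) % 2 else (args, None)
    # Walk the (search, result) pairs in order and return the first match.
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result
    return default  # assumption: no match and no default yields None

days = (1, "Monday", 2, "Tuesday", 3, "Wednesday", 4, "Thursday",
        5, "Friday", 6, "Saturday", 7, "Sunday", "Unknown day")
print(decode(3, *days))
print(decode(9, *days))
```

The second call falls through every pair and returns the trailing default, as the slide's "Unknown day" argument does.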

ANTI JOIN

LEFT ANTI JOIN / RIGHT ANTI JOIN are supported in Impala 2.0

    [localhost:21000] > create table t1 (x int);
    [localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);

    [localhost:21000] > create table t2 (y int);
    [localhost:21000] > insert into t2 values (2), (4), (6);

    [localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
    +---+
    | x |
    +---+
    | 1 |
    | 3 |
    | 5 |
    +---+
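A left anti join keeps the rows of the left table that have no match in the right table. As a hedged illustration, the sketch below expresses the same result with NOT EXISTS and runs it through Python's sqlite3 module. SQLite is only a stand-in here, since it has no LEFT ANTI JOIN syntax. The tables mirror t1 and t2 from the slide.

```python
import sqlite3

# SQLite stand-in for the slide's t1 / t2 tables (not an Impala connection).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INT);
    INSERT INTO t1 VALUES (1), (2), (3), (4), (5), (6);
    CREATE TABLE t2 (y INT);
    INSERT INTO t2 VALUES (2), (4), (6);
""")

# NOT EXISTS keeps each t1 row for which no t2 row satisfies t1.x = t2.y,
# which gives the same rows as the anti join shown on the slide.
anti = [r[0] for r in conn.execute("""
    SELECT x FROM t1
    WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.y = t1.x)
    ORDER BY x""")]
print(anti)
```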

new data types

• DECIMAL (Impala 1.4)
  • column_name DECIMAL[(precision[,scale])]
  • with no precision or scale values is equivalent to DECIMAL(9,0)
• VARCHAR (Impala 2.0)
  • STRING with a max length
• CHAR (Impala 2.0)
  • STRING with a precise length

new built-in functions

• EXTRACT(): returns one date or time field from a TIMESTAMP value
• TRUNC(): truncates date/time values to year, month, etc.
• ADD_MONTHS(): alias for MONTHS_ADD()
• ROUND(): rounds DECIMAL values
• for computing properties for statistical distributions
  • STDDEV()
  • STDDEV_SAMP() / STDDEV_POP()
  • VARIANCE()
  • VARIANCE_SAMP() / VARIANCE_POP()
• MAX_INT() / MIN_SMALLINT()
• IS_INF() / IS_NAN()

SHOW PARTITIONS

    [localhost:21000] > show partitions census;
    +-------+-------+--------+------+---------+
    | year  | #Rows | #Files | Size | Format  |
    +-------+-------+--------+------+---------+
    | 2000  | -1    | 0      | 0B   | TEXT    |
    | 2004  | -1    | 0      | 0B   | TEXT    |
    | 2008  | -1    | 0      | 0B   | TEXT    |
    | 2010  | -1    | 0      | 0B   | TEXT    |
    | 2011  | 4     | 1      | 22B  | TEXT    |
    | 2012  | 4     | 1      | 22B  | TEXT    |
    | 2013  | 1     | 1      | 231B | PARQUET |
    | Total | 9     | 3      | 275B |         |
    +-------+-------+--------+------+---------+

SUMMARY

• impala-shell command
• easy-to-digest overview of the timings for the different phases of execution for a query

    [localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
    +---------------------+
    | avg(ss_sales_price) |
    +---------------------+
    | 37.80770926328327   |
    +---------------------+
    [localhost:21000] > summary;
    +--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
    | Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail          |
    +--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
    | 03:AGGREGATE | 1      | 1.03ms   | 1.03ms   | 1     | 1          | 48.00 KB | -1 B          | MERGE FINALIZE  |
    | 02:EXCHANGE  | 1      | 0ns      | 0ns      | 1     | 1          | 0 B      | -1 B          | UNPARTITIONED   |
    | 01:AGGREGATE | 1      | 30.79ms  | 30.79ms  | 1     | 1          | 80.00 KB | 10.00 MB      |                 |
    | 00:SCAN HDFS | 1      | 5.45s    | 5.45s    | 2.21M | -1         | 64.05 MB | 432.00 MB     | tpc.store_sales |
    +--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+

SET statement

• Before Impala 2.0, SET could be used only in impala-shell
• In Impala 2.0, you can use SET in client applications through the JDBC / ODBC APIs.

Resource Management and Security

Admission Control (Impala 1.3)

• Fast and lightweight resource management mechanism
• avoids oversubscription of resources for concurrent workloads
• queries are queued when reaching configurable limits
• Runs on every impalad
  • no SPOF
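The queueing behaviour described above can be pictured with a toy model. This is a hypothetical sketch, not Impala's implementation: the AdmissionController class, its max_running limit, and the FIFO admission order are all assumptions made for illustration.

```python
from collections import deque

class AdmissionController:
    """Toy model: run up to max_running queries and queue the rest (hypothetical)."""

    def __init__(self, max_running):
        self.max_running = max_running  # stands in for a configurable limit
        self.running = set()
        self.queued = deque()

    def submit(self, query_id):
        # Admit immediately while below the limit, otherwise queue the query.
        if len(self.running) < self.max_running:
            self.running.add(query_id)
            return "admitted"
        self.queued.append(query_id)
        return "queued"

    def finish(self, query_id):
        self.running.discard(query_id)
        # A slot is free now, so admit the longest-waiting query, if any.
        if self.queued and len(self.running) < self.max_running:
            self.running.add(self.queued.popleft())

ctl = AdmissionController(max_running=2)
states = [ctl.submit(q) for q in ("q1", "q2", "q3")]
ctl.finish("q1")
print(states, sorted(ctl.running))
```

In this model the third query waits until one of the first two finishes, which is the oversubscription-avoidance idea in miniature.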

YARN and Llama

• Llama: Low Latency Application MAster
  • Subdivides coarse-grain YARN scheduling into finer granularity for low-latency and short-lived queries
• Llama registers one long-lived AM per YARN pool
• Llama caches resources allocated by YARN for a short time, so that they can be quickly re-allocated to Impala queries
  • much faster than waiting for YARN
• Impala 1.4: GA. Llama HA support

Query Timeout

• A new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries
• Note: The timeout clock for queries and sessions only starts ticking when the query or session is idle

Security

• Impala 2.0 can accept either kind of authentication request
  • e.g., host A with Kerberos and host B with LDAP
• Security-related statements
  • GRANT
  • REVOKE
  • CREATE ROLE
  • DROP ROLE
  • SHOW ROLES
  • SHOW ROLE GRANT
• --disk_spill_encryption option

Others

Text + gzip, bzip2, and Snappy

• In Impala 2.0 and later, Impala supports using text data files that employ gzip, bzip2, or Snappy compression
• use ROW FORMAT with delimiter and escape character to create table

    CREATE TABLE csv_compressed (a STRING, b STRING, c STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
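For illustration, a gzip-compressed, comma-delimited text file matching the three STRING columns above could be produced as follows. This is a minimal sketch using only the Python standard library. The file name and sample rows are made up, and copying the file into the table's data directory is outside the scope of the sketch.

```python
import csv
import gzip
import os
import tempfile

# Made-up sample rows for the (a, b, c) columns of csv_compressed.
rows = [("a1", "b1", "c1"), ("a2", "b2", "c2")]
path = os.path.join(tempfile.mkdtemp(), "data.csv.gz")  # hypothetical file name

# Write comma-delimited text through gzip; newline="" lets csv control line endings.
with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
    csv.writer(f, lineterminator="\n").writerows(rows)

# Read the file back to confirm it round-trips as plain delimited text.
with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
    round_trip = [tuple(r) for r in csv.reader(f)]
print(round_trip)
```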

impala-shell

• UTF-8 support (1.4)
• .impalarc file (2.0)

    [impala]
    verbose=true
    default_db=tpc_benchmarking
    write_delimited=true
    output_delimiter=,
    output_file=/home/tester1/benchmark_results.csv
    show_profiles=true
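The .impalarc example above is an INI-style file with a single [impala] section. As a sketch of how such a file can be inspected, the snippet below parses the same text with Python's configparser. Treating the file as configparser-compatible is an assumption made here and is not a statement about how impala-shell itself reads it.

```python
import configparser

# The same settings as on the slide, inlined so the sketch is self-contained.
impalarc_text = """\
[impala]
verbose=true
default_db=tpc_benchmarking
write_delimited=true
output_delimiter=,
output_file=/home/tester1/benchmark_results.csv
show_profiles=true
"""

parser = configparser.ConfigParser()
parser.read_string(impalarc_text)

# Collect the [impala] section as a plain dict of option name -> string value.
settings = dict(parser["impala"])
print(settings["default_db"], parser.getboolean("impala", "verbose"))
```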

Documentation

• Cluster Sizing Guidelines for Impala
  • http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala_cluster_sizing.html
