effective hive queries

14
Effec%ve Hive Queries Secrets From the Pros We will be star,ng at 11:03 PDT Use the Chat Pane in GoToWebinar to Ask Ques%ons! Assess your level and learn new stuff This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts) ?

Upload: qubole

Post on 31-Oct-2014

355 views

Category:

Technology


1 download

DESCRIPTION

A session from Qubole Best Practice Webinar Series- “Big Data Secrets from the Pros”. Covers how to make Apache Hive queries run faster by a. Better layout of data on HDFS via partitioning and bucketing b. Designing test queries by using block and bucket sampling before running the queries on large datasets c. Using bucket map joins and parallel processing to run queries faster Visit www.qubole.com for more information.

TRANSCRIPT

Page 1: Effective Hive Queries

Effec%ve  Hive  Queries                                Secrets  From  the  Pros  

We  will  be  star,ng  at  11:03  PDT  Use  the  Chat  Pane  in  GoToWebinar  to  Ask  Ques%ons!  

Assess  your  level  and  learn  new  stuff  This  webinar  is  intended  for  intermediate  audiences  (familiar  with  Apache  Hive  and  Hadoop,  but  not  experts)  ?  

Page 2: Effective Hive Queries

AGENDA  This  Webinar  provides  %ps  on  improving  the  performance  and  beJer  u%lizing  resources  using  the  following  best  prac%ces:  

•  Data  Layout  (Par%%ons  and  Buckets)  •  Data  Sampling  (Bucket  and  Block  sampling)  •  Data  Processing  (Bucket  Map  Join  and  Parallel  execu%on)  

Page 3: Effective Hive Queries

Dataset  Used  

#  of  records:    276M  records  Columns:  I%nerary  ID  Year  &Quarter  of  Travel  Trip  Origin  City  &  State  Trip  Des%na%on  City  &  State  Distance  between  Origin  &  Des%na%on  

Airline  Bookings  All  Includes  stops  at  intermediate  ci%es  

#  of  records:    116M  records  Columns:  I%nerary  ID  Year  &Quarter  of  Travel  Trip  Origin  City  &  State  Trip  Des%na%on  City  &  State  Distance  between  Origin  &  Des%na%on  

Airline  Bookings  Origin  Only  Only  first  leg  of  travel  

#  of  records:    50  Columns:  State  code  &  Name  Popula%on  

Census  Human  popula%on  by  US  State  

Page 4: Effective Hive Queries

#1  -­‐  Data  Par%%oning    •  Problem  PaJern  

–  Query  a  subset  of  data  in  a  table  –  Subset  iden%fied  by  “Column_Name  =  X”  filter  

•  Solu%on  paJern  –  Layout  data  in  sub-­‐directories  with  each  directory  associated  with  a  value  of  the  par%%on  column  

–  The  filter  on  par%%on  column  just  picks  a  single  sub  directory  •  Approach  

–  Use  PARTITION  BY  clause  •  Benefit  

–  Par%%on  pruning  –  2.7x  faster  on  a  query  on  Airline  Bookings  Dataset  (29  seconds)  

Page 5: Effective Hive Queries

#1  -­‐  Data  Par%%oning  

Airline  Bookings  All  Table  

Origin  State  (Par%%on  Column  /  Sub-­‐directory)   CA   WY  AL  

File1001.dat  

File1002.dat  

File100n.dat  

File3001.dat  

File3002.dat  

File300n.dat  

Filex001.dat  

Filex002.dat  

Filex00n.dat  

Files  inside  the  par%%on  

SELECT  origin_city,  origin_state  FROM  Airline_Bookings_All  WHERE  origin_state  =  ‘CA’  

CREATE  TABLE  Airline_Bookings_All  ….  PARTITIONED  BY  (origin_state  STRING)  

Page 6: Effective Hive Queries

#2  -­‐  Data  Bucke%ng  

•  Problem  PaJern  –  Join  data  in  two  large  tables  efficiently  –  Sample  data  inside  a  table  efficiently  

•  Solu%on  paJern  – More  efficient  processing  by  storing  data  in  hash  buckets  

•  Approach  •  Use  bucke%ng  using  CLUSTERED  BY  ..  INTO  n  BUCKETS  

•  Benefit  –  Bucket  Map  Join  –  Bucket  Sampling  

Page 7: Effective Hive Queries

#2  –  Data  Bucke%ng  CREATE  TABLE  Airline_Bookings_All  …  CLUSTERED  BY  (i%nid)  INTO  64  BUCKETS  

set  hive.enforce.bucke%ng  =  true;  INSERT  OVERWRITE  TABLE  Airline_Bookings_All  SELECT  …  FROM  ..  

Ailrine_Bookings_All  

File00.dat  

File63.dat  

File01.dat  Each  File  contains  all  

the  rows  that  correspond  to  the  same  hash  of  i%nid  

column  

Page 8: Effective Hive Queries

#2  -­‐  Data  Bucke%ng  

a  

File1001.dat  

File1002.dat  

File100n.dat  

Filex001.dat  

Filex002.dat  

Filex00m.dat  

Files  containing  table  data  bucketed  on  a  

column  

b  

set  hive.op%mize.bucketmapjoin  =  true;    SELECT  /*+  MAPJOIN(a,  b)  */  a.*,  b.*  FROM  Airline_Bookings_All  a  JOIN  Airline_Bookings_Origin_Only  b  ON  a.i%nid  =  b.i%nid  

Note:    1.  Both  the  tables  are  bucketed  on  i%nid  column  2.  The  numbers  of  buckets  in  the  two  tables  are  a  strict  mul%ple  of  each  other  

Page 9: Effective Hive Queries

#3  -­‐  Bucket  Sampling  

•  Problem  PaJern  – Work  on  joinable  samples  of  data  from  different  tables  

•  Solu%on  paJern  –  Use  Bucket  Sampling  

•  Approach  •  TABLESAMPLE  (BUCKET  x  OUT  OF  Y  ON  column)  

•  Benefit  –  Useful  while  working  with  sample  data  and  joins  

Page 10: Effective Hive Queries

#3  -­‐  Bucket  Sampling  

Filex002.dat  

Filex030.dat  

Filex064.dat  

Files  containing  bookings  data  bucketed  on  i%nid  

a  

SELECT  a.*,  b.*  FROM  Airline_Bookings_All  TABLESAMPLE(bucket  30  out  of  64  on  i%nid)  a                      ,  Airline_Bookings_Origin_Only  TABLESAMPLE(bucket  30  out  of  64  on  i%nid)  b  WHERE  a.i%nid  =  b.i%nid  

Filex001.dat  

Filex063.dat  

Filey002.dat  

Filey030.dat  

Filey064.dat  

b  

Filey001.dat  

Filey063.dat  

Page 11: Effective Hive Queries

#4  –  Block  Sampling  •  Problem  PaJern  

–  View  a  sample  of  a  data  with  in  a  table  –  Sample  size  expressed  as  number  of  rows,  %age  of  data,  or  number  of  MBs  

•  Solu%on  paJern  –  Use  Block  sampling  

•  Approach  –  Use  TABLESAMPLE  (n%,  nM,  or  n  ROWS)  

•  Benefit  –  Geyng  a  random  sample  from  the  table  –  More  op%ons  to  specify  how  many  samples  to  generate  

Page 12: Effective Hive Queries

#5  –  Parallel  Execu%on  

SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct FROM (

SELECT itinid, year, quarter, origin, originstate FROM air_travel_bookings_8

)a JOIN (

SELECT itinid, origin, originstate FROM air_travel_origins_8

)B ON ( A.itinid = b.itinid and a.origin = b.origin and a.originstate = b.originstate) GROUP BY

a.year, a.quarter, a.origin, a.originstate;

Stage  1  

Stage  2  

Stage  3  

Stage  1  

Stage  2  

Stage  3  

Stage  1   Stage  2  

Stage  3  

set  hive.exec.parallel  =  false;  

set  hive.exec.parallel  =  true;  

Page 13: Effective Hive Queries

Summary  

•  Iterate  quickly  on  Query  Design  – Use  Bucket  and  Block  Sampling  

•  Run  queries  faster  – Par%%oning  to  invoke  Par%%on  Pruning  – Bucke%ng  to  invoke  Bucket  Map  Joins  – Execute  complex  queries  in  parallel  

Page 14: Effective Hive Queries

THANK  YOU  

Managed  Cluster   Built-­‐In  Connectors   Friendly  User-­‐Interface   Dedicated  Support  

•  100%  Managed  Hadoop  Cluster  in  the  Cloud  •  Auto-­‐Scaling  Cluster.  Full  Life-­‐cycle  Management  •  +12  Connectors  to  Applica%ons  and  Data  Sources  •  14-­‐Day  Free  Trial  (free  account  available)  •  24/7  Customer  Support  

What’s  Included?  

è  www.qubole.com/try  ç