an in-depth look at putting the sting in hive

16
Putting the Sting in Hive Page 1 Alan F. Gates @alanfgates

Upload: hadoop-summit

Post on 04-Jul-2015

1.159 views

Category:

Technology


3 download

DESCRIPTION

Apache Hive is the most widely used SQL interface for Hadoop. As Hadoop usage continues its explosive growth, Hive`s performance and features do not meet the requirements and expectation of many users. This includes answering queries in human time (less than 30 seconds) and support for common analytics operations. The Hive community has risen to the challenge. Work is being done to drive down start up time of a Hive query, extend Hive to work on Tez (a Hadoop execution environment that is much faster than MapReduce), make Hive operators process records at 10x more than their current speed, add support for analytics and windowing functions such as RANK, NTILE, LEAD, LAG, etc., and add support to Hive for standard SQL datatypes. This talk will discuss the design and code changes that have been done as well as look at ongoing work and additional optimizations and features that could be added in the future.

TRANSCRIPT

Page 1: An In-Depth Look at Putting the Sting in Hive

Putting the Sting in

Hive

Page 1

Alan F. Gates

@alanfgates

Page 2: An In-Depth Look at Putting the Sting in Hive

Stinger Overview

Page 2

•An initiative, not a project or product

• Includes changes to Hive and a new project Tez

•Two main goals

–Improve Hive performance 100x over Hive 0.10

–Extend Hive SQL to include features needed for

analytics

•Hive will support:

–BI tools connecting to Hadoop

–Analysts performing ad-hoc, interactive queries

–Still excellent at the large batch jobs it is used for today

© 2013 Hortonworks

Page 3: An In-Depth Look at Putting the Sting in Hive

Stinger Mileposts

Page 3© 2013 Hortonworks

Stinger Phase 3• Buffer Cache• Cost Based

Optimizer

Stinger Phase 2

• YARN Resource Mgmnt• Hive on Apache Tez• Query Service• Vectorized Operators

Stinger Phase 1

• Base Optimizations• SQL Analytics• ORCFile Format

1 2Improve existing tools & preserve

investments

Enable Hive to support interactive

workloads

Released in

Hive 0.11

Current

WorkRoadmap

Page 4: An In-Depth Look at Putting the Sting in Hive

Hive Performance Gains in 0.11

Page 4© 2013 Hortonworks

• Enable star joins by improving Hive’s map join (aka

broadcast join)

–Where possible do in single map only task

–When not possible push larger tables to separate tasks

• Collapse adjacent jobs where possible

–Hive has lots of M->MR type plans, collapse these to MR

–Collapse adjacent jobs on sufficiently similar keys when

feasible

– join followed by group

– join followed by order

– group followed by order

• Improvements in sort merge bucket (SMB) joins

Page 5: An In-Depth Look at Putting the Sting in Hive

Page ‹#›

Page 6: An In-Depth Look at Putting the Sting in Hive

© Hortonworks Inc. 2013

Before

Page 6

Page 7: An In-Depth Look at Putting the Sting in Hive

© Hortonworks Inc. 2013

After

Page 7

Page 8: An In-Depth Look at Putting the Sting in Hive

© Hortonworks Inc. 2013

Improvements in SMB Joins

• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)

3257.692

2862.669

255.641

71.114

0

500

1000

1500

2000

2500

3000

3500

Query 82

Text

RCFile

Partitioned RCFile

Partitioned RCFile + Optimizations

Page 8

Page 9: An In-Depth Look at Putting the Sting in Hive

New Technologies in Hive

Page 9© 2013 Hortonworks

• All covered in depth in other talks

– See Owen’s, Eric’s, and Jitendra’s talk ORC File & Vectorization at 4:25 today

• Tez – A new execution engine for relational tools such as Hive

– No need to use MapReduce, instead provides general DAG execution

– Data moved between tasks via socket, disk, or HDFS based on performance / re-startability trade off

– Provides standing service to greatly reduce query start time

• ORCFile – A rewrite of RCFile

– Columnar

– Tightly integrated with Hive’s type model, including support for nested types

– Much better compression

– Supports projection and filter push down

• Vectorization – Rewriting operators to take advantage of modern processors

– Based on work done in MonetDB

– Rewrite operators to radically reduce number of function calls, branch prediction misses, and cache misses

Page 10: An In-Depth Look at Putting the Sting in Hive

© Hortonworks Inc. 2013

Standard Queries

Page 10

260

165

38

77

142

296

38 42

67

80

0

50

100

150

200

250

300

Query 27Scale 200

Query 82Scale 200

Query 27Scale 1000

Query 82Scale 1000

Query 27 Star JoinQuery 82 Fact Table Join

Hive 0.10, RC File

Hive 0.11 CP, RC File

Hive 0.11 CP, ORC File

Page 11: An In-Depth Look at Putting the Sting in Hive

© Hortonworks Inc. 2013

Performance Trajectory

Page 11

1X2X

12X11X

21X

0X

5X

10X

15X

20X

25X

Hive 10Text

Hive 10RC

Hive 11RC

Hive 11ORC

Hive 11 CPORC, Tez…

Query 27 Speedup

1X

14X

44X

57X

78X

0X

10X

20X

30X

40X

50X

60X

70X

80X

90X

Hive 10Text

Hive 10RC

Hive 11RC

Hive 11ORC

Hive 11 CPORC, Tez

Query 82 Speedup

Page 12: An In-Depth Look at Putting the Sting in Hive

© Hortonworks Inc. 2013

Query 12 – Demonstrating MRR

Page 12

55 54

75

65

35 34

55

46

0

10

20

30

40

50

60

70

80

RC File Scale 200

ORC File Scale 200

RC File Scale 1000

ORC File Scale 1000

Elap

sed

Tim

e (s

eco

nd

s)

Query 12 - MRR Optimization

Traditional Map-Reduce

Tez MapReduce Reduce

Page 13: An In-Depth Look at Putting the Sting in Hive

Hive Performance Up Next

Page 13© 2013 Hortonworks

• Push down start up time - even for queries that spend less than a

second running on the cluster, there is ~15 seconds of start up time

– Tez service will remove Hadoop startup issues

– Need to reduce time for the metadata access

– Need intelligent file caching so that hot tables can be kept in memory

• Keep working on the optimizer

– Y Smart work from Ohio State University

– Start using statistics to make intelligent decisions about how many mappers and

reducers to spawn – maybe in Hive, maybe in Tez

– Start using statistics to choose between competing plan options

• Buffer Cache

– Coordinate with HDFS team to determine caching strategy

Page 14: An In-Depth Look at Putting the Sting in Hive

Extending Hive SQL in 0.11

Page 14© 2013 Hortonworks

• DECIMAL data type – for fixed precision calculation (e.g. currency)

• OVER clause

– PARTITION BY, ORDER BY, ROWS

BETWEEN/FOLLOWING/PRECEDING

– Works with existing aggregate functions

– New analytic and window functions added

– ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, LEAD, FIRST_VALUE

, LAST_VALUE, NTILE, CUME_DIST, PERCENT_RANK

SELECT salesperson, AVG(salesprice) OVER

(PARTITION BY region ORDER BY date

ROWS BETWEEN 10 PRECEEDING AND 10 FOLLOWING)

FROM sales;

Page 15: An In-Depth Look at Putting the Sting in Hive

Extending Hive SQL Post 0.11

Page 15© 2013 Hortonworks

• Subqueries in WHERE

– Non-correlated first

– [NOT] IN first, then extend to (in)equalities and EXISTS

• Datatype conformance – Hive has Java type model, add support for

SQL types:

– DATE

– CHAR() and VARCHAR()

– add precision and scale to decimal and float

– aliases for standard SQL types (BLOB = binary, CLOB = string, integer =

int, real/number = decimal)

• Security

– Add security checks to views, indices, functions, etc.

– Secure GRANT and REVOKE

Page 16: An In-Depth Look at Putting the Sting in Hive

Questions

Page 16© 2013 Hortonworks