Overview of the Hive Stinger Initiative Eric N. Hanson Principal Software Development Engineer Microsoft HDInsight Team 30 June 2014

Upload: hadoop-user-group-france

Post on 10-May-2015

DESCRIPTION

Dr. Eric N. Hanson, Principal Software Development Engineer at Microsoft and Apache Hive committer, presents the recent improvements in Hive.

TRANSCRIPT

Page 1: Overview of the Hive Stinger Initiative

Overview of the Hive Stinger Initiative

Eric N. Hanson

Principal Software Development Engineer

Microsoft HDInsight Team

30 June 2014

Page 2: Overview of the Hive Stinger Initiative

What is Stinger? Umbrella term for…

• Faster query in Hive
  – ORC
  – Vectorization
  – Tez

• Better language features for analysis
  – Window functions etc.

Page 3: Overview of the Hive Stinger Initiative

Why Stinger?

• Hive has good functionality

• But it started out sloooowww

• Need to speed it up
  – keep it competitive
  – make it fun to use

Page 4: Overview of the Hive Stinger Initiative

ORC

• A good columnstore format

• Run length encoding, value encoding, dictionary encoding

• Layers stream compression over the top

• Written by Owen O’Malley

• http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
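The run-length encoding mentioned above can be illustrated with a minimal sketch. The class and method names are hypothetical, and the list of {value, run length} pairs is a layout chosen for clarity; it is not ORC's actual on-disk encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative run-length encoder for a column of long values.
// Assumption: this is NOT ORC's real format, only the underlying idea.
class RunLengthSketch {

    // Encodes the column as a list of {value, runLength} pairs.
    static List<long[]> encode(long[] values) {
        List<long[]> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            long v = values[i];
            int len = 1;
            // Extend the run while the next value repeats.
            while (i + len < values.length && values[i + len] == v) {
                len++;
            }
            runs.add(new long[] {v, len});
            i += len;
        }
        return runs;
    }

    // Expands the {value, runLength} pairs back into the original column.
    static long[] decode(List<long[]> runs) {
        int total = 0;
        for (long[] r : runs) {
            total += (int) r[1];
        }
        long[] out = new long[total];
        int pos = 0;
        for (long[] r : runs) {
            for (int k = 0; k < r[1]; k++) {
                out[pos++] = r[0];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] column = {7, 7, 7, 7, 3, 3, 9};
        for (long[] r : encode(column)) {
            System.out.println(r[0] + " x " + r[1]);
        }
    }
}
```

Columns with long runs of repeated values (sorted keys, flags, dates) shrink the most under this kind of scheme, which is one reason a column-oriented layout tends to compress well.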

Page 5: Overview of the Hive Stinger Initiative

Using ORC

• create table Tbl (col int) stored as orc;

• orc.compress default ZLIB

• See http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit

Page 6: Overview of the Hive Stinger Initiative

TPC-DS File Sizes

*Courtesy of Hortonworks

Page 7: Overview of the Hive Stinger Initiative

Vectorization

Page 8: Overview of the Hive Stinger Initiative

How the code works (simplified)

class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  void evaluate(VectorizedRowBatch batch) {
    long[] inVector = ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector = ((LongColumnVector) batch.columns[outputColumn]).vector;

    if (batch.selectedInUse) {
      // Only the rows listed in batch.selected are processed.
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // All rows 0..batch.size-1 are processed.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}

• No method calls
• Low instruction count
• Cache locality to 1024 values
• No pipeline stalls
• SIMD in Java 8
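To try the loop above outside Hive, here is a self-contained sketch. MiniBatch and MiniLongColumn are hypothetical stand-ins for Hive's VectorizedRowBatch and LongColumnVector, trimmed to the fields the slide uses; the batch capacity of 1024 is taken from the slide.

```java
import java.util.Arrays;

// Hypothetical stand-in for a column of long values.
class MiniLongColumn {
    final long[] vector;

    MiniLongColumn(int capacity) {
        vector = new long[capacity];
    }
}

// Hypothetical stand-in for a row batch: a set of columns plus a selection vector.
class MiniBatch {
    static final int DEFAULT_SIZE = 1024;
    final MiniLongColumn[] columns;
    final int[] selected = new int[DEFAULT_SIZE];
    boolean selectedInUse = false;
    int size = 0;

    MiniBatch(int numColumns) {
        columns = new MiniLongColumn[numColumns];
        for (int c = 0; c < numColumns; c++) {
            columns[c] = new MiniLongColumn(DEFAULT_SIZE);
        }
    }
}

class AddScalarSketch {
    // Adds scalar to inputColumn and stores the result in outputColumn.
    static void evaluate(MiniBatch batch, int inputColumn, int outputColumn, long scalar) {
        long[] in = batch.columns[inputColumn].vector;
        long[] out = batch.columns[outputColumn].vector;
        if (batch.selectedInUse) {
            // Touch only the row indexes listed in the selection vector.
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                out[i] = in[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                out[i] = in[i] + scalar;
            }
        }
    }

    public static void main(String[] args) {
        MiniBatch batch = new MiniBatch(2);
        for (int i = 0; i < 4; i++) {
            batch.columns[0].vector[i] = i * 10;
        }
        // Pretend an earlier filter kept only rows 1 and 3.
        batch.selectedInUse = true;
        batch.selected[0] = 1;
        batch.selected[1] = 3;
        batch.size = 2;
        evaluate(batch, 0, 1, 5);
        System.out.println(Arrays.toString(Arrays.copyOf(batch.columns[1].vector, 4)));
    }
}
```

The selection vector lets a filter mark surviving rows without copying data: later expressions simply skip the row indexes that are not listed.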

Page 9: Overview of the Hive Stinger Initiative

Vectorization and Compilation

• Vectorization “instructions” generated from templates

• Examples:
  – Int add col-col
  – Int add col-scalar
  – Int add scalar-col
  – Double add col-col
  – Double add col-scalar
  – Double add scalar-col
  – And hundreds more!

• Pre-compilation of expressions

• Reduces # of function calls and instructions at runtime

• Expressions like (a + 2) / b are interpreted with these primitives
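One way to picture how (a + 2) / b might be evaluated with such primitives is a two-step plan that writes into scratch columns. This is a sketch under assumptions: the VectorOp interface, the column numbering, and the use of double arrays are all hypothetical and much simpler than Hive's real expression classes.

```java
// Sketch: composing (a + 2) / b from two vector primitives.
// All names here are hypothetical, not Hive's API.
class ComposeSketch {

    // A primitive that processes the first n rows of a set of columns.
    interface VectorOp {
        void evaluate(double[][] columns, int n);
    }

    // Primitive "add col-scalar": columns[out] = columns[in] + scalar.
    static VectorOp addScalar(int in, double scalar, int out) {
        return (cols, n) -> {
            for (int i = 0; i < n; i++) {
                cols[out][i] = cols[in][i] + scalar;
            }
        };
    }

    // Primitive "divide col-col": columns[out] = columns[left] / columns[right].
    static VectorOp divideColumns(int left, int right, int out) {
        return (cols, n) -> {
            for (int i = 0; i < n; i++) {
                cols[out][i] = cols[left][i] / cols[right][i];
            }
        };
    }

    // Evaluates (a + 2) / b. Column 0 = a, 1 = b, 2 = scratch, 3 = result.
    static double[] eval(double[] a, double[] b) {
        int n = a.length;
        double[][] cols = {a, b, new double[n], new double[n]};
        // The "pre-compiled" plan: one primitive per operator, run in order.
        VectorOp[] plan = {addScalar(0, 2.0, 2), divideColumns(2, 1, 3)};
        for (VectorOp op : plan) {
            op.evaluate(cols, n);
        }
        return cols[3];
    }

    public static void main(String[] args) {
        double[] result = eval(new double[] {2, 4}, new double[] {2, 3});
        System.out.println(java.util.Arrays.toString(result));
    }
}
```

The plan is built once per query, so the per-row work is only the tight loops inside each primitive.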

Page 10: Overview of the Hive Stinger Initiative

Example of vectorized template code

} else {
  if (batch.selectedInUse) {
    for (int j = 0; j != n; j++) {
      int i = sel[j];
      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
    }
  } else {
    for (int i = 0; i != n; i++) {
      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
    }
  }
}
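As a rough illustration of how a template like the one above could be turned into many concrete classes, the sketch below uses plain string substitution. Only <OperatorSymbol> comes from the slide; the <ClassName> and <OperandType> placeholders, the template text, and the generated class names are assumptions for illustration, not Hive's actual build tooling.

```java
// Sketch: expanding a code template into concrete source text.
// Placeholder names other than <OperatorSymbol> are hypothetical.
class TemplateSketch {

    static final String TEMPLATE =
        "class <ClassName> {\n"
        + "  static void evaluate(<OperandType>[] vector1, <OperandType>[] vector2,\n"
        + "                       <OperandType>[] outputVector, int n) {\n"
        + "    for (int i = 0; i != n; i++) {\n"
        + "      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];\n"
        + "    }\n"
        + "  }\n"
        + "}\n";

    // Fills in the placeholders to produce the source of one specialized class.
    static String expand(String className, String operandType, String operatorSymbol) {
        return TEMPLATE
            .replace("<ClassName>", className)
            .replace("<OperandType>", operandType)
            .replace("<OperatorSymbol>", operatorSymbol);
    }

    public static void main(String[] args) {
        // One template yields a class per (type, operator) combination.
        System.out.println(expand("LongColAddLongCol", "long", "+"));
        System.out.println(expand("DoubleColMultiplyDoubleCol", "double", "*"));
    }
}
```

Generating one class per type and operator combination keeps each inner loop free of branching on type or operator at run time.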

Page 11: Overview of the Hive Stinger Initiative

Using vectorization in Hive

• set hive.vectorized.execution.enabled = true;

• Run query over ORC

• Only works for scalar types

• https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

• ~5X CPU reduction

Page 12: Overview of the Hive Stinger Initiative

Apache Tez (“Speed”)

• Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
  – Lower latency for interactive queries
  – Higher throughput for batch queries
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft

YARN ApplicationMaster to run DAG of Tez Tasks

Task with pluggable Input, Processor and Output

[Figure: Tez Task - <Input, Processor, Output>, drawn as a Task box containing Input, Processor and Output]

*Courtesy of Hortonworks
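The idea of a task with pluggable Input, Processor and Output can be sketched with three small interfaces. These interfaces and their method names are hypothetical and are not the Tez API; the sketch only shows how the three parts can be swapped independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of a task assembled from pluggable parts.
// Assumption: these interfaces are illustrative, not Tez's real classes.
class TaskSketch {

    interface Input {
        Iterator<String> read();
    }

    interface Output {
        void write(String record);
    }

    interface Processor {
        void run(Input in, Output out);
    }

    // A task is just one Input, one Processor and one Output wired together.
    static void runTask(Input in, Processor processor, Output out) {
        processor.run(in, out);
    }

    static List<String> demo() {
        List<String> source = Arrays.asList("b", "a", "c");
        List<String> sink = new ArrayList<>();

        Input in = source::iterator;   // could instead read from a file system
        Output out = sink::add;        // could instead sort or shuffle records
        Processor upperCase = (i, o) -> {
            Iterator<String> it = i.read();
            while (it.hasNext()) {
                o.write(it.next().toUpperCase());
            }
        };

        runTask(in, upperCase, out);
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because each part is chosen separately, the same processor logic can sit behind different inputs and outputs, which is what allows stages to be chained into a graph rather than forced into fixed map and reduce slots.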

Page 13: Overview of the Hive Stinger Initiative

Tez: Building blocks for scalable data processing

• Classical ‘Map’: Map Processor, with HDFS Input and Sorted Output
• Classical ‘Reduce’: Reduce Processor, with Shuffle Input and HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Reduce Processor, with Shuffle Input and Sorted Output

*Courtesy of Hortonworks

Page 14: Overview of the Hive Stinger Initiative

Hive-on-MR vs. Hive-on-Tez

SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a
UNION SELECT x, AVERAGE(y) AS AVG
FROM c GROUP BY x
ORDER BY AVG;

[Figure: two execution plans side by side, labelled Hive – MR and Hive – Tez. Stage labels in the MR plan: SELECT a.state; SELECT c.price; SELECT b.id; JOIN (a, c); JOIN (a, b) GROUP BY a.state; COUNT(*), AVERAGE(c.price). Stage labels in the Tez plan: SELECT a.state, c.itemId; SELECT b.id; JOIN (a, c); JOIN (a, b) GROUP BY a.state; COUNT(*), AVERAGE(c.price). The MR plan is drawn as groups of M and R nodes with three HDFS nodes; the Tez plan is drawn as M and R nodes only.]

Tez avoids unneeded writes to HDFS

*Courtesy of Hortonworks

Page 15: Overview of the Hive Stinger Initiative

Tez Sessions

… because Map/Reduce query startup is expensive

• Tez Sessions
  – Hot containers ready for immediate use
  – Removes task and job launch overhead (~5s – 30s)

• Hive
  – Session launch/shutdown in background (seamless, user not aware)
  – Submits query plan directly to Tez Session

Native Hadoop service, not ad-hoc

*Courtesy of Hortonworks

Page 16: Overview of the Hive Stinger Initiative

Stinger Phase 3: Interactive Query In Hadoop

Query              Hive 10    Hive 0.11 (Phase 1)    Trunk (Phase 3)    Improvement
TPC-DS Query 27    1400s      39s                    7.2s               190x
TPC-DS Query 82    3200s      65s                    14.9s              200x

Query 27: Pricing Analytics using Star Schema Join

Query 82: Inventory Analytics Joining 2 Large Fact Tables

All Results at Scale Factor 200 (Approximately 200GB Data)

*Courtesy of Hortonworks

Page 17: Overview of the Hive Stinger Initiative

How you can use Stinger enhancements

• Use Hive 0.13

• Use ORC: create table … stored as ORC

• Enable vectorization: set hive.vectorized.execution.enabled=true

• Enable Tez: set hive.execution.engine=tez

• See http://hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/

Page 18: Overview of the Hive Stinger Initiative

Reference(s)

• Stinger overview, Strata, fall 2013: http://www.slideshare.net/alanfgates/strata-stingertalk-oct2013?qid=09d16028-bd7e-47d8-8438-34f3242c6f0e&v=qf1&b=&from_search=1

Slides marked “Courtesy of Hortonworks” are from Hortonworks talks