© Hortonworks Inc. 2014

Interactive Query In Hadoop

Rommel Garcia

Solutions Engineer

May 3, 2014

Hortonworks. We do Hadoop.


Hadoop 2

Multi-Use Data Platform: Batch, Interactive, Online, Streaming, …

Redundant, Reliable Storage (HDFS)

Efficient Cluster Resource Management & Shared Services (YARN)

• Standard Query Processing: Hive, Pig

• Batch: MapReduce

• Online Data Processing: HBase, Accumulo

• Interactive: Tez

• Real-Time Stream Processing: Storm

• Others: …


The Interactive Query Tech Stack

• Hive: SQL

• Tez: DAG

• YARN: Resource

• HDFS: Storage


Hive


Hive

Open source project that

• facilitates querying (SQL compliant) of data, and

• projects structure onto data

residing in distributed storage such as HDFS.


Hive SQL Compliance


Hive Performance

Feature | Description | Benefit
Tez Integration | Tez is a significantly better engine than MapReduce | Latency
Vectorized Query | Takes advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time | Throughput
Query Planner | Uses extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) | Latency
ORC File | Columnar, type-aware format with indices | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics, including histograms etc. | Latency


Vectorization Using Modern CPU

Figure: CPU processing a block of 10K rows.
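
As a rough illustration of why block-at-a-time processing can help, the sketch below contrasts a row-at-a-time loop with a batched loop over a primitive column array. It is not Hive's vectorized execution code: the class `VectorizedSum`, its method names, and the batch size of 1024 are assumptions chosen for illustration only.

```java
// Illustrative only: hypothetical names, not Hive's vectorized execution classes.
class VectorizedSum {
    // Assumed batch size, in the spirit of "thousand-row blocks".
    static final int BATCH_SIZE = 1024;

    // Row-at-a-time: each row is a separate array, and one column is picked out per row.
    static long sumRowAtATime(long[][] rows, int columnIndex) {
        long total = 0;
        for (long[] row : rows) {
            total += row[columnIndex];
        }
        return total;
    }

    // Batched: the column is stored contiguously and consumed one block at a time,
    // so the inner loop is a tight pass over a primitive array.
    static long sumBatched(long[] column) {
        long total = 0;
        for (int start = 0; start < column.length; start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, column.length);
            for (int i = start; i < end; i++) {
                total += column[i];
            }
        }
        return total;
    }
}
```

Both methods return the same sum; the batched form simply arranges the work so that the inner loop touches contiguous memory, which is the property the slide attributes to modern CPUs.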


Hive Optimizations

• Pre-warmed Containers (Hive Query Server)

• Low-latency Dispatch (Hive Query Server)

• DAG utilization (Tez)

• Buffer Caching (cache accessed data)

• Predicate Pushdown


Hive - ORCFile


Tez


Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.

• Express computation as a dataflow graph.

• Flexible Input-Processor-Output runtime model

• Extensively use caching

• Data type agnostic

• Built on top of YARN

• Apache licensed.


Feature | Description | Benefit
Tez Session | Overcomes MapReduce job-launch latency by pre-launching the Tez AppMaster | Latency
Tez Container Pre-Launch | Overcomes MapReduce latency by pre-launching hot containers ready to serve queries | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out-of-box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput | Throughput

Hive On Tez - Execution


Comparing Tez vs. MR – running queries in Hive

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

• To express the above query in MapReduce, Hive needs to compose and execute four separate MR jobs.

• Each MR job comes at a cost of job start-up and disk I/O as the results are written and re-read between MR jobs. This takes too long!


• Using the Tez framework, this query can be expressed as a single executing graph.

• No wasted I/O. Each node in the graph streams results to the next node.

• No wasted job start up. Tez provides “hot containers” for jobs to be immediately submitted.


Tez – Deep Dive – API

DAG dag = new DAG();
// Vertices: each one wraps the processor class that runs at that stage.
Vertex map1 = new Vertex(MapProcessor.class);
Vertex map2 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
Vertex reduce2 = new Vertex(ReduceProcessor.class);
Vertex join1 = new Vertex(JoinProcessor.class);
…….
// Edges: source vertex, destination vertex, data movement type, data source type,
// scheduling type, then the output and input classes.
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
…….
// Assemble the graph from its vertices and edges.
dag.addVertex(map1).addVertex(map2)
   .addVertex(reduce1).addVertex(reduce2)
   .addVertex(join1)
   .addEdge(edge1).addEdge(edge2)
   .addEdge(edge3).addEdge(edge4);

Figure: DAG with vertices map1, map2, reduce1, reduce2 and join1; edges are labeled Scatter_Gather, Bipartite Sequential.

Simple DAG definition API
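
To make the shape of this five-vertex graph concrete, here is a minimal, self-contained sketch that models the same vertices and edges and derives an order in which they could run. It does not use the Tez API: `MiniDag` and its methods are hypothetical stand-ins, and the ordering uses a plain topological sort (Kahn's algorithm), which is an assumption about how to illustrate dependencies rather than a description of Tez's scheduler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a dataflow graph; NOT the Tez API.
class MiniDag {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    MiniDag addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    MiniDag addEdge(String from, String to) {
        addVertex(from).addVertex(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: a vertex becomes ready once all of its inputs have run.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    // The graph from the slide: two map/reduce branches feeding one join.
    static MiniDag slideExample() {
        return new MiniDag()
            .addVertex("map1").addVertex("map2")
            .addVertex("reduce1").addVertex("reduce2")
            .addVertex("join1")
            .addEdge("map1", "reduce1").addEdge("map2", "reduce2")
            .addEdge("reduce1", "join1").addEdge("reduce2", "join1");
    }
}
```

Calling `MiniDag.slideExample().executionOrder()` places join1 after both reduce vertices, and each reduce after its map, which is the dependency structure the edges express.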


Demo

Hive 13 + Tez


Multi-Tenancy with HiveServer2

Resource contention may occur when multiple users run very large queries simultaneously, which increases overall query latency. Apply these controls to mitigate it:

• Container re-use timeout

• Tez split wave tuning

• Round-robin queuing setup


Tez - Waves

Figure: a queue with containers C.1 to C.5; TEZ with tez.am.grouping.split-waves=3.0 produces 15 tasks (T.1 to T.5 shown).
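
The numbers on this slide (5 containers, a split-waves value of 3.0, 15 tasks) are consistent with a simple rule of thumb: target tasks = available containers × split-waves. The sketch below works through that arithmetic. It is a simplified model, not Tez's split-grouping code, which takes further inputs into account; the class `SplitWaves` and its method names are hypothetical.

```java
// Illustrative only: a simplified model, not Tez's split-grouping implementation.
class SplitWaves {
    // Number of tasks targeted for a given container count and wave factor
    // (fractional results are truncated).
    static int targetTasks(int availableContainers, double splitWaves) {
        return (int) (availableContainers * splitWaves);
    }

    // Rounds of execution needed to run all tasks on the given containers.
    static int wavesNeeded(int tasks, int containers) {
        return (tasks + containers - 1) / containers;
    }
}
```

With the slide's values, `SplitWaves.targetTasks(5, 3.0)` gives 15 and `SplitWaves.wavesNeeded(15, 5)` gives 3, so each container would process three tasks in turn.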


Thank You!

Rommel Garcia

Hortonworks

@rommelgarcia
