interactive query in hadoop


Upload: rommel-garcia

Post on 27-Jan-2015


DESCRIPTION

Hive 13 & Tez providing Human Interactive Query across petabytes of data.

TRANSCRIPT

Page 1: Interactive query in hadoop

Page 1 © Hortonworks Inc. 2014

Interactive Query In Hadoop

Rommel Garcia

Solutions Engineer

May 3, 2014

Hortonworks. We do Hadoop.

Page 2: Interactive query in hadoop

Hadoop 2

Multi Use Data Platform: Batch, Interactive, Online, Streaming, …

HADOOP 2

• Standard Query Processing: Hive, Pig

• Batch: MapReduce

• Online Data Processing: HBase, Accumulo

• Interactive: Tez

• Real Time Stream Processing: Storm

• others

• Efficient Cluster Resource Management & Shared Services (YARN)

• Redundant, Reliable Storage (HDFS)

Page 3: Interactive query in hadoop

The Interactive Query Tech Stack

• Hive: SQL

• Tez: DAG

• YARN: Resource

• HDFS: Storage

Page 4: Interactive query in hadoop

Hive

Page 5: Interactive query in hadoop

Hive

Open source project that

• facilitates querying (SQL compliant) of data

• projects structure onto data

residing in distributed storage such as HDFS.

Page 6: Interactive query in hadoop

Hive SQL Compliance

Page 7: Interactive query in hadoop

Hive Performance

Feature | Description | Benefit
Tez Integration | Tez is a significantly better execution engine than MapReduce | Latency
Vectorized Query | Takes advantage of modern hardware by processing thousand-row blocks rather than a row at a time | Throughput
Query Planner | Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) | Latency
ORC File | Columnar, type-aware format with indices | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics, including histograms etc. | Latency

Page 8: Interactive query in hadoop

Vectorization Using Modern CPU

[Figure: CPU processing a block of 10K rows]
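The block-at-a-time idea can be sketched in plain Java. This is only a toy illustration, not Hive's vectorized execution code: the class name `VectorizedSum`, the block size of 1024, and the `price > threshold` filter are all assumptions made for the example.

```java
import java.util.Arrays;

/** Toy illustration of block-at-a-time (vectorized) processing; not Hive code. */
final class VectorizedSum {
    // Hypothetical block size, chosen for illustration only.
    static final int BLOCK_SIZE = 1024;

    /** Sums the values in one column block that pass the filter value > threshold. */
    static long sumBlock(long[] block, long threshold) {
        long sum = 0;
        // Tight loop over a primitive array: no per-row object, no per-row call.
        for (int i = 0; i < block.length; i++) {
            if (block[i] > threshold) {
                sum += block[i];
            }
        }
        return sum;
    }

    /** Walks the column one block at a time and accumulates the per-block sums. */
    static long sumColumn(long[] column, long threshold) {
        long total = 0;
        for (int start = 0; start < column.length; start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, column.length);
            total += sumBlock(Arrays.copyOfRange(column, start, end), threshold);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] column = new long[3000];
        for (int i = 0; i < column.length; i++) {
            column[i] = i; // values 0..2999
        }
        System.out.println(sumColumn(column, 999)); // sum of 1000..2999
    }
}
```

The inner loop touches only a primitive array, which is the property that lets a modern CPU pipeline the work; the copy per block is kept here for clarity and would be avoided in a real engine.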

Page 9: Interactive query in hadoop

Hive Optimizations

• Pre-warmed Containers (Hive Query Server)

• Low-latency Dispatch (Hive Query Server)

• DAG utilization (Tez)

• Buffer Caching (cache accessed data)

• Predicate Pushdown
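Predicate pushdown combined with a min/max index can be sketched as follows. This is a simplified model under stated assumptions, not the ORC reader: `MinMaxSkipping`, `RowGroup`, and `scanGreaterThan` are hypothetical names, and a single numeric column stands in for a real file.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of index-based skipping; names are hypothetical, not the ORC API. */
final class MinMaxSkipping {
    /** A group of rows plus the min/max statistics kept for one column. */
    static final class RowGroup {
        final long min;
        final long max;
        final long[] values;

        RowGroup(long... values) {
            long lo = Long.MAX_VALUE;
            long hi = Long.MIN_VALUE;
            for (long v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
            this.values = values;
        }
    }

    /**
     * Returns all values greater than threshold. groupsRead[0] is incremented
     * for every group that actually had to be scanned.
     */
    static List<Long> scanGreaterThan(List<RowGroup> groups, long threshold, int[] groupsRead) {
        List<Long> out = new ArrayList<>();
        for (RowGroup g : groups) {
            if (g.max <= threshold) {
                continue; // statistics prove that no row can match: skip the group
            }
            groupsRead[0]++;
            for (long v : g.values) {
                if (v > threshold) {
                    out.add(v);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = new ArrayList<>();
        groups.add(new RowGroup(1, 2, 3));
        groups.add(new RowGroup(10, 20, 30));
        groups.add(new RowGroup(4, 5, 6));
        int[] groupsRead = {0};
        System.out.println(scanGreaterThan(groups, 9, groupsRead) + ", groups read: " + groupsRead[0]);
    }
}
```

With the filter pushed down to the reader, two of the three groups are never scanned; the same reasoning applies at partition, stripe, or row-group granularity.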

Page 10: Interactive query in hadoop

Hive - ORCFile

Page 11: Interactive query in hadoop

Tez

Page 12: Interactive query in hadoop

Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.

• Expresses computation as a dataflow graph.

• Flexible Input-Processor-Output runtime model.

• Uses caching extensively.

• Data type agnostic

• Built on top of YARN

• Apache licensed.

Page 13: Interactive query in hadoop

Hive On Tez - Execution

Feature | Description | Benefit
Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching the Tez AppMaster | Latency
Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out-of-box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput | Throughput

Page 14: Interactive query in hadoop

SELECT a.state, COUNT(*), AVG(c.price) FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Comparing Tez vs. MR – running queries in Hive

• To express the above query in MapReduce, Hive needs to compose and execute four separate MR jobs.

• Each MR job comes at a cost of job start-up and disk I/O as the results are written and re-read between MR jobs. This takes too long!

Page 15: Interactive query in hadoop

SELECT a.state, COUNT(*), AVG(c.price) FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Comparing Tez vs. MR – running queries in Hive

• Using the Tez framework, this query can be expressed as a single execution graph.

• No wasted I/O. Each node in the graph streams results to the next node.

• No wasted job start up. Tez provides “hot containers” for jobs to be immediately submitted.
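The overhead argument on these two slides can be restated as a back-of-the-envelope count. The sketch below is a toy model, not a benchmark: `PlanOverhead` and its fields are hypothetical, and it assumes one intermediate write-and-re-read between each pair of consecutive MR jobs and none inside a single Tez graph.

```java
/** Back-of-the-envelope overhead counts for a query plan; a toy model, not a benchmark. */
final class PlanOverhead {
    final int launches;               // job or DAG start-ups paid by the query
    final int intermediateRoundTrips; // results written to disk and re-read between jobs

    private PlanOverhead(int launches, int intermediateRoundTrips) {
        this.launches = launches;
        this.intermediateRoundTrips = intermediateRoundTrips;
    }

    /** A chain of MR jobs: every job starts up, every hand-off goes through disk. */
    static PlanOverhead mapReduceChain(int jobs) {
        if (jobs < 1) {
            throw new IllegalArgumentException("jobs must be >= 1");
        }
        return new PlanOverhead(jobs, jobs - 1);
    }

    /** A single graph: one submission, results stream from node to node. */
    static PlanOverhead singleDag() {
        return new PlanOverhead(1, 0);
    }

    public static void main(String[] args) {
        PlanOverhead mr = mapReduceChain(4); // the four MR jobs mentioned on the slide
        PlanOverhead tez = singleDag();
        System.out.println("MR:  launches=" + mr.launches
                + ", round trips=" + mr.intermediateRoundTrips);
        System.out.println("Tez: launches=" + tez.launches
                + ", round trips=" + tez.intermediateRoundTrips);
    }
}
```

The model deliberately leaves out actual latencies, since those depend on the cluster; it only counts the two costs the slides name.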

Page 16: Interactive query in hadoop

Tez – Deep Dive – API

// Simplified for the slide: the released Tez API also takes vertex names and descriptor objects.
DAG dag = new DAG();
Vertex map1 = new Vertex(MapProcessor.class);
Vertex map2 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
Vertex reduce2 = new Vertex(ReduceProcessor.class);
Vertex join1 = new Vertex(JoinProcessor.class);
…….
// Each edge names its source and destination vertex, then its data movement, data source and scheduling type.
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
…….
// Assemble the graph: five vertices connected by four edges.
dag.addVertex(map1).addVertex(map2)
   .addVertex(reduce1).addVertex(reduce2)
   .addVertex(join1)
   .addEdge(edge1).addEdge(edge2)
   .addEdge(edge3).addEdge(edge4);

[Figure: DAG with map1 → reduce1 → join1 and map2 → reduce2 → join1; edges labeled Scatter_Gather, Bipartite, Sequential]

Simple DAG definition API
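To make the shape of this graph concrete without a Tez dependency, the same five vertices and four edges can be modeled in plain Java. `ToyDag` is a hypothetical stand-in written for this example, not part of Tez; it only shows that a dataflow graph determines a valid execution order, here computed with Kahn's topological sort.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the slide's DAG; deliberately independent of the Tez API. */
final class ToyDag {
    private final Map<String, List<String>> successors = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        successors.put(name, new ArrayList<>());
        inDegree.put(name, 0);
        return this;
    }

    /** Both endpoints must have been added with addVertex first. */
    ToyDag addEdge(String from, String to) {
        successors.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    /** Kahn's algorithm: a vertex becomes ready once all of its inputs have run. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : successors.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != successors.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    /** The graph from the slide: two map/reduce branches feeding one join. */
    static ToyDag slideExample() {
        return new ToyDag()
                .addVertex("map1").addVertex("map2")
                .addVertex("reduce1").addVertex("reduce2")
                .addVertex("join1")
                .addEdge("map1", "reduce1").addEdge("map2", "reduce2")
                .addEdge("reduce1", "join1").addEdge("reduce2", "join1");
    }

    public static void main(String[] args) {
        System.out.println(slideExample().executionOrder());
        // prints [map1, map2, reduce1, reduce2, join1]
    }
}
```

The two map/reduce branches have no edge between them, so a scheduler is free to run them in parallel; only join1 has to wait for both.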

Page 17: Interactive query in hadoop

Demo

Hive 13 + Tez

Page 18: Interactive query in hadoop

Multi-Tenancy with HiveServer2

Resource contention may exist when multiple users run very large queries simultaneously, which affects overall query latency. Apply these controls to resolve it:

• Container re-use timeout

• Tez split wave tuning

• Round Robin Queuing setup

Page 19: Interactive query in hadoop

Tez - Waves

[Figure: a queue holding five containers, C.1 to C.5, and Tez tasks T.1 to T.5]

tez.am.grouping.split-waves=3.0

15 Tasks (5 containers × 3.0 waves)
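The arithmetic behind the figure can be sketched as below. This is a simplified reading of the slide rather than Tez's actual split-grouping code: the class `SplitWaves`, both method names, and the truncating conversion to an integer are assumptions made for illustration.

```java
/** Toy arithmetic for split waves; a reading of the slide, not Tez's grouping code. */
final class SplitWaves {
    /** Target number of tasks = containers available in the queue × waves factor. */
    static int targetTasks(int availableContainers, double waves) {
        if (availableContainers < 0 || waves <= 0) {
            throw new IllegalArgumentException("containers must be >= 0 and waves > 0");
        }
        // Truncation is an assumption of this sketch.
        return (int) (availableContainers * waves);
    }

    /** Number of rounds needed to run all tasks when each container runs one task at a time. */
    static int roundsNeeded(int tasks, int containers) {
        if (containers < 1) {
            throw new IllegalArgumentException("containers must be >= 1");
        }
        return (tasks + containers - 1) / containers; // ceiling division
    }

    public static void main(String[] args) {
        int tasks = targetTasks(5, 3.0); // values taken from the slide
        System.out.println(tasks + " tasks in " + roundsNeeded(tasks, 5) + " rounds");
        // prints 15 tasks in 3 rounds
    }
}
```

A larger waves factor produces more, smaller tasks, so a container freed early can pick up further work instead of idling behind one long task; a smaller factor means fewer, larger tasks.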

Page 20: Interactive query in hadoop

Thank You!

Rommel Garcia

Hortonworks

@rommelgarcia