Apache Tez – Present and Future

© Hortonworks Inc. 2015 Page 1 Apache Tez – Present and Future Jeff Zhang (@zjffdu) Rajesh Balamohan (@rajeshbalamohan)

Upload: jeff-zhang

Post on 10-Aug-2015


TRANSCRIPT

Page 1: Apache Tez – Present and Future

Apache Tez – Present and Future

Jeff Zhang (@zjffdu), Rajesh Balamohan (@rajeshbalamohan)

Page 2: Apache Tez – Present and Future

Agenda

• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap

Page 3: Apache Tez – Present and Future

Example of MR versus Tez

Hive - MR: four jobs, with an I/O Synchronization Barrier between jobs
– Job 1 (Join a & b)
– Job 2 (Group by of a Join b)
– Job 3 (Group by of c)
– Job 4 (Join of S & R)

Hive - Tez: Single Job
– Join a & b
– Group by of a Join b
– Group by of c
– Join of S & R

Page 4: Apache Tez – Present and Future

Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.

• Based on expressing a computation as a dataflow graph (DAG).

• Highly customizable to meet a broad spectrum of use cases.

• Built on top of YARN – the resource management framework for Hadoop.

• Open source Apache project and Apache licensed.

Page 5: Apache Tez – Present and Future

What is DAG & Why DAG

• Directed Acyclic Graph
• Any complicated DAG can be composed of the following 3 basic paradigms
– Sequential (Projection, Filter, GroupBy, …)
– Merge (Join, Union, Intersect, …)
– Divide (Split, …)
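To make the three paradigms concrete, here is a minimal sketch that builds one small graph containing a sequential step, a merge and a divide, and then computes an order in which the vertices could run. The vertex names and the `topological_order` helper are illustrative assumptions, not part of the Tez API.

```python
from collections import deque


def topological_order(edges):
    """Return the vertices so that every edge points from earlier to later.

    Raises ValueError if the edges contain a cycle, i.e. the graph is not a DAG.
    """
    successors = {}
    indegree = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        successors.setdefault(dst, [])
        indegree[dst] = indegree.get(dst, 0) + 1
        indegree.setdefault(src, 0)

    # Start with the vertices that have no incoming edges.
    ready = deque(sorted(v for v, d in indegree.items() if d == 0))
    order = []
    while ready:
        vertex = ready.popleft()
        order.append(vertex)
        for nxt in successors[vertex]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order


# Hypothetical vertex names; one graph that uses all three paradigms.
edges = [
    ("load_a", "filter_a"),   # Sequential: one vertex feeds the next
    ("filter_a", "join"),     # Merge: two vertices feed one
    ("load_b", "join"),
    ("join", "split_hot"),    # Divide: one vertex feeds two
    ("join", "split_cold"),
]
order = topological_order(edges)
```

Any valid order runs `join` only after both of its inputs, and both split vertices only after `join`.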

Page 6: Apache Tez – Present and Future

Expressing DAG in Tez API

• DAG API (Logic View)
– Allows the user to build a DAG
– Topological structure of the data computation flow
• Runtime API (Runtime View)
– Application logic of each computation unit (vertex)
– How to move/read/write data between vertices

Page 7: Apache Tez – Present and Future

DAG API (Logic View)

• Vertex (Processor, Parallelism, Resource, etc.)
• Edge (EdgeProperty)
– DataMovement
  – Scatter Gather (Join, GroupBy, …)
  – Broadcast (Pig Replicated Join / Hive Broadcast Join)
  – One-to-One (Pig Order by)
  – Custom
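The difference between the data movement types is easiest to see on a toy example. The sketch below is plain Python, not the Tez API: `move`, its parameters and the modulo partitioner are assumptions chosen for illustration. It takes the output of each source task and returns what each destination task would receive.

```python
def move(movement, src_outputs, num_dst_tasks, partition=lambda key: key):
    """Return the list of records each destination task receives.

    src_outputs holds one list of (key, value) records per source task.
    """
    dst_inputs = [[] for _ in range(num_dst_tasks)]
    if movement == "SCATTER_GATHER":
        # Each source task splits its output by key; every destination
        # task gathers its own partition from all source tasks.
        for records in src_outputs:
            for key, value in records:
                dst_inputs[partition(key) % num_dst_tasks].append((key, value))
    elif movement == "BROADCAST":
        # Each destination task receives the full output of every source task.
        for records in src_outputs:
            for dst in dst_inputs:
                dst.extend(records)
    elif movement == "ONE_TO_ONE":
        # Source task i feeds destination task i only.
        if len(src_outputs) != num_dst_tasks:
            raise ValueError("one-to-one needs equal task counts on both sides")
        for index, records in enumerate(src_outputs):
            dst_inputs[index].extend(records)
    else:
        raise ValueError(f"unknown movement type: {movement}")
    return dst_inputs


src_outputs = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
scatter = move("SCATTER_GATHER", src_outputs, 2)
broadcast = move("BROADCAST", src_outputs, 2)
one_to_one = move("ONE_TO_ONE", src_outputs, 2)
```

With scatter gather, all records for key 1 end up in the same destination task, which is what a join or group-by needs. With broadcast, every destination task sees every record, which suits a small replicated join side.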

Page 8: Apache Tez – Present and Future

Runtime API (Runtime View)

[Figure: Input → Processor → Output]

• Input
– Through which the processor receives data on an edge
– One vertex can have multiple inputs
• Processor
– Application logic (one processor per vertex)
– Consumes the inputs and produces the outputs
• Output
– Through which the processor writes data to an edge
– One vertex can have multiple outputs
• Examples of Input/Output/Processor
– MRInput & MROutput (InputFormat/OutputFormat)
– OrderedGroupedKVInput & OrderedPartitionedKVOutput (Scatter Gather)
– UnorderedKVInput & UnorderedKVOutput (Broadcast & One-to-One)
– PigProcessor/HiveProcessor
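The division of labor between Input, Processor and Output can be sketched with three small in-memory classes. This is an illustration only: `ListInput`, `ListOutput` and `TokenCountProcessor` are hypothetical names, and the real Tez interfaces are Java classes with a richer lifecycle.

```python
class ListInput:
    """Input: hands the processor the records that arrived on one edge."""

    def __init__(self, records):
        self._records = list(records)

    def read(self):
        return iter(self._records)


class ListOutput:
    """Output: collects the records the processor writes to one edge."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))


class TokenCountProcessor:
    """Processor: application logic that consumes all inputs and feeds all outputs."""

    def run(self, inputs, outputs):
        counts = {}
        for logical_input in inputs.values():
            for line in logical_input.read():
                for token in line.split():
                    counts[token] = counts.get(token, 0) + 1
        for token in sorted(counts):
            for logical_output in outputs.values():
                logical_output.write(token, counts[token])


# One vertex with two inputs and two outputs, keyed by hypothetical edge names.
inputs = {"edge_a": ListInput(["x y", "x"]), "edge_b": ListInput(["y z"])}
outputs = {"edge_c": ListOutput(), "edge_d": ListOutput()}
TokenCountProcessor().run(inputs, outputs)
```

The processor never knows how the data travels: swapping `ListInput` for an input backed by a shuffle would not change `TokenCountProcessor`.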

Page 9: Apache Tez – Present and Future

Benefits of DAG

• Easier to express computation in DAG

• No intermediate data written to HDFS

• Less pressure on NameNode

• No resource queuing effort & less resource contention

• More optimization opportunity with more global context

Page 10: Apache Tez – Present and Future

Agenda

• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap

Page 11: Apache Tez – Present and Future

Container-Reuse

• Reuse the same container across DAGs/Vertices/Tasks
• Benefits of Container-Reuse
– Fewer resources consumed
– Reduces overhead of launching JVMs
– Reduces overhead of negotiating with the Resource Manager
– Reduces overhead of resource localization
– Reduces network IO
– Object Caching (Object Sharing)
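A rough way to see the saving is to count container launches for a series of tasks with and without reuse. The pool below is a hypothetical model, not how the Tez scheduler is implemented; it only shows that a released container can serve the next task instead of a fresh launch.

```python
class ContainerPool:
    """Hands out containers, keeping released ones around when reuse is on."""

    def __init__(self, reuse):
        self.reuse = reuse
        self.idle = []
        self.launched = 0

    def acquire(self):
        if self.reuse and self.idle:
            return self.idle.pop()          # no new JVM, no new negotiation
        self.launched += 1                  # stands in for launch + localization cost
        return f"container_{self.launched}"

    def release(self, container):
        if self.reuse:
            self.idle.append(container)


def run_tasks_sequentially(num_tasks, reuse):
    """Run tasks one after another and return how many containers were launched."""
    pool = ContainerPool(reuse)
    for _ in range(num_tasks):
        container = pool.acquire()
        pool.release(container)
    return pool.launched


launches_without_reuse = run_tasks_sequentially(100, reuse=False)
launches_with_reuse = run_tasks_sequentially(100, reuse=True)
```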

Page 12: Apache Tez – Present and Future

Tez Session

• Multiple Jobs/DAGs in one AM

• Container-reuse across Jobs/DAGs

• Data sharing between Jobs/DAGs

Page 13: Apache Tez – Present and Future

Dynamic Parallelism Estimation

• VertexManager
– Listens to the status of the other vertices
– Coordinates and schedules its tasks
– Communication between vertices
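One way a vertex manager could use the status of upstream vertices is to size the downstream vertex from the output it has observed so far. The heuristic below is an assumed example, not the logic of any particular Tez VertexManager; the function name, its parameters and the 256 MB target are all illustrative.

```python
import math

MB = 1024 * 1024


def estimate_parallelism(completed_output_bytes, total_source_tasks,
                         desired_bytes_per_task, min_tasks, max_tasks):
    """Pick a task count so each downstream task reads about desired_bytes_per_task.

    completed_output_bytes lists the output sizes of source tasks finished so far.
    """
    if not completed_output_bytes:
        return max_tasks  # nothing observed yet, keep the configured parallelism
    average = sum(completed_output_bytes) / len(completed_output_bytes)
    estimated_total = average * total_source_tasks
    wanted = math.ceil(estimated_total / desired_bytes_per_task)
    return max(min_tasks, min(max_tasks, wanted))


# 3 of 10 source tasks have finished, averaging 100 MB of output each.
parallelism = estimate_parallelism(
    completed_output_bytes=[100 * MB, 120 * MB, 80 * MB],
    total_source_tasks=10,
    desired_bytes_per_task=256 * MB,
    min_tasks=1,
    max_tasks=20,
)
```

Here the estimated total is about 1000 MB, so 4 tasks are enough instead of the configured 20.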

Page 14: Apache Tez – Present and Future

ATS Integration

• Tez is fully integrated with YARN ATS (Application Timeline Service)
– DAG Status, DAG Metrics, Task Status, Task Metrics are captured
• Diagnostics & Performance analysis
– Data source for monitoring & diagnostics
– Data source for performance analysis

Page 15: Apache Tez – Present and Future

Recovery

• AM can crash in corner cases
– OOM
– Node failure
– …
• Continue from the last checkpoint
• Transparent to end users

[Figure: AM Crash]
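Continuing from a checkpoint can be pictured as replaying a log of events written before the crash and rerunning only what had not finished. The slide does not describe the actual recovery log, so the event names and the `tasks_to_rerun` helper below are assumptions made for the sketch.

```python
def tasks_to_rerun(event_log, all_tasks):
    """Replay the log and return the tasks a restarted AM still has to run."""
    finished = set()
    for event, task in event_log:
        if event == "TASK_FINISHED":
            finished.add(task)
    # Tasks that were started but never finished are treated as lost.
    return [task for task in all_tasks if task not in finished]


# Hypothetical log recovered after an AM crash.
event_log = [
    ("TASK_STARTED", "t1"),
    ("TASK_FINISHED", "t1"),
    ("TASK_STARTED", "t2"),
    ("TASK_FINISHED", "t2"),
    ("TASK_STARTED", "t3"),   # crash happened while t3 was running
]
remaining = tasks_to_rerun(event_log, ["t1", "t2", "t3", "t4"])
```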

Page 16: Apache Tez – Present and Future

Order By of Pig

f = LOAD 'foo' AS (x, y);
o = ORDER f BY x;

[Figure: execution plans for the Order By. Stages: Load, Sample (Calculate Histogram), Partition, Sort. Labels on the plans: HDFS, Broadcast, One-to-One, Scatter Gather (twice).]
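The stages in the figure (sample, compute a histogram, range-partition, sort each partition) can be sketched in a few lines. This is a single-process illustration under assumed names (`order_by`, `sample_size`), not Pig's implementation: the sample picks the partition boundaries, and because the partitions cover disjoint key ranges, sorting each one independently yields a total order.

```python
import bisect
import random


def order_by(records, num_partitions, sample_size, seed=0):
    """Range-partition records using sampled boundaries, then sort each partition."""
    rng = random.Random(seed)

    # Sample stage: a small sorted sample stands in for the histogram.
    sample = sorted(rng.sample(records, min(sample_size, len(records))))

    # Boundaries at evenly spaced ranks of the sample (num_partitions - 1 cut points).
    boundaries = []
    if sample:
        boundaries = [sample[(i * len(sample)) // num_partitions]
                      for i in range(1, num_partitions)]

    # Partition stage: every record goes to the range that contains its key.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(boundaries, record)].append(record)

    # Sort stage: each partition is sorted on its own.
    return [sorted(partition) for partition in partitions]


records = [42, 7, 19, 88, 3, 61, 7, 55, 23, 94, 11, 30]
partitions = order_by(records, num_partitions=3, sample_size=6)
flattened = [record for partition in partitions for record in partition]
```

Reading the partitions in order gives the fully sorted data, without any single task having to sort everything.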

Page 17: Apache Tez – Present and Future

Agenda

• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap

Page 18: Apache Tez – Present and Future

• Performance
– Speculation
– Intermediate File Improvements
– Better use of JVM Memory
– Shuffle Improvements
• Debuggability
– Tez UI
– Local mode
– Job Analysis Tools
– Shuffle Performance Analysis Tool

Page 19: Apache Tez – Present and Future

Speculation

• Good for clusters with a mix of fast and slow nodes, or heterogeneous hardware
• Maintains periodic runtime statistics of tasks
• Triggers a speculative attempt when estimated runtime > mean runtime
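The trigger rule on this slide can be written down directly. The sketch below assumes the estimated runtime of a running task is extrapolated from its elapsed time and reported progress; that extrapolation and the function name are assumptions, since the slide only states the comparison against the mean.

```python
def should_speculate(elapsed_seconds, progress, completed_runtimes):
    """Return True when a running task looks slower than the average finished task.

    progress is a fraction in (0, 1]; completed_runtimes are in seconds.
    """
    if not completed_runtimes or progress <= 0:
        return False  # no statistics yet, so no basis for a decision
    estimated_runtime = elapsed_seconds / progress
    mean_runtime = sum(completed_runtimes) / len(completed_runtimes)
    return estimated_runtime > mean_runtime


finished = [100, 120, 110]                            # mean runtime is 110 s
slow_task = should_speculate(60, 0.25, finished)      # estimate: 240 s
healthy_task = should_speculate(30, 0.5, finished)    # estimate: 60 s
```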

Page 20: Apache Tez – Present and Future

Intermediate File Format Improvements

• Used for storing intermediate data in Tez
• Drawbacks of earlier format
– Needs larger buffer in memory (due to duplicate keys)
– Bigger file size on disk
– Not ideal for all use cases
• New Intermediate File Format
– Works based on (K, List<V>)
– Provides 57% memory efficiency and 23% improvement in disk storage

[Figure: a task's Spill 1, Spill 2 and Spill 3 are combined into a Merged Spill]

Old IFile Format

Key Len | Value Len | K1 | V1
Key Len | Value Len | K1 | V2
Key Len | Value Len | K1 | V3
Key Len | Value Len | K2 | V1
…
Key Len | Value Len | K2 | V5
Key Len | Value Len | K2 | V6

New IFile Format

Key Len | K1 | Value Len | V1 | Value Len | V2 | V_END | RLE | Value Len | V3 | …
Key Len | K2 | Value Len | V1 | Value Len | V5 | V_END | RLE | Value Len | V6 | …
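The saving from storing each key once per group can be estimated with a simplified size model. The sketch below assumes fixed 4-byte length fields and a single end-of-values marker per key; the real IFile encoding differs in detail (for example in how lengths and the RLE marker are written), so the numbers only illustrate the trend.

```python
def old_format_size(pairs, len_bytes=4):
    """Old layout: every record stores key length, value length, key and value."""
    return sum(2 * len_bytes + len(key) + len(value) for key, value in pairs)


def new_format_size(pairs, len_bytes=4):
    """(K, List<V>) layout: each run of equal keys stores the key once,
    then its values, then one end-of-values marker."""
    size = 0
    previous_key = None
    seen_any = False
    for key, value in pairs:
        if not seen_any or key != previous_key:
            if seen_any:
                size += len_bytes            # end marker closing the previous key
            size += len_bytes + len(key)     # key length + key, written once
            previous_key = key
            seen_any = True
        size += len_bytes + len(value)       # value length + value
    if seen_any:
        size += len_bytes                    # end marker closing the last key
    return size


# Sorted intermediate data with duplicate keys, as produced by a group-by.
pairs = [(b"k1", b"v1"), (b"k1", b"v2"), (b"k1", b"v3"), (b"k2", b"v1")]
old_size = old_format_size(pairs)
new_size = new_format_size(pairs)
```

The more values share a key, the larger the gap; for data with unique keys the grouped layout is slightly larger under this model, which matches the slide's point that no single format is ideal for all use cases.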

Page 21: Apache Tez – Present and Future

Better use of JVM Memory

• BytesWritable Improvements
– Provides FastByteSerialization
– Saves 8 bytes per key-value pair
– Reduces IFile size by 25%
– Reduces SERDE costs
• PipelinedSorter can support > 2 GB sort buffers
– Containers with higher RAM are no longer limited by the 2 GB sort buffer limit
– Avoids unnecessary spills in large jobs
• Reduced key comparison costs in PipelinedSorter

[Figure: layout of a key-value pair when serialized to memory and when serialized to disk. Labels: Key, Value, Key Size, Bytes, Val Size, Bytes Size.]

Page 22: Apache Tez – Present and Future

Better use of JVM Memory - Contd

• Enabled RLE in reducer codepath
– Reduced key comparisons in merge codepath
– Improved job runtime (observed 10% improvement)
– Reduced CPU cost

Without fix: 691 seconds
With fix: 621 seconds
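The quoted improvement follows from the two measured runtimes. A small check, with a helper name chosen for this example:

```python
def runtime_improvement_percent(before_seconds, after_seconds):
    """Relative reduction in runtime, as a percentage of the original runtime."""
    return (before_seconds - after_seconds) / before_seconds * 100


# 70 seconds saved out of 691, which is roughly 10%.
improvement = runtime_improvement_percent(691, 621)
```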

Page 23: Apache Tez – Present and Future

Better use of JVM Memory - Contd

• WeightedMemoryDistributor for better memory management in tasks
– Observed 26% runtime improvement in tasks
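The idea of weighting memory requests can be sketched as follows: when the inputs and outputs of a task ask for more memory than is available, components with a higher weight keep a larger share. The component kinds, weights and scaling rule below are assumptions for illustration and do not reproduce the distributor shipped with Tez.

```python
def distribute_memory(available, requests, weights):
    """Split available memory among components in proportion to weight * request.

    requests maps a component name to (kind, requested_bytes);
    weights maps a kind to its relative weight.
    """
    total_requested = sum(size for _, size in requests.values())
    if total_requested <= available:
        # Everything fits, so every component gets what it asked for.
        return {name: size for name, (_, size) in requests.items()}

    weighted = {name: size * weights[kind] for name, (kind, size) in requests.items()}
    total_weighted = sum(weighted.values())
    return {
        # Never hand out more than the component asked for.
        name: min(requests[name][1], int(available * share / total_weighted))
        for name, share in weighted.items()
    }


# Hypothetical task: a sorted output competes with an unsorted input for 1000 bytes.
requests = {"sorted_output": ("SORTED", 800), "unsorted_input": ("UNSORTED", 800)}
weights = {"SORTED": 3, "UNSORTED": 1}
allocation = distribute_memory(1000, requests, weights)
```

An even split would give each component 500; the weights shift memory toward the sorter, where a larger buffer is more likely to avoid spills.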

Page 24: Apache Tez – Present and Future

Broadcast Shuffle Improvements

[Figure: Before Fix and After Fix. In both, a Source Task broadcasts to groups of tasks (Task 1, Task 2, … Task N). Labels: Broadcast, From local disk.]

Page 25: Apache Tez – Present and Future

PipelinedShuffle Improvements

• Final merge in source task is avoided
– Less IO
• Consumers are informed about spill events in advance
– Better usage of network bandwidth
– Overlap CPU with network
– For sorted/unsorted outputs, send data to consumers in chunks
• Observed 20% runtime improvement in queries involving heavy skews

[Figure: Normal Shuffle Path versus Pipelined Shuffle Path. In the normal path, a task's spills (Spill 1, Spill 2, Spill 3) are combined into a Merged Spill before Reduce Task 1 … Reduce Task N fetch it. In the pipelined path, the reduce tasks fetch the individual spills.]
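The two paths can be contrasted with generators: the normal path yields a single event after merging every spill, while the pipelined path yields one event per spill so the consumer can start early. Function and event names are assumptions for this sketch; it models only the ordering of work, not network transfer.

```python
def normal_shuffle(spills):
    """Producer merges all spills first, then announces one merged output."""
    merged = sorted(record for spill in spills for record in spill)
    yield ("MERGED", merged)


def pipelined_shuffle(spills):
    """Producer announces each spill as soon as it exists; no final merge."""
    for index, spill in enumerate(spills):
        yield (f"SPILL_{index + 1}", sorted(spill))


def consume(events):
    """Consumer fetches every announced chunk and merges on its own side."""
    fetches = 0
    records = []
    for _name, chunk in events:
        fetches += 1
        records.extend(chunk)
    return fetches, sorted(records)


spills = [[5, 1, 9], [3, 7], [2, 8]]
normal_fetches, normal_records = consume(normal_shuffle(spills))
pipelined_fetches, pipelined_records = consume(pipelined_shuffle(spills))
```

Both paths deliver the same records. The pipelined path trades one large fetch for several smaller ones and moves the merge to the consumer, which is where the overlap of CPU and network comes from.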

Page 26: Apache Tez – Present and Future

PipelinedShuffle Improvements

Job Runtime: 925 seconds
Job Runtime: 680 seconds
– 26% improvement
– Avoids final merge (less IO, CPU cost)
– Downstream can consume data whenever a spill is generated

Page 27: Apache Tez – Present and Future

• Performance
– Speculation
– Better use of JVM Memory
– Intermediate File Improvements
– Shuffle Improvements
• Debuggability
– Tez UI
– Local mode
– Job Analysis Tools
– Shuffle Performance Analysis Tool

Page 28: Apache Tez – Present and Future

Tez UI

Page 29: Apache Tez – Present and Future

Tez UI

Page 30: Apache Tez – Present and Future

Tez UI

Download data from ATS

Page 31: Apache Tez – Present and Future

Better Debuggability – Local Mode

• Test Tez jobs without a Hadoop cluster
• Enables fast prototyping
• Fast unit testing
• Runs in a single JVM (easy for debugging)
• Scheduling / RPC invocations are skipped

Page 32: Apache Tez – Present and Future

Job Analysis Tools

• DAG Swimlane
– “sh $TEZ_HOME/tez-tools/swimlanes/yarn-swimlanes.sh <app_id>”

[Figure: swimlane chart. Labels: Prewarm, Container Reuse, Remote Reads.]

Page 33: Apache Tez – Present and Future

Shuffle Performance Analysis Tools

• Analyze Tez logs in Hadoop
• Analyze shuffle performance between source / destination nodes

[Figure annotation: data transfers from node 7 to the rest of the nodes are slow]
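A finding such as "node 7 is slow as a source" can be derived by aggregating transfer records per source node and comparing rates. The record layout, the threshold and the function name below are assumptions; the slide does not show the format of the parsed logs.

```python
from collections import defaultdict
from statistics import median


def slow_source_nodes(transfers, threshold=0.5):
    """Return source nodes whose outgoing rate is below threshold * median rate.

    transfers is an iterable of (src_node, dst_node, bytes_moved, seconds).
    """
    total_bytes = defaultdict(int)
    total_seconds = defaultdict(float)
    for src, _dst, bytes_moved, seconds in transfers:
        total_bytes[src] += bytes_moved
        total_seconds[src] += seconds

    # Outgoing rate per source node, skipping nodes with no measured time.
    rates = {src: total_bytes[src] / total_seconds[src]
             for src in total_bytes if total_seconds[src] > 0}
    if not rates:
        return []
    cutoff = threshold * median(rates.values())
    return sorted(src for src, rate in rates.items() if rate < cutoff)


# Hypothetical parsed shuffle records: (source, destination, bytes, seconds).
transfers = [
    ("node1", "node2", 1000, 10),
    ("node2", "node3", 2000, 20),
    ("node3", "node1", 500, 5),
    ("node7", "node1", 100, 10),
    ("node7", "node2", 100, 10),
]
slow = slow_source_nodes(transfers)
```

Grouping by destination instead of source in the same way would point at nodes that are slow to receive.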

Page 34: Apache Tez – Present and Future

Shuffle Performance Analysis Tools

• Analyze shuffle performance between source / destination nodes

Page 35: Apache Tez – Present and Future

Roadmap

• Shared output edges
– Same output to multiple vertices

• Local mode stabilization

• Optimizing (include/exclude) vertex at runtime

• Partial completion VertexManager

• Co-Scheduling

• Framework stats for better runtime decisions

Page 36: Apache Tez – Present and Future

Tez – Adoption

• Apache Hive
– Available from Hive 0.13
– set hive.execution.engine=tez;
• Apache Pig
– Available from Pig 0.14
– pig -x tez
• Cascading
• Flink

Page 37: Apache Tez – Present and Future

Tez Community

• Useful Links
– http://tez.apache.org/
– JIRA: https://issues.apache.org/jira/browse/TEZ
– Code Repository: https://git-wip-us.apache.org/repos/asf/tez.git
– Mailing Lists
  – Dev List: [email protected]
  – User List: [email protected]
  – Issues List: [email protected]
• Tez Meetup
– http://www.meetup.com/Apache-Tez-User-Group

Page 38: Apache Tez – Present and Future

Thank You!

Questions & Answers