© Hortonworks Inc. 2015

Apache Tez – Present and Future

Jeff Zhang (@zjffdu)
Rajesh Balamohan (@rajeshbalamohan)


Agenda

• Tez Introduction

• Tez Feature Deep Dive

• Tez Improvement & Debuggability

• Tez Status & Roadmap


Example of MR versus Tez

Hive - MR: four jobs, with an I/O synchronization barrier between jobs
– Job 1 (Join a & b)
– Job 2 (Group by of a Join b)
– Job 3 (Group by of c)
– Job 4 (Join of S & R)

Hive - Tez: single job
– Join a & b
– Group by of a Join b
– Group by of c
– Join of S & R


Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.

• Based on expressing a computation as a dataflow graph (DAG).

• Highly customizable to meet a broad spectrum of use cases.

• Built on top of YARN – the resource management framework for Hadoop.

• Open source Apache project and Apache licensed.


What is DAG & Why DAG

• Directed Acyclic Graph

• Any complicated DAG can be composed of the following 3 basic paradigms
– Sequential (Projection, Filter, GroupBy, …)
– Merge (Join, Union, Intersect, …)
– Divide (Split, …)
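As a rough illustration of the three paradigms, the sketch below builds a small dependency graph and derives an execution order from it. The vertex names are hypothetical, and the code uses Python's standard `graphlib` module rather than any Tez API.

```python
from graphlib import TopologicalSorter

# Hypothetical vertex names. Each key maps to the set of vertices it depends on.
dag = {
    "filter": {"load_a"},          # sequential: load_a -> filter
    "join": {"filter", "load_b"},  # merge: two inputs feed one vertex
    "group_by": {"join"},          # divide: join feeds two downstream vertices
    "project": {"join"},
}


def execution_order(dependencies):
    """Return one valid execution order.

    Raises graphlib.CycleError if the graph contains a cycle,
    i.e. if it is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(dag)
print(order)
```

Every vertex appears after all of its inputs, which is the property a DAG scheduler relies on.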


Expressing DAG in Tez API

• DAG API (Logic View)
– Allows the user to build a DAG
– Topological structure of the data computation flow

• Runtime API (Runtime View)
– Application logic of each computation unit (vertex)
– How to move/read/write data between vertices


DAG API (Logic View)

• Vertex (Processor, Parallelism, Resource, etc.)

• Edge (EdgeProperty)
– DataMovement
  – Scatter Gather (Join, GroupBy, …)
  – Broadcast (Pig Replicated Join / Hive Broadcast Join)
  – One-to-One (Pig Order by)
  – Custom
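The data movement types differ in which consumer tasks see which records. The following is a minimal sketch of that routing, not the Tez API: `DataMovement`, `partition_of` and `records_for_consumer` are hypothetical names, and the CRC-based partitioner is an assumption chosen only to keep the example deterministic.

```python
import zlib
from enum import Enum


class DataMovement(Enum):
    SCATTER_GATHER = 1
    BROADCAST = 2
    ONE_TO_ONE = 3


def partition_of(key, num_partitions):
    """Deterministic partitioner (an assumption, not the Tez partitioner)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def records_for_consumer(movement, records, producer_task, consumer_task,
                         num_consumers):
    """Records from one producer task that one consumer task receives."""
    if movement is DataMovement.SCATTER_GATHER:
        # Each consumer gathers only the keys of its own partition.
        return [(k, v) for k, v in records
                if partition_of(k, num_consumers) == consumer_task]
    if movement is DataMovement.BROADCAST:
        # Every consumer receives the full output.
        return list(records)
    if movement is DataMovement.ONE_TO_ONE:
        # Producer task i feeds only consumer task i.
        return list(records) if producer_task == consumer_task else []
    raise ValueError(f"unsupported data movement: {movement}")
```

With scatter gather, the per-consumer lists are disjoint and together cover the producer's output; with broadcast, each consumer holds a full copy.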


Runtime API (Runtime View)

[Diagram: Input → Processor → Output]

• Input
– Through which the processor receives data on an edge
– One vertex can have multiple inputs

• Processor
– Application logic (one processor per vertex)
– Consumes the inputs and produces the outputs

• Output
– Through which the processor writes data to an edge
– One vertex can have multiple outputs

• Examples of Input/Output/Processor
– MRInput & MROutput (InputFormat/OutputFormat)
– OrderedGroupedKVInput & OrderedPartitionedKVOutput (Scatter Gather)
– UnorderedKVInput & UnorderedKVOutput (Broadcast & One-to-One)
– PigProcessor/HiveProcessor


Benefit of DAG

• Easier to express computation in DAG

• No intermediate data written to HDFS

• Less pressure on NameNode

• No resource queuing effort & less resource contention

• More optimization opportunity with more global context


Container-Reuse

• Reuse the same container across DAGs/Vertices/Tasks

• Benefits of Container-Reuse
– Fewer resources consumed
– Reduces overhead of launching JVMs
– Reduces overhead of negotiating with the Resource Manager
– Reduces overhead of resource localization
– Reduces network IO
– Object Caching (Object Sharing)


Tez Session

• Multiple Jobs/DAGs in one AM

• Container-reuse across Jobs/DAGs

• Data sharing between Jobs/DAGs


Dynamic Parallelism Estimation

• VertexManager
– Listens to the status of the other vertices
– Coordinates and schedules its tasks
– Handles communication between vertices


ATS Integration

• Tez is fully integrated with YARN ATS (Application Timeline Server)
– DAG Status, DAG Metrics, Task Status, Task Metrics are captured

• Diagnostics & Performance analysis
– Data source for monitoring & diagnostics
– Data source for performance analysis


Recovery

• AM can crash in corner cases
– OOM
– Node failure
– …

• Continue from the last checkpoint

• Transparent to end users
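The idea of continuing from the last checkpoint can be sketched with a toy recovery log. This is only a model under stated assumptions, not Tez's recovery implementation: `run_dag`, the JSON log format and the `crash_after` switch are all hypothetical.

```python
import json


def run_dag(tasks, log, crash_after=None):
    """Run tasks in order, appending one JSON line to `log` per finished task.

    Tasks already recorded in the log are skipped, so a restarted run
    continues from the last recorded task instead of starting over.
    `crash_after` simulates an AM crash after that many tasks have run.
    """
    done = {json.loads(line)["task"] for line in log}
    executed = []
    for name in tasks:
        if name in done:
            continue
        if crash_after is not None and len(executed) == crash_after:
            raise RuntimeError("simulated AM crash")
        executed.append(name)
        log.append(json.dumps({"task": name, "state": "SUCCEEDED"}))
    return executed


tasks = ["t1", "t2", "t3"]
log = []
try:
    run_dag(tasks, log, crash_after=2)   # first attempt crashes before t3
except RuntimeError:
    pass
resumed = run_dag(tasks, log)            # second attempt only runs t3
```

The caller submits the same list of tasks both times; only the log decides what still has to run, which is what keeps the restart transparent.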



Order By of Pig

f = Load 'foo' as (x, y);
o = Order f by x;

[Diagram: two plans for this script, each with Load, Sample (Calculate Histogram), Partition and Sort stages. One plan passes data through HDFS; the other connects the stages with Broadcast, One-to-One and Scatter Gather edges.]
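The Sample, Partition and Sort stages amount to a sample-based range partitioning. The sketch below shows that general technique in plain Python; it is an assumption about the shape of the computation, not Pig's or Tez's code, and all function names are hypothetical.

```python
import bisect
import random


def choose_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sample of keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition(keys, boundaries):
    """Send each key to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    return partitions


def order_by(keys, num_partitions, sample_size, seed=0):
    """Sample -> boundaries -> partition -> sort each partition."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    boundaries = choose_boundaries(sample, num_partitions)
    return [sorted(p) for p in range_partition(keys, boundaries)]
```

Because the partitions cover disjoint, increasing key ranges, concatenating the sorted partitions gives a globally sorted result without a final merge. The sample only affects how evenly the keys are spread across partitions.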


• Performance
– Speculation
– Intermediate File Improvements
– Better use of JVM Memory
– Shuffle Improvements

• Debuggability
– Tez UI
– Local mode
– Job Analysis Tools
– Shuffle Performance Analysis Tool


Speculation

• Useful for clusters with a mix of fast and slow nodes, or heterogeneous hardware

• Maintains periodic runtime statistics of tasks

• Triggers a speculative attempt when a task's estimated runtime > mean runtime
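A minimal sketch of the trigger rule follows. The slide does not say how the estimated runtime is computed, so extrapolating it from elapsed time and progress is an assumption here, and the function names are hypothetical.

```python
from statistics import mean


def estimated_runtime(elapsed_seconds, progress):
    """Extrapolate total runtime from elapsed time and progress in (0, 1]."""
    if not 0 < progress <= 1:
        raise ValueError("progress must be in (0, 1]")
    return elapsed_seconds / progress


def tasks_to_speculate(running, completed_runtimes):
    """Return ids of running tasks whose estimated runtime exceeds the mean.

    running: dict of task id -> (elapsed_seconds, progress)
    completed_runtimes: runtimes in seconds of already finished tasks
    """
    if not completed_runtimes:
        return []  # no statistics yet, so nothing to compare against
    mean_runtime = mean(completed_runtimes)
    return sorted(
        task_id
        for task_id, (elapsed, progress) in running.items()
        if estimated_runtime(elapsed, progress) > mean_runtime
    )


running = {"t1": (50, 0.5), "t2": (90, 0.3)}
candidates = tasks_to_speculate(running, [100, 120, 110])
```

Here `t1` is on track to finish in 100 seconds, below the 110 second mean, while `t2` is heading for 300 seconds and becomes a candidate for a speculative attempt.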


Intermediate File Format Improvements

• Used for storing intermediate data in Tez

• Drawbacks of the earlier format
– Needs a larger buffer in memory (due to duplicate keys)
– Bigger file size on disk
– Not ideal for all use cases

• New Intermediate File Format
– Works based on (K, List<V>)
– Provides a 57% improvement in memory efficiency and a 23% improvement in disk storage

[Diagram: a Task writes Spill 1, Spill 2 and Spill 3, which are combined into a Merged Spill]

New IFile Format
Key Len | K1 | Value Len | V1 | Value Len | V2 | V_END | RLE | Value Len | V3 | …
Key Len | K2 | Value Len | V1 | Value Len | V5 | V_END | RLE | Value Len | V6 | …

Old IFile Format
Key Len | Value Len | K1 | V1
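To see why storing each key once helps when keys repeat, the sketch below compares the byte counts of the two layouts. The fixed 4-byte length fields and 4-byte end marker are assumptions made for the arithmetic; the slides do not give the real field widths.

```python
def old_format_size(records, len_bytes=4):
    """Old layout: every record stores key length, value length, key, value."""
    return sum(2 * len_bytes + len(k) + len(v) for k, v in records)


def new_format_size(records, len_bytes=4, marker_bytes=4):
    """New layout: key stored once per run of equal keys, then its values,
    then an end-of-values marker. Expects records sorted by key."""
    total = 0
    previous_key = None
    for k, v in records:
        if k != previous_key:
            if previous_key is not None:
                total += marker_bytes      # close the previous key's value list
            total += len_bytes + len(k)    # key length + key, written once
            previous_key = k
        total += len_bytes + len(v)        # value length + value
    if previous_key is not None:
        total += marker_bytes              # close the last value list
    return total


records = [(b"k1", b"a"), (b"k1", b"b"), (b"k1", b"c"), (b"k2", b"d")]
print(old_format_size(records), new_format_size(records))
```

Under these assumed widths the saving grows with the number of values per key and with key length; for data where every key is unique, the end markers make the new layout slightly larger.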








Better use of JVM Memory

• BytesWritable Improvements
– Provides FastByteSerialization
– Saves 8 bytes per key-value pair
– Reduces IFile size by 25%
– Reduces SERDE costs

• PipelinedSorter can support > 2 GB sort buffers
– Containers with higher RAM are no longer limited by the 2 GB sort buffer limit
– Avoids unnecessary spills in large jobs

• Reduced key comparison costs in PipelinedSorter

[Diagram: a Key and Value serialized to memory and to disk, shown as Key Size, Bytes, Val Size, Bytes in one layout and as Key Size, BytesSize, Val Size, BytesSize in the other]


Better use of JVM Memory - Contd

• Enabled RLE in reducer codepath
– Reduced key comparisons in merge codepath
– Improved job runtime (observed 10% improvement)
– Reduced CPU cost

Without Fix: 691 seconds

With Fix: 621 seconds


Better use of JVM Memory - Contd

• WeightedMemoryDistributor for better memory management in tasks
– Observed 26% runtime improvement in tasks


Broadcast Shuffle Improvements

[Diagram: a Source Task broadcasts to three groups of tasks (Task 1, Task 2, …, Task N), shown Before Fix and After Fix; two of the groups are labelled "From local disk"]


PipelinedShuffle Improvements

• Final merge in source task is avoided
– Less IO

• Consumers are informed about spill events in advance
– Better usage of network bandwidth
– Overlap CPU with network
– For sorted/unsorted outputs, send data to consumers in chunks

• Observed 20% runtime improvement in queries involving heavy skews
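The difference between the two paths can be sketched as follows. This is a simplified single-process model with hypothetical function names, not Tez code: each spill is a sorted list, and `heapq.merge` stands in for the merge step.

```python
import heapq


def normal_shuffle(spills):
    """Producer merges all sorted spills first, then ships one merged run."""
    merged = list(heapq.merge(*spills))
    return [merged]


def pipelined_shuffle(spills):
    """Producer ships each sorted spill as soon as it exists.

    No producer-side final merge takes place; each yielded list models
    one spill event that a consumer can start fetching immediately.
    """
    for spill in spills:
        yield list(spill)


def consume(transfers):
    """Consumer merges whatever sorted runs it received."""
    return list(heapq.merge(*transfers))


spills = [[1, 4, 7], [2, 5], [3, 6]]
assert consume(normal_shuffle(spills)) == consume(pipelined_shuffle(spills))
```

Both paths deliver the same sorted data. The pipelined path moves the merge work to the consumer and lets transfers begin before the producer has finished.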

[Diagram: Normal Shuffle Path versus Pipelined Shuffle Path. In both, Task 1 and Task 2 produce spills (Spill 1, Spill 2, Spill 3) that are fetched by Reduce Task 1 … Reduce Task N. One path first combines the spills into a Merged Spill; the other serves the individual spills.]


PipelinedShuffle Improvements

Job Runtime: 925 seconds

Job Runtime: 680 seconds
– 26% improvement
– Avoids final merge (less IO, CPU cost)
– Downstream can consume data whenever a spill is generated
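The improvement figure follows directly from the two runtimes. A small helper (hypothetical, for the arithmetic only) makes the calculation explicit:

```python
def percent_improvement(before_seconds, after_seconds):
    """Relative runtime reduction, as a percentage of the earlier runtime."""
    return 100.0 * (before_seconds - after_seconds) / before_seconds


improvement = percent_improvement(925, 680)
print(f"{improvement:.1f}%")  # about 26.5%, quoted on the slide as 26%
```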


• Performance
– Speculation
– Better use of JVM Memory
– Intermediate File Improvements
– Shuffle Improvements

• Debuggability
– Tez UI
– Local mode
– Job Analysis Tools
– Shuffle Performance Analysis Tool


Tez UI


Download data from ATS


Better Debuggability – Local Mode

• Test Tez jobs without a Hadoop cluster

• Enables fast prototyping

• Fast unit testing

• Runs in a single JVM (easy for debugging)

• Scheduling / RPC invocations skipped


Job Analysis Tools

• DAG Swimlane
– In $TEZ_HOME/tez-tools/swimlanes/, run "sh yarn-swimlanes.sh <app_id>"

[Swimlane diagram annotations: Prewarm, Container Reuse, Remote Reads]


Shuffle Performance Analysis Tools

• Analyze Tez logs in Hadoop

• Analyze shuffle performance between source / destination nodes

[Figure annotation: data transfers from node 7 to the rest of the nodes are slow]


Shuffle Performance Analysis Tools

• Analyze shuffle performance between source / destination nodes


RoadMap

• Shared output edges
– Same output to multiple vertices

• Local mode stabilization

• Optimizing (include/exclude) vertex at runtime

• Partial completion VertexManager

• Co-Scheduling

• Framework stats for better runtime decisions


Tez – Adoption

• Apache Hive
– Starting from Hive 0.13
– set hive.execution.engine=tez;

• Apache Pig
– Starting from Pig 0.14
– pig -x tez

• Cascading

• Flink


Tez Community

• Useful Links
– http://tez.apache.org/
– JIRA: https://issues.apache.org/jira/browse/TEZ
– Code Repository: https://git-wip-us.apache.org/repos/asf/tez.git
– Mailing Lists
  – Dev List: [email protected]
  – User List: [email protected]
  – Issues List: [email protected]

• Tez Meetup
– http://www.meetup.com/Apache-Tez-User-Group



Thank You!

Questions & Answers
