[2c1] Tez Execution Engine for Apache Pig...

Cheolsoo Park, Engineer / Netflix Big Data Platform Team

Developing a Tez Execution Engine for Apache Pig

CONTENTS

1. Background
2. What is Pig on Tez?
3. Why Apache Tez?
4. Shortcomings and What’s Next

1.  Background

1.1 Netflix Data Pipeline

Events Data Pipeline (every 15 min): Cloud apps -> Suro -> Ursula -> S3 DW

Stateful Data Pipeline (daily): Cassandra -> SSTables -> Aegisthus -> S3 DW

1.2 Netflix Big Data Platform

[Diagram components]

•  S3 DW
•  Hadoop clusters
•  Federated execution engine
•  Federated metadata service
•  Data Lineage
•  Data Visualization
•  Data Movement
•  Data Quality
•  Pig Workflow Visualization
•  Job/Cluster Performance Visualization

1.3 Data Volume

•  ~200 billion events/day
•  ~40 TB incoming data/day (compressed)
•  ~1.2 PB data read/day
•  ~100 TB data written/day
•  10+ PB DW on S3

1.4 Netflix Big Data Platform

With ever-growing data, ETL runs slower and slower.

1.5 ETL Completion Trend

1.6 Common Problems

Common problems across organizations

1.  Similar data platform architecture
    1.  Pig for ETL jobs
    2.  Hive/Presto for ad-hoc queries

1.7 Pig on Tez Team

•  Alex Bain (LinkedIn: 2013/08~2014/01, Dev)

•  Mark Wagner (LinkedIn: 2013/08~2014/01, Dev)

•  Cheolsoo Park (Netflix: 2013/08~2014/08, Dev)

•  Olga Natkovich (Yahoo: 2013/08~present, PM)

•  Rohini Palaniswamy (Yahoo: 2013/08~present, Dev)

•  Daniel Dai (Hortonworks: 2013/08~present, Dev)

2. What is Pig on Tez?

2.1 Pig Concepts

Non-blocking operators

1.  LOAD / STORE
2.  FOREACH __ GENERATE __
3.  FILTER __ BY __

Blocking operators (each is translated to a MapReduce shuffle)

1.  GROUP __ BY __
2.  ORDER __ BY __
3.  JOIN __ BY __
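The distinction between blocking and non-blocking operators can be illustrated with a small Python sketch. It is not part of Pig: `count_shuffles` is a hypothetical helper that assumes a script has already been reduced to a flat list of operator names, and it applies the simplification that each blocking operator costs exactly one shuffle (real plans can need more, or can be merged by the optimizer).

```python
# Hypothetical helper, not a Pig API: classify operators as on the slide.
BLOCKING = {"GROUP", "ORDER", "JOIN"}
NON_BLOCKING = {"LOAD", "STORE", "FOREACH", "FILTER"}


def count_shuffles(operators):
    """Return how many operators in the list are blocking.

    Simplifying assumption: one shuffle per blocking operator.
    """
    unknown = [op for op in operators if op not in BLOCKING | NON_BLOCKING]
    if unknown:
        raise ValueError(f"unclassified operators: {unknown}")
    return sum(1 for op in operators if op in BLOCKING)


if __name__ == "__main__":
    pipeline = ["LOAD", "FOREACH", "GROUP", "FOREACH", "STORE"]
    print(count_shuffles(pipeline))
```

Under this simplification, a script with several blocking operators becomes a chain of MapReduce jobs, which is the situation the later slides address.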

2.2 MapReduce Plan

Logical Plan: LOAD -> FOREACH -> GROUP BY -> FOREACH -> STORE

Physical Plan: LOAD -> FOREACH -> LOCAL REARRANGE -> GLOBAL REARRANGE -> PACKAGE -> FOREACH -> STORE

MR Plan: Map (LOAD -> FOREACH -> LOCAL REARRANGE) -> Shuffle -> Reduce (PACKAGE -> FOREACH -> STORE)

2.3 What’s the Problem?

Restrictions by MapReduce

1.  Extra intermediate output on HDFS
2.  Artificial synchronization barriers
3.  Inefficient use of resources
4.  Multi-query optimization

2.4 Tez Concepts

Low-level DAG Framework

1.  Build a DAG by defining vertices and edges.
2.  Customize the scheduling of the DAG and the movement of data.
    •  Sequential and concurrent
    •  1-1, broadcast, scatter-gather

Flexible Input-Processor-Output Model

1.  Thin API layer to wrap around arbitrary application code.
2.  Compose inputs, a processor, and outputs to execute arbitrary processing.

Input: initialize, getReader, handleEvents, close

Processor: initialize, run, handleEvents, close

Output: initialize, getWriter, handleEvents, close
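To make the "vertices and edges" idea concrete, here is a minimal Python sketch of a DAG with typed edges. It does not use the Tez API: the `Dag` class, its method names, and the edge-kind strings are all assumptions chosen for illustration, and the ordering comes from Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Edge kinds named on the slide; these string labels are illustrative only.
EDGE_KINDS = {"one_to_one", "broadcast", "scatter_gather"}


class Dag:
    """Toy DAG model (hypothetical, not the Tez DAG class)."""

    def __init__(self):
        self.vertices = set()
        self.edges = []  # list of (source, target, kind)

    def add_vertex(self, name):
        self.vertices.add(name)

    def add_edge(self, source, target, kind):
        if kind not in EDGE_KINDS:
            raise ValueError(f"unknown edge kind: {kind}")
        if source not in self.vertices or target not in self.vertices:
            raise ValueError("both endpoints must be added as vertices first")
        self.edges.append((source, target, kind))

    def execution_order(self):
        """Return one valid order in which the vertices could be started."""
        sorter = TopologicalSorter({v: set() for v in self.vertices})
        for source, target, _kind in self.edges:
            sorter.add(target, source)  # target depends on source
        return list(sorter.static_order())


if __name__ == "__main__":
    dag = Dag()
    for name in ("load", "group_y", "group_z", "join"):
        dag.add_vertex(name)
    dag.add_edge("load", "group_y", "scatter_gather")
    dag.add_edge("load", "group_z", "scatter_gather")
    dag.add_edge("group_y", "join", "scatter_gather")
    dag.add_edge("group_z", "join", "scatter_gather")
    print(dag.execution_order())
```

The point of the sketch is only that one vertex may feed several successors and a "reducer-like" vertex may feed another directly, which a fixed map-then-reduce pipeline cannot express in a single job.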

2.5 Pig on Tez

Logical Plan -> (LogToPhyTranslationVisitor) -> Physical Plan

Physical Plan -> (MRCompiler) -> MR Plan -> MR Execution Engine

Physical Plan -> (TezCompiler) -> Tez Plan -> Tez Execution Engine

2.6 Tez DAG: Split + Group By + Join

a = LOAD 'foo' AS (x, y, z);
b = GROUP a BY y;
c = GROUP a BY z;
d = JOIN b BY group, c BY group;

MR plan: Load 'foo' -> Split multiplex -> De-multiplex -> Group by y, Group by z -> HDFS -> Load g1, Load g2 -> Join g1, g2

Tez plan: Load 'foo' -> (multiple outputs) -> Group by y, Group by z -> (reducer follows reducer) -> Join g1, g2

2.7 Tez DAG: Order By

a = LOAD 'foo' AS (x, y);
b = FILTER a BY y is not null;
c = ORDER b BY x;

MR plan: Load, Sample -> Aggregate -> (stage sample map on distributed cache) -> HDFS -> Load, Partition -> Sort

Tez plan: Load, Sample -> Aggregate -> (broadcast sample map) -> Partition -> Sort

•  Load, Sample feeds Partition over a 1-1 unsorted edge.
•  The sample map is cached.
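The sample, partition, and sort stages above follow the general pattern of a sampled range-partitioned sort. The following Python sketch shows that pattern on an in-memory list; it is not Pig's implementation (Pig's partitioner also handles skewed keys, for example), and all function names and default parameters are assumptions made for illustration.

```python
import bisect
import random


def sample_boundaries(keys, num_partitions, sample_size, seed=0):
    """Pick num_partitions - 1 split points from a random sample of keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def range_partition(keys, boundaries):
    """Route each key to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    return partitions


def order_by(keys, num_partitions=4, sample_size=100):
    """Sample -> compute boundaries -> partition -> sort each partition."""
    if not keys:
        return []
    boundaries = sample_boundaries(keys, num_partitions, sample_size)
    partitions = range_partition(keys, boundaries)
    # Partitions cover disjoint, increasing key ranges, so sorting each one
    # independently and concatenating yields a globally sorted result.
    return [key for part in partitions for key in sorted(part)]


if __name__ == "__main__":
    data = [random.randint(0, 1000) for _ in range(50)]
    print(order_by(data) == sorted(data))
```

In this picture, the list of boundaries plays the role of the "sample map": every partitioning task needs the same copy of it, which is why the slide contrasts staging it on the distributed cache with broadcasting it.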

3. Why Apache Tez?

3.1 DAG Execution

DAG Execution

1.  Eliminate HDFS writes between workflow jobs.
2.  Eliminate the job launch overhead of workflow jobs.
3.  Eliminate identity mappers in every workflow job.

Benefits

1.  Faster execution and higher predictability.

3.2 MR vs. Tez

3.3 AM / Container Reuse

AM Reuse

1.  The Grunt shell uses one AM for all commands until timeout.
2.  More than one DAG is submitted for merge join, collected group, and exec.

Container Reuse

1.  Run new tasks on an already warmed-up JVM.

Benefits

1.  Reduce container launch overhead.
2.  Reduce network I/O.
    •  1-1 edge tasks are launched on the same node.

3.4 Broadcast Edge / Object Cache

Broadcast Edge

1.  Broadcast the same data to all tasks in the successor vertex.

Object Cache

1.  Share in-memory objects within the scope of a vertex or a DAG.

Benefits

1.  Replace use of the distributed cache.
2.  Avoid fetching input if the cache is available on container reuse.
    •  Replicated join runs faster on a small cluster.

3.5 Vertex Group

Vertex Group

1.  Group multiple vertices into a vertex group and produce a combined output.

Benefits

1.  Better performance due to elimination of an additional vertex.

a = LOAD 'a';
b = LOAD 'b';
c = UNION a, b;
d = GROUP c BY $0;

Without vertex group: Load a, Load b -> Union -> Group

With vertex group: Load a, Load b -> Group

3.6 Slow Start/Pre-launch

Slow Start/Pre-launch

1.  A pluggable vertex manager pre-launches the reducers before all maps have completed so that the shuffle can start (e.g. LIMIT not following ORDER BY).

Benefits

1.  Better performance due to parallel execution of multiple vertices.

3.7 Performance Numbers

[Bar chart: runtime per job, MR vs. Tez; speedup in parentheses]

Job 1 (2x): MR 20m vs. Tez 10m
Job 2 (3x): MR 1h22m vs. Tez 28m
Job 3 (1.7x): MR 2h17m vs. Tez 1h15m
Job 4 (1.2x): MR 33m vs. Tez 28m
Job 5 (1.0x): MR 3h57m vs. Tez 3h54m
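The speedup labels can be recomputed from the listed runtimes. The short Python sketch below does this arithmetic; the `to_minutes` helper and the `runs` dictionary are illustrative names, and the durations are copied from the chart, with the first value of each pair taken to be the MR runtime.

```python
import re


def to_minutes(text):
    """Convert a duration such as '1h22m' or '28m' to whole minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


# (MR runtime, Tez runtime) per job, as listed on the slide.
runs = {
    "Job 1": ("20m", "10m"),
    "Job 2": ("1h22m", "28m"),
    "Job 3": ("2h17m", "1h15m"),
    "Job 4": ("33m", "28m"),
    "Job 5": ("3h57m", "3h54m"),
}

if __name__ == "__main__":
    for job, (mr, tez) in runs.items():
        speedup = to_minutes(mr) / to_minutes(tez)
        print(f"{job}: {speedup:.2f}x")
```

The computed ratios are close to the labels on the chart; for Job 3 the listed runtimes give roughly 1.8x rather than the labeled 1.7x.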

3.8 Performance Deep Dive

This MR job blocks the DAG.

3.9 Performance Deep Dive

A huge number of intermediate files is written to HDFS.

4. Shortcomings And What’s Next

4.1 Shortcomings

Auto Parallelism

1.  Eliminating mappers without adjusting parallelism can make jobs run slower.
    •  In MR, combiners run with 1600 tasks.
    •  In Tez, combiners run with 500 tasks.

4.2 Shortcomings

Current Status

1.  User-specified parallelism always takes precedence.
2.  If no parallelism is specified, Pig estimates it using static rules. For example, if a vertex contains a filter-by, its parallelism is reduced by 50%.
3.  At execution time, parallelism is adjusted again based on per-vertex sampling.

Problems

1.  In legacy Pig jobs, parallelism is optimized for MR, so honoring user-specified parallelism can hurt performance in Tez.
2.  Static-rule-based estimation cannot always be accurate.
3.  Sample-based estimation cannot always be accurate.
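The three-step logic under Current Status can be sketched in Python as follows. This is not Pig's estimator: the function names, the `bytes_per_task` target, and the way the runtime step recomputes the task count from an observed size are assumptions for illustration. Only the precedence of user-specified parallelism and the 50% filter rule come from the slide.

```python
import math


def estimate_parallelism(upstream_parallelism, operators, user_parallelism=None):
    """Static estimate for one vertex (hypothetical helper).

    Rule 1: a user-specified value always wins.
    Rule 2: otherwise start from the upstream parallelism and halve it
            for each filter-by in the vertex.
    """
    if user_parallelism is not None:
        return user_parallelism
    estimate = float(upstream_parallelism)
    for op in operators:
        if op == "FILTER":
            estimate *= 0.5
    return max(1, math.ceil(estimate))


def adjust_at_runtime(observed_bytes, bytes_per_task):
    """Rule 3 (assumed form): recompute the task count from sampled data size."""
    return max(1, math.ceil(observed_bytes / bytes_per_task))


if __name__ == "__main__":
    print(estimate_parallelism(1600, ["FILTER"]))
    print(estimate_parallelism(1600, ["FILTER"], user_parallelism=500))
    print(adjust_at_runtime(10 * 2**30, 2**30))
```

The second call shows the first problem listed above: a value that a user tuned for MR is returned unchanged, regardless of what the rules would have estimated.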

4.3 Shortcomings

Web UI and Tools Integration

1.  Tez AM has no UI (i.e. no job page).
2.  Tez hasn’t integrated with YARN ATS (i.e. no job history page).
3.  Tez hasn’t integrated with Netflix internal tools such as Inviso and Lipstick.

4.4 What’s Next?

Tez

1.  Resolve TEZ-8: Tez UI for progress tracking and history.
    •  The Tez 0.5.x release (latest) doesn’t include TEZ-8.

Pig on Tez

1.  Improve auto parallelism and usability.
    •  Pig on Tez will be included in the Pig 0.14 release, but these issues might still be there.

THANK YOU
