apache big data europe 2016 power pig with spark...apache pig procedural scripting language pig...

35

Power Pig with Spark Kelly Zhang ([email protected]) Apache Big Data Europe 2016

Upload: others

Post on 04-Jul-2020

16 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

Page 1: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Power Pig with Spark

Kelly Zhang ([email protected])

Apache Big Data Europe 2016

Page 2: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Agenda

● Background

● Why Pig on Spark ?

● Design Architecture

● Benchmark

● Optimization

● Current Status & Future Work

● Q&A

Page 3: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Background

Page 4: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Apache Pig

● Procedural scripting language

● Pig Latin: similar to sql

● Heavily used for ETL

● Schema / No schema data, Pig eats everything

Page 5: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Spark

● Faster

● Generality

● Easy of use

Page 6: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Agenda

● Background

● Why Pig on Spark ?

● Design Architecture

● Benchmark

● Optimization

● Current Status & Future Work

● Q&A

Page 7: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Why Pig on Spark

● Better Performance

○ No intermediate data between stages

○ In-memory caching abstraction

○ Executor JVM Reuse

● Support Pig users to experience Spark conveniently

Page 8: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Agenda

● Background

● Why Pig on Spark ?

● Design Architecture

● Benchmark

● Optimization

● Current Status & Future Work

● Q&A

Page 9: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Design Architecture

Page 10: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Design Architecture

Page 11: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Design Architecture

Page 12: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Pig Latin to RDD<Tuple> transformations

Page 13: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Pig Latin to RDD<Tuple> transformations

Page 14: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Pig Latin to RDD<Tuple> transformations

Page 15: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Operator Mapping

Pig Operator Spark Operator

Load newAPIHadoopFile

Store saveAsNewAPIHadoopFile

Filter filter

GroupBy groupby/reduceBy

Join CoGroupRDD

ForEach mapPartitions

Sort sortByKey

Page 16: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Agenda

● Background

● Why Pig on Spark ?

● Design Architecture

● Benchmark

● Optimization

● Current Status & Future Work

● Q&A

Page 17: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Benchmark Overview

Component Version

Pig Spark branch

Hadoop 2.6.0

Spark 1.6.2

PigMix Trunk

Page 18: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Basic Configuration

spark.master=yarn-client

spark.executor.memory=6553m

spark.yarn.executor.memoryOverhead=1638

spark.executor.cores=8

spark.dynamicAllocation.enabled=true

spark.network.timeout=1200000

Page 19: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Benchmark Overview (cont’d)

Page 20: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Agenda

● Background

● Why Pig on Spark ?

● Design Architecture

● Benchmark

● Optimization

● Current Status & Future Work

● Q&A

Page 21: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Optimize GroupBy/Join

Page 22: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Optimize GroupBy/Join

Page 23: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Optimize GroupBy/Join

Page 24: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Optimize GroupBy/Join

Page 25: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Skewed Key Sort

Page 26: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Skewed Key Sort

Page 27: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Skewed Key Sort

Page 28: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Salted Key Solution

Page 29: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Skewed Key Sort Performance

There are significant performance Improvement in sort case(L10) and skewed key sort case(L9)

Page 30: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Agenda

● Background

● Why Pig on Spark ?

● Design Architecture

● Benchmark

● Optimization

● Current Status & Future Work

● Q&A

Page 31: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Current Status: Nearing end of Milestone 1

● Functional completeness: DONE

● All Unit Tests Pass: DONE

● Merge Spark Branch to Master: In Code Review

Page 32: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Ongoing Work towards Milestone 2● Implement Optimizations

○ Optimize Group by/Join - PIG-4797: DONE

○ FR Join - PIG-4771: DONE

○ Merge Join - PIG-4810: DONE

○ Skewed Join: UNDER REVIEW

● Enhance Test Infrastructure

○ Use “local-cluster” mode to run unit tests

● Spark Integration

○ Improved error, progress, stats reporting

○ YARN Cluster Mode

Page 33: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Future work: Milestone 3

● Implement More Optimizations

○ Split / MultiQuery using RDD.cache()

○ Compute optimal Shuffle Parallelism

○ Optimize/Redesign Spark Plan

● Code Stablization, Bug Fixes

Page 34: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Contribution welcomed● Git:

○ https://github.com/apache/pig/tree/spark

● Wiki :

○ https://cwiki.apache.org/confluence/display/PIG/Pig

+on+Spark

● Umbrella jira:

○ PIG-4059

https://cwiki.apache.org/confluence/display/PIG/Pig+on+Spark

https://cwiki.apache.org/confluence/display/PIG/Pig+on+Spark

https://cwiki.apache.org/confluence/display/PIG/Pig+on+Spark

Page 35: Apache Big Data Europe 2016 Power Pig with Spark...Apache Pig Procedural scripting language Pig Latin: similar to sql Heavily used for ETL Schema / No schema data, Pig eats everything

Q&A

Introduction to Apache Pig ja Hive - ut · Introduction to Apache Pig ja Hive Pelle Jakovits 30 September, 2014, Tartu. Outline •Why Pig or Hive instead of MapReduce •Apache Pig

Developing Pig on Tez - events.static.linuxfound.org...Developing Pig on Tez Mark Wagner Committer, Apache Pig LinkedIn Cheolsoo Park VP, Apache Pig Netflix. What is Pig Apache project

Apache Pig: Making data transformation easy

Apache PIG - User Defined Functions

Apache Pig. Agenda What is Apache Pig How to Setup Tutorial Examples

How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations Thejas Nair pig team @ Yahoo! Apache pig

Introduction To Apache Pig at WHUG

Apache Pig Tutorial PDF

Hortonworks Data Platform for HDInsight · • Apache Pig 0.16.0 • Apache Ranger 1.2.0 • Apache Spark 2.4.0 • Apache Sqoop 1.4.7 • Apache Storm 1.2.1 • Apache TEZ 0.9.1

Practical Problem Solving with Apache Hadoop & Pig

Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs

Studio Schema Editor Apache Directory

Apache Directory Studio Schema Editor User Guide

Apache Pig Tutorial

APACHE PIG Presented by Priagung Khusumanegara Prof. Kyungbaek Kim

COSC 416 Apache Pig - People | UBC's Okanagan campus · 2013. 1. 27. · Apache Pig Apache Pig is a high-level language for expressing Map-Reduce programs. Pig defines a language

Apache Pig Example

Apache Pig - JavaZone 2013

Apache Pig for Data Science · 2017. 12. 14. · Apache Pig for Data Science Casey Stella April9,2014 Casey Stella (Hortonworks) Apache Pig for Data Science April 9, 2014

Studio Schema Editor Apache Directorydirectory.apache.org/.../Apache_Directory_Studio... · Beside the integration in Apache Directory Studio the Apache Directory Studio Schema Editor

Introduction to Apache Pig - ut€¦ · Introduction to Apache Pig Pelle Jakovits 28 September 2016, Tartu. Outline •MapReduce recollection •Apache Pig –How to run Pig –Pig

Data Flow Languages & Apache Pig Lecture BigData Analytics€¦ · Data Flow Languages & Apache Pig Lecture BigData Analytics Julian M. Kunkel [email protected] University

Introduction to Apache Hadoop & Pig - SALSAHPCsalsahpc.indiana.edu/CloudCom2010/slides/PDF/tutorials/Yahoo... · Hadoop & Pig Milind Bhandarkar ... (hadoop, pig) (apache, pig) (hadoop,

Apache pig presentation_siddharth_mathur

Pig Udf Schema Example

High-level Programming Languages: Apache Pig and Pig Latin

Classes, Methods and Schema - Apache Isis · Table of Contents 1. Classes, Methods and Schema

Performance Analysis and Optimization of Apache Pig...Apache Pig is a language, compiler, and run-time library for simplifying the de-velopment of data-analytics applications on Apache

EEDC Apache Pig Language

Integrating Apache Sqoop And Apache Pig With Apache Hadoop · 2020-01-07 · Integrating Apache Sqoop And Apache Pig With Apache Hadoop 3 6789 SecondaryNameNode 7057 TaskTracker If

Apache PIG introduction

Apache pig as a researcher’s stepping stone

apache pig performance optimizations talk at apachecon 2010

MODULE 2 - pace.edu.in · Apache Pig Apache Pig is high level language that enables programmer to write complex MapReduce transformation using simple scripting language. Pig often