netflix - pig with lipstick by jeff magnusson

Putting Lipstick on Apache Pig. Big Data Gurus Meetup, August 14, 2013

Upload: hakka-labs

Post on 18-Dec-2014

DESCRIPTION

In this talk, Jeff Magnusson, Manager of Data Platform Architecture at Netflix, discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. The talk was recorded at Samsung R&D.

While Pig provides a great level of abstraction between MapReduce and dataflow logic, once scripts reach a sufficient level of complexity it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. The recently open-sourced Lipstick solves this problem. Jeff covers the architecture, implementation, and future of Lipstick, as well as use cases at Netflix (e.g., using Lipstick to speed up development and improve the efficiency of new and existing scripts).

Jeff manages the Data Platform Architecture group at Netflix, where he is helping to build a service-oriented architecture that enables easy access to large-scale, cloud-based analytical processing and analysis of data across the organization. Prior to Netflix, he received his PhD from the University of Florida, focusing on database system implementation.

TRANSCRIPT

Page 1: Netflix - Pig with Lipstick by Jeff Magnusson

Putting Lipstick on Apache Pig

Big Data Gurus Meetup, August 14, 2013

Page 2: Netflix - Pig with Lipstick by Jeff Magnusson

Data should be accessible, easy to discover, and easy to process for everyone.

Motivation

Page 3: Netflix - Pig with Lipstick by Jeff Magnusson

Big Data Users at Netflix

Analysts | Engineers

Desires: Self Service, Easy, Rich Toolset, Rich APIs

A Single Platform / Data Architecture that Serves Both Groups

Page 4: Netflix - Pig with Lipstick by Jeff Magnusson

Netflix Data Warehouse - Storage

S3 is the source of truth
– Decouples storage from processing.
– Persistent data; multiple / transient Hadoop clusters

Data sources
– Event data from cloud services via Ursula/Honu
– Dimension data from Cassandra via Aegisthus

~100 billion events processed / day

Petabytes of data persisted and available to queries on S3.

Page 5: Netflix - Pig with Lipstick by Jeff Magnusson

Netflix Data Platform - Processing

Long running clusters
– SLA and ad-hoc

Supplemental nightly bonus clusters
– For high priority ETL jobs

2,000+ instances in aggregate across the clusters

Page 6: Netflix - Pig with Lipstick by Jeff Magnusson

Netflix Hadoop Platform as a Service

S3

https://github.com/Netflix/genie

Page 7: Netflix - Pig with Lipstick by Jeff Magnusson

Netflix Data Platform – Primitive Service Layer

Primitive, decoupled services

Building blocks for more complicated tools/services/apps

Serves 1000s of MapReduce Jobs / day

100+ jobs concurrently

Page 8: Netflix - Pig with Lipstick by Jeff Magnusson

Netflix Data Platform – Tools

Sting (Ad hoc Visualization)
Looper (Backloading)
Forklift (Data Movement)
Ignite (A/B Test Analytics)
Lipstick (Workflow Visualization)
Spock (Data Auditing)

Heavily utilize services in the primitive layer.

Follow the same design philosophy as primitive apps:
– RESTful API
– Decoupled javascript interfaces

Page 9: Netflix - Pig with Lipstick by Jeff Magnusson

Pig and Hive at Netflix

• Hive
– Ad hoc queries
– Lightweight aggregation
• Pig
– Complex Dataflows / ETL
– Data movement “glue” between complex operations

Page 10: Netflix - Pig with Lipstick by Jeff Magnusson

What is Pig?

• A data flow language
• Simple to learn
– Very few reserved words
– Comparable to a SQL logical query plan
• Easy to extend and optimize
• Extendable via UDFs written in multiple languages
– Java, Python, Ruby, Groovy, Javascript

Page 11: Netflix - Pig with Lipstick by Jeff Magnusson

Sample Pig Script* (Word Count)

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

* http://en.wikipedia.org/wiki/Pig_(programming_tool)#Example
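
For readers less familiar with Pig, the same data flow can be sketched in plain Python. This is only an illustration of the logical steps (tokenize, filter, group, count, order); it is not how Pig executes the script, and `line.split()` is a simplification of Pig's `TOKENIZE`, which also splits on a few punctuation characters.

```python
import re
from collections import Counter


def word_count(lines):
    """Mirror the Pig word count data flow on an in-memory list of lines."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row.
    # Simplification: split on whitespace only.
    words = [word for line in lines for word in line.split()]
    # FILTER words BY word MATCHES '\\w+': the whole token must match.
    filtered = [w for w in words if re.fullmatch(r"\w+", w)]
    # GROUP filtered_words BY word, then COUNT the entries in each group.
    counts = Counter(filtered)
    # ORDER word_count BY count DESC (ties broken by word for stable output).
    return sorted(((n, w) for w, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))
```

Each Python statement corresponds to roughly one Pig statement, which is the same near 1:1 mapping between script lines and logical operators that Lipstick relies on later in the talk.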

Page 12: Netflix - Pig with Lipstick by Jeff Magnusson

A Typical Pig Script

Page 13: Netflix - Pig with Lipstick by Jeff Magnusson

Pig…

• Data flows are easy & flexible to express in text
– Facilitates code reuse via UDFs and macros
– Allows logical grouping of operations vs grouping by order of execution.
– But errors are easy to make and overlook.
• Scripts can quickly get complicated
• Visualization quickly draws attention to:
– Common errors
– Execution order / logical flow
– Optimization opportunities

Page 14: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick

• Generates graphical representations of Pig data flows.
• Compatible with Apache Pig v0.11+
• Has been used to monitor more than 25,000 Pig jobs at Netflix

Page 15: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick

Page 16: Netflix - Pig with Lipstick by Jeff Magnusson

Overall Job Progress

Page 17: Netflix - Pig with Lipstick by Jeff Magnusson

Logical Plan

Overall Job Progress

Page 18: Netflix - Pig with Lipstick by Jeff Magnusson

Logical Operator (reduce side)

Logical Operator (map side)

Map/Reduce Job

Intermediate Row Count

Records Loaded

Page 19: Netflix - Pig with Lipstick by Jeff Magnusson

Hadoop Counters

Page 20: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick for Fast Development
• During development:
– Keep track of data flow
– Spot common errors
  • Omitted (hanging) operators
  • Data type issues
– Easily estimate and optimize complexity
  • Number of MR jobs generated
  • Map only vs full Map/Reduce jobs
  • Opportunities to rejigger logic to:
    – Combine multiple jobs into a single job
    – Manipulate execution order to achieve better parallelism (e.g. less blocking)
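
One way to picture the "omitted (hanging) operator" check is as a reachability question on the data flow graph: an operator is hanging if nothing it produces ever reaches a STORE. The sketch below is an assumption about how such a check could be written, not Lipstick's implementation; the `edges` and `sinks` structures and the alias names are hypothetical.

```python
def find_dangling(edges, sinks):
    """Return operator aliases from which no sink (e.g. a STORE) is reachable.

    edges: dict mapping each alias to the list of aliases that consume it.
    sinks: set of aliases that write output.
    """
    # Build reverse edges so the graph can be walked backwards from the sinks.
    parents = {}
    nodes = set(edges)
    for src, dsts in edges.items():
        for dst in dsts:
            nodes.add(dst)
            parents.setdefault(dst, set()).add(src)
    # Everything reachable backwards from a sink contributes to some output.
    useful, stack = set(), list(sinks)
    while stack:
        node = stack.pop()
        if node not in useful:
            useful.add(node)
            stack.extend(parents.get(node, ()))
    return sorted(nodes - useful)
```

For example, with `edges = {"input_lines": ["words"], "words": ["filtered_words", "debug_sample"], "filtered_words": ["store"], "debug_sample": []}` and `sinks = {"store"}`, the only alias reported is `debug_sample`, the branch that was computed but never stored.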

Page 21: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick for Job Monitoring
• During execution:
– Graphically monitor execution status from a single console
– Spot optimization opportunities
  • Map vs reduce side joins
  • Data skew
  • Better parallelism settings
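
Data skew is the kind of issue that row counts on the graph make visible at a glance. As a rough illustration, skew can be summarized as the ratio between the busiest reducer's input and the median reducer's input. The function names and the threshold of 5.0 below are arbitrary assumptions for the sketch, not values taken from Lipstick.

```python
from statistics import median


def skew_ratio(records_per_reducer):
    """Ratio of the busiest reducer's record count to the median count."""
    if not records_per_reducer:
        raise ValueError("need at least one reducer")
    busiest = max(records_per_reducer)
    if busiest == 0:
        return 1.0  # nothing was processed, so there is nothing to skew
    mid = median(records_per_reducer)
    # A zero median means most reducers sat idle while one did the work.
    return float("inf") if mid == 0 else busiest / mid


def is_skewed(records_per_reducer, threshold=5.0):
    """Flag a job whose busiest reducer far exceeds the typical reducer."""
    return skew_ratio(records_per_reducer) > threshold
```

A job with reducer inputs of `[10, 12, 11, 900]` would be flagged, while `[100, 110, 90]` would not.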

Page 22: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick for Support
• Empowers users to support themselves
– Better operational visibility
  • What is my script currently doing?
  • Why is my script slow?
– Examine intermediate output of jobs
– All execution information in one place
• Facilitates communication between infrastructure / support teams and end users
– Lipstick link contains all information needed to provide support.

Page 23: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick Architecture

Pig v0.11+

lipstick-console.jar

Lipstick Server (RESTful Grails app)

Javascript Client (Frontend GUI)

RDS Persistence

Page 24: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick Architecture - Console
• Implements PigProgressNotificationListener interface
• Listens for:
  1. New statements to be registered (unoptimized plan)
  2. Script launched event (optimized, physical, M/R plan)
  3. MR Job completion/failure event
  4. Heartbeat progress (during execution)
• Converts Pig plans and progress into Lipstick objects
• Communicates with Lipstick Server
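
The listener pattern described above can be sketched in a few lines. The real console implements Pig's Java `PigProgressNotificationListener`; the Python class below is only a toy stand-in, and every method name and message field in it is hypothetical. It shows the shape of the idea: one callback per event type, each turning the event into a message forwarded to the server.

```python
class RecordingListener:
    """Toy progress listener; method names and message fields are illustrative."""

    def __init__(self, send):
        # send: any callable that would forward a message to the server.
        self.send = send

    def on_statement_registered(self, unoptimized_plan):
        # 1. New statement registered: publish the unoptimized plan.
        self.send({"event": "plan", "kind": "unoptimized",
                   "plan": unoptimized_plan})

    def on_script_launched(self, optimized_plan, physical_plan, mr_plan):
        # 2. Script launched: publish the compiled plans together.
        self.send({"event": "plan", "kind": "compiled",
                   "optimized": optimized_plan,
                   "physical": physical_plan,
                   "mapreduce": mr_plan})

    def on_job_finished(self, job_id, succeeded):
        # 3. A MapReduce job completed or failed.
        self.send({"event": "job", "job_id": job_id, "succeeded": succeeded})

    def on_heartbeat(self, percent_complete):
        # 4. Periodic progress update during execution.
        self.send({"event": "progress", "percent": percent_complete})
```

Passing `list.append` as `send` is enough to inspect the message stream in a test; a real client would serialize each message and send it over HTTP.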

Page 25: Netflix - Pig with Lipstick by Jeff Magnusson

Pig Compilation Plans

Pig Script
→ Unoptimized Logical Plan (~1:1 logical operator / line of Pig)
→ Optimized Logical Plan
→ Physical Plan
→ MapReduce Plan (grouping of Physical Operators into map or reduce jobs)

Lipstick associates Logical Operators with MapReduce jobs by inferring relationships between Logical and Physical Operations.
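
The association step can be illustrated as a join between two mappings: logical operator to physical operators, and MapReduce job to physical operators. The slides do not show how Lipstick infers these relationships, so the sketch below simply assumes both mappings are already known and that each physical operator belongs to exactly one job; all operator and job names are placeholders.

```python
def logical_to_jobs(logical_to_physical, job_to_physical):
    """Map each logical operator to the MapReduce jobs that execute it."""
    # Invert job -> physical operators into physical operator -> job.
    # Assumption: a physical operator appears in exactly one job.
    physical_to_job = {}
    for job, physical_ops in job_to_physical.items():
        for op in physical_ops:
            physical_to_job[op] = job
    # A logical operator runs in every job that hosts one of its
    # physical operators.
    result = {}
    for logical, physical_ops in logical_to_physical.items():
        jobs = {physical_to_job[op] for op in physical_ops
                if op in physical_to_job}
        result[logical] = sorted(jobs)
    return result
```

With a GROUP that compiles to a map-side and a reduce-side physical operator in the same job, and an ORDER that lands in a second job, the function returns one job per logical operator, which is the information needed to draw logical operators inside Map/Reduce job boxes.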

Page 26: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick Architecture - Server

• Simple REST interface
• It’s a Grails app!
• Pig client posts plans and puts progress
• Javascript client
  – Gets plans and progress
  – Searches jobs by job name and user name
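
The server's role can be modeled as a small store with four operations matching the bullets above: post a plan, put progress, get a job, and search. The sketch below is an in-memory stand-in written for illustration only. It is not the Grails app's API or schema, and the field names and the substring matching on job name are assumptions.

```python
class JobStore:
    """In-memory model of the server's operations; not Lipstick's schema."""

    def __init__(self):
        self._jobs = {}

    def post_plan(self, job_id, job_name, user_name, plan):
        # The Pig client creates the job record by posting its plan.
        self._jobs[job_id] = {"job_name": job_name, "user_name": user_name,
                              "plan": plan, "progress": 0}

    def put_progress(self, job_id, progress):
        # The Pig client updates progress on an existing job.
        self._jobs[job_id]["progress"] = progress

    def get(self, job_id):
        # The Javascript client reads the plan and progress together.
        return self._jobs[job_id]

    def search(self, job_name=None, user_name=None):
        # Assumption: substring match on job name, exact match on user name.
        hits = []
        for job_id, job in self._jobs.items():
            if job_name is not None and job_name not in job["job_name"]:
                continue
            if user_name is not None and user_name != job["user_name"]:
                continue
            hits.append(job_id)
        return sorted(hits)
```

Keeping writes (post, put) and reads (get, search) this separate is what lets the Pig client and the Javascript client evolve independently.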

Page 27: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick Architecture – JS Client

• Displays and annotates graphs with status / progress

• Completely decoupled from Server

• Event based design

• Periodically polls Server for job progress
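
The polling and event-based bullets combine naturally: poll on a timer, but only raise an event when the status actually changes. The sketch below shows that loop in Python rather than the client's Javascript. The `fetch` and `on_update` callables, the `done` field, and the 5 second interval are all assumptions made for the example.

```python
import time


def poll_progress(fetch, on_update, interval_seconds=5.0, sleep=time.sleep):
    """Poll fetch() until the job reports completion.

    fetch: callable returning the job's current status as a dict.
    on_update: callable invoked only when the status has changed.
    sleep: injectable so the loop can be exercised without waiting.
    """
    last = None
    while True:
        status = fetch()
        if status != last:
            on_update(status)  # event-style: notify only on change
            last = status
        if status.get("done"):
            return status
        sleep(interval_seconds)
```

Because the loop depends only on `fetch`, the display code never needs to know how the server is reached, which is one reading of "completely decoupled from Server".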

• Usability is a key focus

Page 28: Netflix - Pig with Lipstick by Jeff Magnusson

My Job has stalled.

Solving Problems with Lipstick - Common Problem #1

Page 29: Netflix - Pig with Lipstick by Jeff Magnusson
Page 30: Netflix - Pig with Lipstick by Jeff Magnusson

Unoptimized/Optimized Logical Plan Toggle

Dangling Operator

Page 31: Netflix - Pig with Lipstick by Jeff Magnusson

I didn’t get the data I was expecting

Common Problem #2

Page 32: Netflix - Pig with Lipstick by Jeff Magnusson
Page 33: Netflix - Pig with Lipstick by Jeff Magnusson
Page 34: Netflix - Pig with Lipstick by Jeff Magnusson

I don’t understand why my job failed.

Common Problem #3

Page 35: Netflix - Pig with Lipstick by Jeff Magnusson

Failed Job (light red background)

Successful Job (light blue background)

Page 36: Netflix - Pig with Lipstick by Jeff Magnusson

Future of Lipstick
• Annotate common errors and inefficiencies on the graph
– Skew / map side join opportunities / scalar issues
– E.g. Warnings / error dashboard
• Provide better details of runtime performance
– Timings annotated on graph
– Min / median / max mapper and reducer times
– Map / reduce completion over time
• Search through execution history
– Examine trends in runtime and data volumes
– History of failure / success
• Search jobs for commonalities
– Common datasets loaded / saved
– Better grasp data lineage
– Common uses of UDFs and macros
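
To make the "common datasets loaded / saved" idea concrete, here is a deliberately naive sketch that pulls LOAD and STORE paths out of Pig script text with regular expressions and counts inputs shared across scripts. A tool like Lipstick would more plausibly read this from the stored plans; scanning text this way ignores comments, parameter substitution, and macros, and is an assumption made purely for illustration.

```python
import re
from collections import Counter

# Naive patterns: a quoted path right after LOAD, or after STORE <alias> INTO.
LOAD_RE = re.compile(r"\bLOAD\s+'([^']+)'", re.IGNORECASE)
STORE_RE = re.compile(r"\bSTORE\s+\w+\s+INTO\s+'([^']+)'", re.IGNORECASE)


def datasets(script):
    """Return the paths a script loads and stores, in order of appearance."""
    return {"loaded": LOAD_RE.findall(script),
            "stored": STORE_RE.findall(script)}


def common_inputs(scripts):
    """Return (path, script_count) for inputs loaded by more than one script."""
    counts = Counter(path
                     for script in scripts
                     for path in set(LOAD_RE.findall(script)))
    shared = [(path, n) for path, n in counts.items() if n > 1]
    return sorted(shared, key=lambda pair: (-pair[1], pair[0]))
```

Joining one script's stored paths against another script's loaded paths would give a first approximation of data lineage.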

Page 37: Netflix - Pig with Lipstick by Jeff Magnusson

Lipstick on Hive

Honey?

Page 38: Netflix - Pig with Lipstick by Jeff Magnusson

A closer look…

Page 39: Netflix - Pig with Lipstick by Jeff Magnusson

Wrapping up

• Lipstick is part of Netflix OSS.
• Clone it on github at http://github.com/Netflix/Lipstick
• Check out the quickstart guide
– https://github.com/Netflix/Lipstick/wiki/Getting-Started#1-quick-start
– Get started playing with Lipstick in under 5 minutes!

• We happily welcome your feedback and contributions!

Page 40: Netflix - Pig with Lipstick by Jeff Magnusson

Jeff Magnusson: http://www.linkedin.com/in/jmagnuss | @jeffmagnusson

Thank you!

Jobs: http://jobs.netflix.com

Netflix OSS: http://netflix.github.io

Tech Blog: http://techblog.netflix.com/