
How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations

http://pig.apache.org

Thejas Nair

pig team @ Yahoo!

Apache pig PMC member


What is Pig?

Pig Latin, a high-level data processing language.

An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/


Pig Latin example

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
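For readers who do not know Pig Latin, the following is a rough plain-Python sketch of what the script above computes. It is only an illustration: the function name `filter_and_join` and the in-memory lists are assumptions, and it says nothing about how Pig actually executes the script.

```python
def filter_and_join(users, pages):
    """users: iterable of (name, age); pages: iterable of (user, url)."""
    # Fltrd = filter Users by age >= 18 and age <= 25;
    fltrd = [(name, age) for name, age in users if 18 <= age <= 25]

    # Index Pages by user so each filtered user can find its pages.
    pages_by_user = {}
    for user, url in pages:
        pages_by_user.setdefault(user, []).append(url)

    # Jnd = join Fltrd by name, Pages by user;
    # Each output row carries the fields of both inputs: (name, age, user, url).
    return [
        (name, age, name, url)
        for name, age in fltrd
        for url in pages_by_user.get(name, [])
    ]
```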


Comparison with MR in Java

[Bar chart: Hadoop vs. Pig] 1/20 the lines of code

[Bar chart: Hadoop vs. Pig, in minutes] 1/16 the development time

What about performance?


Pig Compared to Map Reduce

• Faster development time

• Data flow versus programming logic

• Many standard data operations (e.g. join) included

• Manages all the details of connecting jobs and data flow

• Copes with Hadoop version change issues


And, You Don’t Lose Power

• UDFs can be used to load, evaluate, aggregate, and store data

• External binaries can be invoked

• Metadata is optional

• Flexible data model

• Nested data types

• Explicit data flow programming


Pig performance

• PigMix: Pig vs. MapReduce


Pig optimization principles

• Unlike an RDBMS, Pig has no accurate models for data, operators, and the execution environment

• Use available reliable info. Trust user choice.

• Use rules that help in most cases

• Rules based on runtime information


Logical Optimizations

Restructure given logical dataflow graph

• Apply filter, project, limit early

• Merge foreach, filter statements

• Operator rewrites

Script:
A = load
B = foreach
C = filter

Script -> Parser -> Logical Plan (A -> B -> C) -> Logical Optimizer -> Optimized Logical Plan (A -> C -> B)
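The "apply filter early" rule can be sketched as a rewrite over a linear plan. This is a toy model, not Pig's optimizer: the plan representation and the keys `reads` and `passes_through` are assumptions made up for the illustration. The rule only moves a filter ahead of a foreach when the filter reads columns that the foreach leaves unchanged.

```python
def push_filters_early(plan):
    """plan: list of (op, info) tuples in execution order.

    A 'filter' is swapped ahead of an immediately preceding 'foreach' when
    every column the filter reads is passed through unchanged by the foreach.
    """
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            op, info = plan[i]
            prev_op, prev_info = plan[i - 1]
            if (
                op == "filter"
                and prev_op == "foreach"
                and info["reads"] <= prev_info["passes_through"]
            ):
                # Safe to filter first: fewer rows reach the foreach.
                plan[i - 1], plan[i] = plan[i], plan[i - 1]
                changed = True
    return plan
```

With a plan of load, foreach, filter, where the filter reads only a passed-through column, the result is load, filter, foreach, matching the A -> C -> B ordering on the slide.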


Physical Optimizations

Physical plan: sequence of MR jobs having physical operators.

• Built-in rules, e.g. use of the combiner

• Specified in the query, e.g. join type

Optimized Logical Plan: X -> Y -> Z

-> Translator ->

Physical/MR Plan: M(PX-PYm) R(PYr) -> M(Z)

-> Optimizer ->

Optimized Physical/MR Plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)
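The effect of adding a combiner stage (the C(...) step in the optimized plan) can be illustrated with a small simulation of a count-per-key job. This is a sketch under simplifying assumptions: the function names are invented for the example, each inner list stands for one map task's input split, and the "shuffle" is just a Python list.

```python
from collections import Counter


def map_phase(records):
    # Emit (key, 1) for every input record.
    return [(key, 1) for key in records]


def combine(pairs):
    # Pre-aggregate the output of a single map task before the shuffle.
    partial = Counter()
    for key, n in pairs:
        partial[key] += n
    return list(partial.items())


def reduce_phase(all_pairs):
    # Sum the (possibly pre-aggregated) counts per key.
    total = Counter()
    for key, n in all_pairs:
        total[key] += n
    return dict(total)


def run(splits, use_combiner):
    """Return (result, number of records sent through the shuffle)."""
    shuffled = []
    for split in splits:
        pairs = map_phase(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_phase(shuffled), len(shuffled)
```

Both settings produce the same counts; the combiner only reduces how many records cross the shuffle. It applies only to aggregations that can be computed from partial results, such as counts and sums.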


Hash Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;

[Diagram: Map 1 reads Pages block n and tags its records (1, user); Map 2 reads Users block m and tags its records (2, name). Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane).]
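The tagged, reduce-side join in the diagram can be simulated in a few lines. This is a hedged sketch, not Pig's implementation: the function name `hash_join`, the in-memory "reducers", and the use of Python's `hash` for partitioning are assumptions for illustration. The tags 1 (Pages) and 2 (Users) follow the slide.

```python
from collections import defaultdict

PAGES_TAG, USERS_TAG = 1, 2


def hash_join(users, pages, num_reducers=2):
    """users: list of (name, age); pages: list of (user, url)."""
    # Map phase: every record is emitted as (join key, source tag, row).
    emitted = [(user, PAGES_TAG, (user, url)) for user, url in pages]
    emitted += [(name, USERS_TAG, (name, age)) for name, age in users]

    # Shuffle: all records with the same key land on the same reducer.
    reducers = [
        defaultdict(lambda: {PAGES_TAG: [], USERS_TAG: []})
        for _ in range(num_reducers)
    ]
    for key, tag, row in emitted:
        reducers[hash(key) % num_reducers][key][tag].append(row)

    # Reduce phase: per key, pair every Users row with every Pages row.
    joined = []
    for groups in reducers:
        for rows in groups.values():
            for user_row in rows[USERS_TAG]:
                for page_row in rows[PAGES_TAG]:
                    joined.append(user_row + page_row)
    return joined
```

Because every record of both inputs goes through the shuffle, this is the most general strategy and also the most expensive one.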


Skew Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';

[Diagram: Map 1 reads Pages block n and tags its records (1, user); Map 2 reads Users block m and tags its records (2, name). Each map output passes through an SP box. Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).]
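The idea in the diagram, spreading the rows of a heavily repeated key over several reducers and copying the matching row from the other input to each of them, can be sketched as follows. This is a toy model under stated assumptions: the full count over Pages stands in for whatever sampling the real system does, and `hot_threshold`, the round-robin placement, and the function name are invented for the example.

```python
from collections import Counter, defaultdict


def skew_join(pages, users, num_reducers=2, hot_threshold=3):
    """pages: list of (user, url), the skewed side; users: list of (name, age)."""
    # Stand-in for a sampling pass: find keys that occur very often in Pages.
    counts = Counter(user for user, _ in pages)
    hot = {key for key, c in counts.items() if c >= hot_threshold}

    reducers = [
        {"pages": defaultdict(list), "users": defaultdict(list)}
        for _ in range(num_reducers)
    ]

    # Skewed side: rows of a hot key are dealt round-robin across reducers.
    next_slot = Counter()
    for user, url in pages:
        if user in hot:
            target = next_slot[user] % num_reducers
            next_slot[user] += 1
        else:
            target = hash(user) % num_reducers
        reducers[target]["pages"][user].append((user, url))

    # Other side: rows of a hot key are copied to every reducer.
    for name, age in users:
        targets = range(num_reducers) if name in hot else [hash(name) % num_reducers]
        for t in targets:
            reducers[t]["users"][name].append((name, age))

    joined, rows_per_reducer = [], []
    for r in reducers:
        rows = [
            page_row + user_row
            for key, page_rows in r["pages"].items()
            for page_row in page_rows
            for user_row in r["users"].get(key, [])
        ]
        rows_per_reducer.append(len(rows))
        joined.extend(rows)
    return joined, rows_per_reducer
```

With four pages for fred and two reducers, each reducer produces two of the four joined rows instead of one reducer producing all four.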


Merge Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';

[Diagram: Pages and Users are both sorted on the join key (aaron ... zach). Map 1 reads Pages aaron…amr and Users starting at aaron…; Map 2 reads Pages amy…barb and Users starting at amy…]
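When both inputs are already sorted on the join key, each map task can join by walking the two inputs in step, with no shuffle. A minimal single-task sketch, assuming in-memory lists that are sorted on their first field (the function name `merge_join` is made up for the example):

```python
def merge_join(pages, users):
    """pages: sorted list of (user, url); users: sorted list of (name, age)."""
    joined = []
    i = j = 0
    while i < len(pages) and j < len(users):
        page_key, user_key = pages[i][0], users[j][0]
        if page_key < user_key:
            i += 1          # no matching user for this page row
        elif page_key > user_key:
            j += 1          # no matching page for this user row
        else:
            # Find the run of Users rows that share this key.
            j_end = j
            while j_end < len(users) and users[j_end][0] == page_key:
                j_end += 1
            # Pair every Pages row of this key with every Users row of the run.
            while i < len(pages) and pages[i][0] == page_key:
                for user_row in users[j:j_end]:
                    joined.append(pages[i] + user_row)
                i += 1
            j = j_end
    return joined
```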


Replicated Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';

[Diagram: Pages is the large input (aaron ... zach); Users is small (aaron . zach). Map 1 reads Pages aaron…amr plus a full copy of Users; Map 2 reads Pages amy…barb plus a full copy of Users.]
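A replicated join can be sketched as a map-only job in which every map task holds the whole small input in memory. In this illustrative sketch each inner list of `pages_splits` stands for one map task's split; the names are assumptions, and the small table is assumed to fit in memory.

```python
def replicated_join(pages_splits, users):
    """pages_splits: list of splits, each a list of (user, url).
    users: the small input, a list of (name, age)."""
    joined = []
    for split in pages_splits:  # one iteration per simulated map task
        # Every map task builds its own in-memory lookup of the small input.
        lookup = {}
        for name, age in users:
            lookup.setdefault(name, []).append((name, age))
        # Stream the large input and probe the lookup; no reduce phase needed.
        for user, url in split:
            for user_row in lookup.get(user, []):
                joined.append((user, url) + user_row)
    return joined
```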


Group/cogroup optimizations

• On sorted and 'collected' data

grp = group Users by name using 'collected';

[Diagram: sorted input (labeled Pages): aaron, aaron, barney, carol, ..., zach. Map 1 reads aaron, aaron, barney; Map 2 reads carol, ...]
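If all records of a key are guaranteed to sit together in one split, grouping can be done inside the map task without a reduce phase. The sketch below assumes exactly that guarantee; the function name and the list-of-splits input are made up for the illustration.

```python
from itertools import groupby
from operator import itemgetter


def collected_group(splits):
    """splits: list of splits; each split is a list of rows whose first field
    is the group key, with equal keys adjacent and never spanning two splits."""
    groups = []
    for split in splits:  # one iteration per simulated map task
        # groupby only merges adjacent equal keys, which is all we need here.
        for key, rows in groupby(split, key=itemgetter(0)):
            groups.append((key, list(rows)))
    return groups
```

If a key did span two splits, this sketch would emit two partial groups for it, which is why the input layout matters.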


Multi-store script

A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';

[Diagram: A: load -> B: filter, then two branches. Branch 1: C1: group -> eval udf -> store into 'bydemo'. Branch 2: C2: group -> eval udf -> store into 'bystate'.]


Multi-Store Map-Reduce Plan

map: filter -> split -> local rearrange (one per branch)

reduce: multiplex -> package (one per branch) -> foreach (one per branch)
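The point of the plan above is that both groupings share one scan and one filter. A rough simulation, assuming in-memory rows and using a branch index to tag each emitted key (the function name and the tagging scheme are illustrative assumptions, not Pig's internal format):

```python
from collections import Counter


def multi_store(users):
    """users: list of (name, age, gender, city, state) rows.
    Returns (counts by (age, gender), counts by state) from a single pass."""
    # Map side: load once, filter once, then split into two tagged streams.
    shuffled = []
    for name, age, gender, city, state in users:
        if name is None:
            continue
        shuffled.append((0, (age, gender)))  # branch 0: 'bydemo'
        shuffled.append((1, state))          # branch 1: 'bystate'

    # Reduce side: route each record to its branch by tag, then count per key.
    counts = [Counter(), Counter()]
    for branch, key in shuffled:
        counts[branch][key] += 1
    return dict(counts[0]), dict(counts[1])
```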


Memory Management

Use disk if large objects don’t fit into memory

• JVM memory limit larger than physical memory: very poor performance

• Spilling on the memory threshold notification from the JVM: unreliable

• Pre-set limit for large bags, with custom spill logic for different bag types (e.g. distinct bag)
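The "pre-set limit" approach can be sketched as a container that writes its contents to a temporary file once it holds a fixed number of items, rather than waiting for a low-memory signal. This is an illustration only: the class name `SpillableBag`, the item-count limit, and the use of `pickle` are assumptions, and Pig's own bags are Java classes with their own formats.

```python
import pickle
import tempfile


class SpillableBag:
    """Keeps at most max_in_memory items in memory; older items go to disk."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_files = []  # list of (file object, number of items)

    def add(self, item):
        self.memory.append(item)
        if len(self.memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the in-memory items to a temporary file and free the list.
        spill = tempfile.TemporaryFile()
        for item in self.memory:
            pickle.dump(item, spill)
        self.spill_files.append((spill, len(self.memory)))
        self.memory = []

    def __iter__(self):
        # Replay spilled items in insertion order, then the in-memory tail.
        for spill, count in self.spill_files:
            spill.seek(0)
            for _ in range(count):
                yield pickle.load(spill)
        yield from self.memory
```

A bag with different semantics, such as a distinct bag, would need different spill logic, for example removing duplicates before writing and again when merging spill files.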


Other optimizations

• Aggressive use of combiner, secondary sort

• Lazy deserialization in loaders

• Better serialization format

• Faster regex lib, compiled pattern
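Lazy deserialization means a record is only parsed into fields when some operator actually reads a field. A minimal sketch, assuming tab-delimited text records; the class `LazyRecord` and its `parse_count` counter are hypothetical names added to make the behavior observable.

```python
class LazyRecord:
    """Holds the raw line and splits it into fields only on first access."""

    def __init__(self, raw, delimiter="\t"):
        self._raw = raw
        self._delimiter = delimiter
        self._fields = None
        self.parse_count = 0  # how many times the raw line was parsed

    def get(self, index):
        if self._fields is None:
            # Parse once, on demand; later reads reuse the parsed fields.
            self._fields = self._raw.split(self._delimiter)
            self.parse_count += 1
        return self._fields[index]
```

Records that are dropped or passed along without any field being read never pay the parsing cost.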


Future optimization work

• Improve memory management

• Join + group in single MR, if same keys used

• Even better skew handling

• Adaptive optimizations

• Automated Hadoop tuning

• …


Pig - fast and flexible

More flexibility in 0.8, 0.9

• UDFs in scripting languages (Python)

• MR job as relation

• Relation as scalar

• Turing complete Pig (0.9)

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/


Further reading

• Docs - http://pig.apache.org/docs/r0.7.0/

• Papers and talks - http://wiki.apache.org/pig/PigTalksPapers

• Training videos on vimeo.com (search 'hadoop pig')