Pig programming is more fun: New features in Pig
Daniel Dai (@daijy), Thejas Nair (@thejasn)

In the last year, we added many new language features to Pig, and Pig programming is now much easier than before. With Pig macros, we can write functions for Pig and modularize Pig programs. Pig embedding allows us to embed Pig statements into Python and make use of rich Python language features such as loops and branches. Java is no longer the only choice for writing Pig UDFs: we can also write UDFs in Python, JavaScript and Ruby. Nested foreach and nested cross give us ways to manipulate data that were not possible before. We also added a lot of syntactic sugar to simplify Pig syntax, for example direct syntax support for map, tuple and bag, and project-range expressions in foreach. We also revived support for the illustrate command to ease debugging. In this paper, I will give an overview of all these features and illustrate how to use them to program more efficiently in Pig. I will also give concrete examples to demonstrate how the Pig language has evolved over time with these improvements.


What is Apache Pig?

- Pig Latin, a high-level data processing language.
- An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig Latin example

Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

    USERS = load 'users' as (uid, age);
    USERS_20s = filter USERS by age >= 20 and age < 30;
    PAGES = load 'pages' as (url, uid);
    JOINED = join USERS_20s by uid, PAGES by uid;
    RESULT = foreach JOINED generate url;

Pig relation-as-scalar

- Getting the result of step 1 (average_gpa):
  - Join the result of step 1 with the students relation, or
  - Write the result into a file, then use a UDF to read from the file
- The Pig scalar feature now simplifies this:

    slow_views = filter page_views by ltime > al_rel.avg_ltime;

- Runtime exception if al_rel has more than one record.

UDF in Scripting Language

- Benefit
  - Use legacy code
  - Use libraries in the scripting language
  - Leverage Hadoop for non-Java programmers
- Currently supported languages
  - Python (0.8)
  - JavaScript (0.8)
  - Ruby (0.10)
- Extensible interface
  - Minimum effort to support another language

Writing a Python UDF

Write a Python UDF:

    @outputSchema("word:chararray")
    def concat(word):
        return word + word

    @outputSchemaFunction("squareSchema")
    def square(num):
        if num == None:
            return None
        return ((num)*(num))

    def squareSchema(input):
        return input

Invoke Python functions when needed:

    register 'util.py' using jython as util;
    B = foreach A generate util.square(i);

Type conversion:

    Python simple type    Pig simple type
    Python Array          Pig Bag
    Python Dict           Pig Map
    Python Tuple          Pig Tuple

Use NLTK in Pig

Example:

    register 'nltk_util.py' using jython as nltk;
    B = foreach A generate nltk.tokenize(sentence);

nltk_util.py:

    import nltk
    porter = nltk.PorterStemmer()

    @outputSchema("words:{(word:chararray)}")
    def tokenize(sentence):
        tokens = nltk.word_tokenize(sentence)
        words = [porter.stem(t) for t in tokens]
        return words

[Diagram: the sentence "Pig eats everything" goes through Tokenize and Stemming and becomes the bag (Pig), (eat), (everything).]

Comparison with Pig Streaming

                      Pig Streaming                          Scripting UDF
    Syntax            B = stream A through                   B = foreach A generate
                      `perl sample.pl`;                      myfunc.concat(a0, a1), a2;
    Input/Output      stdin/stdout;                          function parameter/return value;
                      entire relation                        particular fields
    Type Conversion   Need to parse input/convert type       Type conversion is automatic
    Modularize        Every streaming operator needs         Organize the functions into
                      a separate script                      a module

Writing a Script Engine

Writing a bridge UDF:

    class JythonFunction extends EvalFunc<Object> {
        public Object exec(Tuple tuple) {
            // Convert Pig input into Python
            PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
            // Invoke Python UDF
            PyObject result = f.__call__(params);
            // Convert result to Pig
            return JythonUtils.pythonToPig(result);
        }
        public Schema outputSchema(Schema input) {
            PyObject outputSchemaDef = f.__findattr__("outputSchema".intern());
            return Utils.getSchemaFromString(outputSchemaDef.toString());
        }
    }

Writing a Script Engine

Register scripting UDF:

    register 'util.py' using jython as util;

What happens in Pig:

    class JythonScriptEngine extends ScriptEngine {
        public void registerFunctions(String path, String namespace, PigContext pigContext) {
            // myudf.py             name       registered UDF
            // def square(num):  -> square  -> JythonFunction(square)
            // def concat(word): -> concat  -> JythonFunction(concat)
            // def count(bag):   -> count   -> JythonFunction(count)
        }
    }

Algebraic UDF in JRuby

    class SUM < AlgebraicPigUdf
        output_schema Schema.long
        # Initial Function
        def initial num
            num
        end
        # Intermediate Function
        def intermed num
            num.flatten.inject(:+)
        end
        # Final Function
        def final num
            intermed(num)
        end
    end

Pig Embedding

- Embed Pig inside a scripting language
  - Python
  - JavaScript
- Algorithms which cannot be completed using one Pig script
  - Iterative algorithms: PageRank, Kmeans, Neural Network, Apriori, etc.
  - Parallel independent execution: Ensemble
  - Divide and Conquer
  - Branching

Pig Embedding

    from org.apache.pig.scripting import Pig

    input = ":INPATH:/singlefile/studenttab10k"
    # Compile Pig script
    P = Pig.compile("""A = load '$in' as (name, age, gpa);
                       store A into 'output';""")
    # Bind variables
    Q = P.bind({'in': input})
    # Launch Pig script
    stats = Q.runSingle()
    # Iterate result
    result = stats.result("A")
    for t in result.iterator():
        print t

Convergence Example

    P = Pig.compile("""DEFINE myudf MyUDF('$param');
                       A = load 'input';
                       B = foreach A generate myudf(*);
                       store B into 'output';""")
    while True:
        # Bind to new parameter
        Q = P.bind({'param': new_parameter})
        results = Q.runSingle()
        iter = results.result("B").iterator()
        # Convergence check
        if converged:
            break
        # Change parameter
        new_parameter = xxxxxx

Pig Embedding

Running an embedded Pig script:

    pig sample.py

What happens within Pig?

[Diagram: Pig hands sample.py (the Python script) to Jython. Inside the while loop, Q = P.bind() and results = Q.runSingle() launch a Pig script in Pig. The loop repeats until the "converge?" check succeeds, then ends.]

Nested Operator

- Nested operator: an operator inside foreach

    B = group A by name;
    C = foreach B {
        C0 = limit A 10;
        generate flatten(C0);
    }

- Nested operators supported prior to Pig 0.10: DISTINCT, FILTER, LIMIT, and ORDER BY
- New operators added in 0.10: CROSS, FOREACH

Nested Cross/ForEach

    C = CoGroup A, B;
    D = ForEach C {
        X = Cross A, B;
        Y = ForEach X generate CONCAT(f1, f2);
        Generate Y;
    }

    A = (i0, a)        B = (i0, 0)
        (i0, b)            (i0, 1)

    CoGroup A, B:      C = (i0, {a, b}, {0, 1})
    Cross A, B:        (a, 0), (a, 1), (b, 0), (b, 1)
    ForEach CONCAT:    (i0, {(a0), (a1), (b0), (b1)})

HCatalog Integration

[Diagram: Pig, Map Reduce and Hive all sit on top of HCatalog.]

- HCatLoader/HCatStorage
  - Load/Store from HCatalog from Pig
- HCatalog DDL Integration (Pig 0.11)

    sql create table student(name string, age int, gpa double);

Misc Loaders

    HBaseStorage              Pig builtin
    AvroStorage               Piggybank
    CassandraStorage          In Cassandra code base
    MongoStorage              In Mongo DB code base
    JsonLoader/JsonStorage    Pig builtin
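
Testing a Python UDF outside Pig

The Python UDFs from the "Writing a Python UDF" slide are ordinary Python functions, so they can be unit-tested before being registered in Pig. The sketch below assumes that the outputSchema decorator is supplied by Pig's Jython engine at registration time and is therefore missing when the file is run with a plain Python interpreter. The fallback decorator is a hypothetical test shim, not part of Pig, and the fixed schema "sq:long" for square is an illustrative choice rather than the schema function used on the slide.

```python
# util.py: UDFs that can be imported by plain Python as well as by Pig.

try:
    outputSchema  # assumed to be provided by Pig when registered with jython
except NameError:
    # Hypothetical fallback for local testing only: record the schema string
    # on the function and otherwise leave the function unchanged.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("word:chararray")
def concat(word):
    # Pig nulls arrive as None; pass them through instead of raising.
    if word is None:
        return None
    return word + word


@outputSchema("sq:long")  # illustrative schema, an assumption
def square(num):
    if num is None:
        return None
    return num * num


if __name__ == "__main__":
    print(concat("pig"))
    print(square(7))
```

Run locally, the functions behave like any other Python code. Under Pig, the same file would be registered with the register statement shown on the slide.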