Pig and Python to Process Big Data
DESCRIPTION
April 8th, 2013 Presentation to Omaha Dynamic Languages User Group
TRANSCRIPT
![Page 1: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/1.jpg)
Big Data with Pig and Python
Shawn Hermans
Omaha Dynamic Languages User Group
April 8th, 2013
Tuesday, April 9, 13
![Page 2: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/2.jpg)
About Me
• Mathematician/Physicist turned Consultant
• Graduate Student in CS at UNO
• Current Software Engineer at Sojern
![Page 3: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/3.jpg)
Working with Big Data
![Page 4: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/4.jpg)
What is Big Data?
Data Source                 Size
Wikipedia Database Dump     9GB
Open Street Map             19GB
Common Crawl                81TB
1000 Genomes                200TB
Large Hadron Collider       15PB annually
Gigabytes - Normal size for relational databases
Terabytes - Relational databases may start to experience scaling issues
Petabytes - Relational databases struggle to scale without a lot of fine tuning
![Page 5: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/5.jpg)
Working With Data
Expectation vs. Reality
• Different File Formats
• Missing Values
• Inconsistent Schema
• Loosely Structured
• Lots of it
![Page 6: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/6.jpg)
MapReduce
Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview
• Map - Emit key/value pairs from data
• Reduce - Collect data with common keys
• Tries to minimize moving data between nodes
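As a rough illustration of the map and reduce steps, here is a minimal single-process sketch of a word count. The function names and sample records are made up for the example, and the in-memory `shuffle` step only stands in for the grouping that a real framework performs across nodes.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, value) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine all values that share a key."""
    return dict((key, sum(values)) for key, values in groups.items())

# Made-up input records for the example.
records = ["pig eats anything", "Pig is a data flow language", "python eats data"]
word_counts = reduce_phase(shuffle(map_phase(records)))
```

In a cluster the map calls run where the data already lives, and only the grouped pairs travel between nodes.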
![Page 7: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/7.jpg)
MapReduce Issues
• Very low-level abstraction
• Cumbersome Java API
• Unfamiliar to data analysts
• Rudimentary support for data pipelines
![Page 8: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/8.jpg)
Pig
• Eats anything
• SQL-like, procedural data flow language
• Extensible with Java, Jython, Groovy, Ruby or JavaScript
• Provides opportunities to optimize workflows
![Page 9: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/9.jpg)
Alternatives
• Java MapReduce API
• Hadoop Streaming
• Hive
• Spark
• Cascading
• Cascalog
![Page 10: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/10.jpg)
Python
• Data analysis - pandas, numpy, networkx
• Machine learning - scikits.learn, milk
• Scientific - scipy, pyephem, astropysics
• Visualization - matplotlib, d3py, ggplot
![Page 11: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/11.jpg)
Pig Features
![Page 12: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/12.jpg)
Input/Output
• HBase
• JDBC Database
• JSON
• CSV/TSV
• Avro
• Protobuf
• Sequence File
• Hive Columnar
• XML
• Apache Log
• Thrift
• Regex
![Page 13: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/13.jpg)
Relational Operators
LIMIT GROUP FILTER CROSS
COGROUP JOIN STORE DISTINCT
FOREACH LOAD ORDER UNION
![Page 14: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/14.jpg)
Built In Functions
COS SIN AVG SUM
COUNT RANDOM LOWER UPPER
CONCAT MAX MIN TOKENIZE
![Page 15: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/15.jpg)
User Defined Functions
• Easy way to add arbitrary code to Pig
• Eval - Filter, aggregate, or evaluate
• Storage - Load/Store data
• Full support for Java and Jython
• Experimental support for Groovy, Ruby and JavaScript
![Page 16: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/16.jpg)
Census Example
![Page 17: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/17.jpg)
Getting Data
![Page 18: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/18.jpg)
Convert to TSV
ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"
• Uses Geospatial Data Abstraction Library (GDAL) to convert to TSV
• TSV > CSV
![Page 19: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/19.jpg)
Inspect Headers
f = open('CSA_2010Census_DP1.tsv')
header = f.readline()
headers = header.strip('\n').split('\t')
list(enumerate(headers))
[(0, 'WKT'), (1, 'GEOID10'), (2, 'NAMELSAD10'), (3, 'ALAND10'), (4, 'AWATER10'), (5, 'INTPTLAT10'), (6, 'INTPTLON10'), (7, 'DP0010001'), . . .
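Without a schema, Pig refers to columns by position ($0, $1, and so on), so the enumerated header is what tells you which index to use. A small sketch of turning the header into a lookup table follows. The header string repeats only the eight column names shown above, and `pig_refs` is a name chosen for the example.

```python
# Header line as it would be read from the TSV; only the first eight
# column names shown in the output above are included here.
header = "WKT\tGEOID10\tNAMELSAD10\tALAND10\tAWATER10\tINTPTLAT10\tINTPTLON10\tDP0010001\n"
headers = header.strip("\n").split("\t")

# Map each column name to the positional reference Pig would use for it.
pig_refs = dict((name, "$%d" % index) for index, name in enumerate(headers))
```

Looking up `pig_refs["NAMELSAD10"]` and `pig_refs["DP0010001"]` gives the positions of the name and total population columns.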
![Page 20: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/20.jpg)
Pig Quick Start
pig -x local
grunt> ls
file:/data/CSA_2010Census_DP1 1.dbf<r 1> 841818
file:/data/CSA_2010Census_DP1.prj<r 1> 167
file:/data/CSA_2010Census_DP1.shp<r 1> 76180308
file:/data/CSA_2010Census_DP1.shx<r 1> 3596
file:/data/CSA_2010Census_DP1.tsv<r 1> 111224058
http://pig.apache.org/releases.html
https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
• Download Pig Distribution
• Untar package
• Start Pig in local mode
![Page 21: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/21.jpg)
Loading Data
grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
![Page 22: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/22.jpg)
Extracting Data
grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted_no_types = FOREACH csas GENERATE $2 AS name, $7 AS population;
grunt> describe extracted_no_types;
extracted_no_types: {name: bytearray,population: bytearray}
![Page 23: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/23.jpg)
Adding Schema
grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted = FOREACH csas GENERATE $2 AS name:chararray, $7 AS population:int;
grunt> describe extracted;
extracted: {name: chararray,population: int}
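For readers who think in Python, the typed projection can be pictured roughly as follows. This is only an analogy, not how Pig executes the statement: the `to_int` and `extract` helpers are hypothetical, the rows are made up, and returning None for a bad value is modelled on Pig yielding null when a cast fails.

```python
def to_int(value):
    """Cast to int, returning None when the text is not a valid integer
    (an assumption modelled on Pig producing null for a failed cast)."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def extract(rows):
    """Rough analogue of GENERATE $2 AS name:chararray, $7 AS population:int."""
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        yield (str(fields[2]), to_int(fields[7]))

# Made-up rows: only columns 2 and 7 carry meaningful values here.
rows = [
    "wkt\tid\tDallas-Fort Worth, TX CSA\t0\t0\t0\t0\t6731317\n",
    "wkt\tid\tExample CSA\t0\t0\t0\t0\tnot-a-number\n",
]
extracted = list(extract(rows))
```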
![Page 24: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/24.jpg)
Ordering
grunt> ordered = ORDER extracted BY population DESC;
grunt> dump ordered;
("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
("San Jose-San Francisco-Oakland, CA CSA",7468390)
("Dallas-Fort Worth, TX CSA",6731317)
("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)
![Page 25: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/25.jpg)
Storing Data
grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');
ls -a
.part-m-00035.crc .part-m-00115.crc .pig_header part-m-00077 part-m-00157
.part-m-00036.crc .part-m-00116.crc .pig_schema part-m-00078 part-m-00158
.part-m-00037.crc .part-m-00117.crc _SUCCESS part-m-00079 part-m-00159
.part-m-00038.crc .part-m-00118.crc part-m-00000 part-m-00080 part-m-00160
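Because the output is a directory of part files rather than one file, reading it back in Python takes a little glue. The sketch below is one possible approach, not part of the talk. `read_pig_output` is a hypothetical helper, and it assumes `.pig_header` holds a single delimiter-separated line of field names.

```python
import glob
import os

def read_pig_output(directory, delimiter="\t"):
    """Yield one dict per row from every part file in a Pig output directory."""
    # Assumption: .pig_header holds one delimiter-separated line of field names.
    with open(os.path.join(directory, ".pig_header")) as header_file:
        names = header_file.readline().rstrip("\n").split(delimiter)
    # The pattern skips hidden files such as the .crc checksums; sorting keeps
    # the zero-padded part files in order.
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path) as part_file:
            for line in part_file:
                yield dict(zip(names, line.rstrip("\n").split(delimiter)))
```

Calling `list(read_pig_output('extracted_data'))` would then give every stored row as a dictionary of strings keyed by field name.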
![Page 26: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/26.jpg)
Space Catalog Example
![Page 27: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/27.jpg)
Space Catalog
• 14,000+ objects in public catalog
• Use Two Line Element sets to propagate out positions and velocities
• Can generate over 100 million positions & velocities per day
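To get a feel for that scale, here is a back-of-the-envelope calculation. It takes the two round figures above at face value and assumes the points are spread evenly over objects and over the day, which the slide does not state.

```python
objects = 14000            # catalog size quoted above (a lower bound)
positions_per_day = 100e6  # daily output quoted above (a lower bound)
seconds_per_day = 24 * 60 * 60

# Points per object per day, and the spacing between them in seconds.
per_object_per_day = positions_per_day / objects
seconds_between_points = seconds_per_day / per_object_per_day
```

Under those assumptions each object gets roughly 7,100 points a day, or about one every 12 seconds.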
![Page 28: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/28.jpg)
Two Line Elements
ISS (ZARYA)
1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927
2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537
• Use Python script to convert to Pig friendly TSV
• Create Python UDF to parse TLE into parameters
• Use Python UDF with Java libraries to propagate out positions
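The first step, converting TLE text into a Pig-friendly TSV, could look something like the sketch below. It is not the script used in the talk. `tle_to_tsv_rows` is a hypothetical helper that assumes the input is a sequence of name / line 1 / line 2 triples.

```python
def tle_to_tsv_rows(lines):
    """Group name/line1/line2 triples into tab-separated rows."""
    # Drop blank lines and trailing whitespace so triples stay aligned.
    cleaned = [line.rstrip() for line in lines if line.strip()]
    if len(cleaned) % 3 != 0:
        raise ValueError("Expected groups of three lines (name, line 1, line 2)")
    rows = []
    for i in range(0, len(cleaned), 3):
        name, line1, line2 = cleaned[i:i + 3]
        rows.append("\t".join([name, line1, line2]))
    return rows

sample = [
    "ISS (ZARYA)\n",
    "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927\n",
    "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537\n",
]
rows = tle_to_tsv_rows(sample)
```

Each row then loads with PigStorage() as three chararray fields.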
![Page 29: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/29.jpg)
Python UDFs
• Easy way to extend Pig with new functions
• Uses Jython, which is at Python 2.5
• Cannot take advantage of libraries with C dependencies (e.g. numpy, scikits)
• Can use Java classes
![Page 30: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/30.jpg)
TLE parsing
def parse_tle_number(tle_number_string):
    # "Decimal assumed" fields such as -11606-4 mean -0.11606e-4
    split_string = tle_number_string.split('-')
    if len(split_string) == 3:
        # Leading '-' leaves an empty first element: negative mantissa, negative exponent
        new_number = '-0.' + str(split_string[1]) + 'e-' + str(int(split_string[2]))
    elif len(split_string) == 2:
        # Positive mantissa with a negative exponent
        new_number = '0.' + str(split_string[0]) + 'e-' + str(int(split_string[1]))
    elif len(split_string) == 1:
        # No exponent present
        new_number = '0.' + str(split_string[0])
    else:
        raise TypeError('Input is not in the TLE float format')
    return float(new_number)
54-61 BSTAR Drag (Decimal Assumed)
-11606-4
Full parser at https://gist.github.com/shawnhermans/4569360
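The "54-61" annotation refers to 1-based, inclusive character columns of TLE line 1. A short sketch of pulling a field out by column follows. The `columns` helper is made up for the example, and the sample line uses the ISS element set from earlier in the deck with its fixed-width spacing written out, which is an assumption about the original layout.

```python
# ISS line 1 with fixed-width spacing (69 characters).
line1 = "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927"

def columns(line, first, last):
    """Return the text in 1-based, inclusive columns first..last."""
    return line[first - 1:last]

satellite_number = columns(line1, 3, 7)   # catalog number field
bstar_field = columns(line1, 54, 61)      # BSTAR drag, decimal assumed
```

The extracted BSTAR text is then what a decimal-assumed number parser receives as input.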
![Page 31: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/31.jpg)
Simple UDF
import tleparser
@outputSchema("params:map[]")
def parseTle(name, line1, line2):
    params = tleparser.parse_tle(name, line1, line2)
    return params
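One practical wrinkle: as far as I know, Pig makes the outputSchema decorator available when it registers a Jython file, so the same file fails to import under plain CPython. A possible workaround for testing UDF logic outside Pig is a do-nothing fallback, sketched below. The `parseName` function is a hypothetical stand-in, not the parser from the talk.

```python
# Under plain CPython the decorator name does not exist, so fall back to a
# stand-in that records the schema and returns the function unchanged.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema  # keep the schema around for inspection
            return func
        return decorator

@outputSchema("params:map[]")
def parseName(name, line1, line2):
    # Hypothetical stand-in for a real parser: returns a one-entry map.
    return {"name": name.strip()}

result = parseName(" ISS (ZARYA) ", "", "")
```

With the fallback in place the function can be called from an ordinary unit test.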
![Page 32: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/32.jpg)
Extract Parameters
grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,epoch_year#2013,inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,mean_motion_ddot#0.0,eccentricity#5.354E-4,two_digit_year#13,international_designator#12053A,classification#U,epoch_day#17.78040066,satellite_number#38833,name#GPS BIIF-3 (PRN 24),mean_motion_dot#-1.8E-6,ra_of_asc_node#344.5315])
![Page 33: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/33.jpg)
Storing Results
grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
grunt> STORE parsed INTO 'propagated-csv' USING PigStorage(',','-schema');
![Page 34: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/34.jpg)
UDF with Java Import
from jsattrak.objects import SatelliteTleSGP4

@outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
    satellite = SatelliteTleSGP4(name, line1, line2)
    ecef_positions = []
    increment = (float(end_time) - float(start_time)) / float(number_of_points)
    current_time = start_time

    while current_time <= end_time:
        positions = [current_time]
        positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
        ecef_positions.append(tuple(positions))
        current_time += increment

    return ecef_positions
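The loop above builds its time steps by adding the increment repeatedly, which lets floating-point rounding accumulate from step to step. One alternative is to compute each time from its index, sketched below in plain Python. `time_grid` is a hypothetical helper and not part of the UDF shown above.

```python
def time_grid(start_time, end_time, number_of_points):
    """Return number_of_points + 1 evenly spaced times, both ends included."""
    start = float(start_time)
    span = float(end_time) - start
    steps = int(number_of_points)
    # Compute each time from its index so rounding error does not accumulate.
    return [start + span * i / steps for i in range(steps + 1)]

# Same start, end, and point count as the Pig call in this example.
times = time_grid(2454992.0, 2454993.0, 100)
```

The propagation loop could then iterate over the list instead of incrementing a running total.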
![Page 35: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/35.jpg)
Propagate Positions
grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2), myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
grunt> flattened = FOREACH propagated GENERATE params#'satellite_number', FLATTEN(propagated);
propagated: {params: map[],propagated: {positions: (time: double,x: double,y: double,z: double)}}
grunt> DESCRIBE flattened;
flattened: {bytearray,propagated::time: double,propagated::x: double, propagated::y: double,propagated::z: double}
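FLATTEN is the step that turns one row holding a bag of positions into one row per position. A plain-Python picture of that behavior follows, offered as an analogy only. The `flatten` helper and its sample values are made up.

```python
def flatten(rows):
    """Yield one output row per tuple in each row's bag."""
    for key, bag in rows:
        for item in bag:
            yield (key,) + tuple(item)

# Made-up values: one satellite with a bag of two (time, x, y, z) tuples.
rows = [(38833, [(1.0, 10.0, 20.0, 30.0), (2.0, 11.0, 21.0, 31.0)])]
flattened = list(flatten(rows))
```

A row whose bag is empty produces no output rows at all, which matches how Pig treats FLATTEN on an empty bag.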
![Page 36: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/36.jpg)
Result
(38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
(38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
(38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
(38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
(38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)
![Page 37: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/37.jpg)
Pig on Amazon EMR
![Page 38: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/38.jpg)
![Page 39: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/39.jpg)
![Page 40: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/40.jpg)
![Page 41: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/41.jpg)
![Page 42: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/42.jpg)
![Page 43: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/43.jpg)
Pig with EMR
![Page 44: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/44.jpg)
Pig with EMR
• SSH in to box to run interactive Pig session
• Load data to/from S3
• Run standalone Pig scripts on demand
![Page 45: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/45.jpg)
Conclusion
![Page 46: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/46.jpg)
Other Useful Tools
• Python-dateutil : Super-duper date parser
• Oozie : Hadoop workflow engine
• Piggybank and Elephant Bird : 3rd party pig libraries
• Chardet: Character detection library for Python
![Page 47: Pig and Python to Process Big Data](https://reader033.vdocuments.net/reader033/viewer/2022061223/54c631af4a7959991a8b461f/html5/thumbnails/47.jpg)
Parting Thoughts
• Great ETL tool/language
• Flexible enough to write general purpose MapReduce jobs
• Limited, but emerging 3rd party libraries
• Jython for UDFs is extremely limiting (Spark?)
Twitter: @shawnhermans
Email: [email protected]