dachis group pig hackday: pig 202

® 2011 Dachis Group. dachisgroup.com. Dachis Group Las Vegas 2012. Intermediate Pig Know How. Timothy Potter (Twitter: thelabdude). Pigout Hackday, Austin TX, May 11, 2012.

Author: thelabdude

Post on 27-Jan-2015






Slides for Pig 202 tutorial presented by Timothy Potter at DG Pig Hackday, May 11, 2012


  • 1. Intermediate Pig Know How. Timothy Potter (Twitter: thelabdude). Pigout Hackday, Austin TX, May 11, 2012.

2. Agenda

UFO Sightings Data Set
  1. Which US city has the most UFO sightings overall?
  2. What is the most common UFO shape within a 100 mile radius of your answer for #1?

Pig Mahout Example: Training 20 Newsgroups Classifier
  • Loading messages using a custom loader
  • Hashed Feature Vectors
  • Train Logistic Regression Model
  • Evaluate Model on held-out Data

3. UFO Sightings

  1. What US city has the most UFO sightings overall?
  2. What is the most common UFO shape within a 100 mile radius of your answer for #1?

Using Two Data Sets:
  • UFO sightings data set available from Infochimps
  • US city / states with geo-codes available from US Census

4. Load Sightings Data

    19930809  19990816  Westminster, CO  triangle  1 minute    A white puffy cottonball appeared and then a triangle ...
    20010111  20010113  Pueblo, CO       fireball  30 sec      Blue fireball lights up the skies of colorado and nebraska ...
    20001026  20030920  Aurora, CO       triangle  10 Minutes  Triangular craft (two footbal fields in size) As reported to Art Bell ...

    ufo_sightings = LOAD 'ufo/ufo_awesome.tsv' AS (
        sighted_at: chararray, reported_at: chararray,
        location: chararray, shape: chararray,
        duration: chararray, description: chararray
    );

Pig provides functions for doing basic text munging tasks, or use a UDF ...

    ufo_sightings_split_loc = FOREACH (
        FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL
    ) {
        split_city  = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 1);
        split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 3);
        city_lc  = (split_city IS NOT NULL ? LOWER(split_city) : null);
        state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null);
        GENERATE city_lc AS city, state_lc AS state, ...
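To see what the two REGEX_EXTRACT calls pull out of a location string, here is a minimal Java sketch of the same pattern. The class and method names (LocationSplitSketch, extract) are hypothetical and not from the slides, and it assumes the Pig function behaves like Matcher.find() on the trimmed input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the original slides.
class LocationSplitSketch {
    // Same pattern as the Pig script: group 1 is the city, group 3 the two-letter state.
    private static final Pattern CITY_STATE =
            Pattern.compile("([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})");

    /** Returns the lower-cased capture group, or null when the location does not match. */
    static String extract(String location, int group) {
        if (location == null) {
            return null;
        }
        Matcher m = CITY_STATE.matcher(location.trim());
        // Assumption: find() semantics, i.e. the pattern may match anywhere in the string.
        return m.find() ? m.group(group).toLowerCase() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("Westminster, CO", 1));          // city
        System.out.println(extract("Westminster, CO", 3));          // state
        System.out.println(extract("somewhere over the ocean", 1)); // no capital letter, so no match
    }
}
```

Locations that do not look like "City, ST" come back as null, which is why the Pig script guards the LOWER calls with IS NOT NULL checks.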
5. Load US Cities Data with geo-codes

    CO  0862000  02411501  Pueblo city       138930097  2034229   53.641  0.785  38.273147  -104.612378
    CO  0883835  02412237  Westminster city   81715203  5954681   31.550  2.299  39.882190  -105.064426
    CO  0804000  02409757  Aurora city       400759192  1806832  154.734  0.698  39.688002  -104.689740

    us_cities = LOAD 'dev/data/usa_cities_and_towns.tsv' AS (
        state: chararray, geoid: chararray,
        ansicode: chararray, name: chararray,
        ....
        latitude: double, longitude: double
    );

Use projection to select only the fields you want to work with: city, state, latitude, longitude.

    us_cities_w_geo = FOREACH us_cities {
        city_name = SUBSTRING(LOWER(name), 0, LAST_INDEX_OF(name, ' '));
        GENERATE TRIM(city_name) AS city, TRIM(LOWER(state)) AS state, latitude, longitude;
    };

6. What US city has the most UFO sightings overall?

Things to consider ...
  1. Need to select only sightings from US cities: join sightings data with US city data.
  2. Need to count sightings for each city: group results from step 1 by state/city and count.
  3. Need to do a TOP to get the city with the most sightings: descending sort on count and choose the top.
7. What US city has the most UFO sightings overall?

    ufo_sightings_with_geo = FOREACH (
        JOIN ufo_sightings_by_city BY (state, city),
             us_cities_w_geo BY (state, city) USING 'replicated'
    ) GENERATE
        ufo_sightings_by_city::state        AS state,
        ufo_sightings_by_city::city         AS city,
        ufo_sightings_by_city::sighted_at   AS sighted_at,
        ufo_sightings_by_city::sighted_year AS sighted_year,
        ufo_sightings_by_city::shape        AS shape,
        us_cities_w_geo::latitude           AS latitude,
        us_cities_w_geo::longitude          AS longitude;

Inner JOIN by (state,city) to attach geo-codes to sightings.

    grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state, city, latitude, longitude))
        GENERATE FLATTEN($0) AS (state, city, latitude, longitude), COUNT($1) AS the_count;

Group by (state,city) to get number of sightings for each city.

    most_freq = ORDER grp_by_state_city BY the_count DESC;
    top_city_state = LIMIT most_freq 1;
    DUMP top_city_state;

Poor man's TOP.

8. What US city has the most UFO sightings overall?

    (seattle,wa,446,light,47.620499,-122.350876)

Seattle only averages 58 sunny days a year. Coincidence? Maybe all the UFOs are coming to look at the Space Needle?
9. Pig Explain: Pull back the covers ...

    pig -x local -e explain -script ufo.pig

Job 1 - Mapper:

    ufo_sightings_with_geo = FOREACH (
        JOIN ufo_sightings_by_city BY (state, city),
             us_cities_w_geo BY (state, city) USING 'replicated'
    ) GENERATE
        ufo_sightings_by_city::state        AS state,
        ufo_sightings_by_city::city         AS city,
        ufo_sightings_by_city::sighted_at   AS sighted_at,
        ufo_sightings_by_city::sighted_year AS sighted_year,
        ufo_sightings_by_city::shape        AS shape,
        us_cities_w_geo::latitude           AS latitude,
        us_cities_w_geo::longitude          AS longitude;

Job 1 - Reducer:

    grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state, city, latitude, longitude))
        GENERATE FLATTEN($0) AS (state, city, latitude, longitude), COUNT($1) AS the_count;

Job 2 - Full Map/Reduce:

    most_freq = ORDER grp_by_state_city BY the_count DESC;
    top_city_state = LIMIT most_freq 1;
    DUMP top_city_state;

10. What is the most common UFO shape within a 100 mile radius of your answer for #1?

Things we need to solve this ...
  1) Some way to calculate geographical distance from a geographical location (lat / lng)
  2) Iterate over all cities that have sightings to get the distance from our centroid
  3) Filter by distance and count shapes

11. UDF: User Defined Function

    REGISTER some_path/my-ufo-app-1.0-SNAPSHOT.jar;
    DEFINE CalcGeoDistance com.dachisgroup.ufo.GeoDistance();
    ...
    with_distance = FOREACH calc_dist {
        GENERATE city, state,
            CalcGeoDistance(from_lat, from_lng, to_lat, to_lng) AS dist_in_miles;
    };

Let's build a UDF that uses the Haversine Formula to calculate distance between two points.
See: http://en.wikipedia.org/wiki/Haversine_formula
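The slides leave the Haversine arithmetic itself out, so the following is one possible standalone sketch of it in plain Java, with no Pig dependency. The class name HaversineSketch and the Earth-radius constant are assumptions for illustration; the argument order (lat1, lat2, lon1, lon2) is chosen to mirror the helper signature used in the deck. It is not the author's implementation.

```java
// Hypothetical standalone sketch of the distance maths; not the author's code.
class HaversineSketch {
    // Assumed mean Earth radius in miles (roughly 6371 km).
    static final double EARTH_RADIUS_MILES = 3958.8;

    /** Great-circle distance in miles between (lat1, lon1) and (lat2, lon2), given in degrees. */
    static double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        // a is the square of half the chord length between the two points.
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.pow(Math.sin(dLon / 2), 2);
        // c is the angular distance in radians.
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_MILES * c;
    }

    public static void main(String[] args) {
        // Pueblo, CO to Aurora, CO, using the coordinates from the cities sample rows.
        double miles = haversineDistanceInMiles(38.273147, 39.688002, -104.612378, -104.689740);
        System.out.printf("%.1f miles%n", miles);
    }
}
```

A method like this could back the body of a UDF's distance helper; keeping it free of Pig types makes it easy to unit test before wiring it into EvalFunc.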
12. UDF: User Defined Function

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class GeoDistance extends EvalFunc<Double> {
        public Double exec(Tuple input) throws IOException {
            // Return null unless all four coordinates are present.
            if (input == null || input.size() < 4 || input.isNull(0) || input.isNull(1)
                    || input.isNull(2) || input.isNull(3)) {
                return null;
            }
            Double dist = null;
            try {
                Double fromLat = (Double) input.get(0);
                Double fromLng = (Double) input.get(1);
                Double toLat = (Double) input.get(2);
                Double toLng = (Double) input.get(3);
                dist = haversineDistanceInMiles(fromLat, toLat, fromLng, toLng);
            } catch (Exception exc) {
                // better to return null than to throw exception
            }
            return dist;
        }

        protected double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
            // details excluded for brevity see http://www.movable-type.co.uk/scripts/latlong.html
            return dist;
        }
    }

13. What is the most common UFO shape ...

    top_city = FOREACH top_city_state GENERATE
        city, state, latitude AS from_lat, longitude AS from_lng;

    sighting_cities = FOREACH (GROUP ufo_sightings_with_geo BY (state, city, latitude, longitude))
        GENERATE FLATTEN($0) AS (state, city, latitude, longitude);

Including lat / lng in group by key to help reduce number of records I'm crossing.

    calc_dist = FOREACH (CROSS sighting_cities, top_city) GENERATE
        sighting_cities::city      AS city,
        sighting_cities::state     AS state,
        sighting_cities::latitude  AS to_lat,
        sighting_cities::longitude AS to_lng,
        CalcGeoDistance(top_city::from_lat, top_city::from_lng,
                        sighting_cities::latitude, sighting_cities::longitude) AS dist_in_miles;

Pig only supports equi-joins, so we need to use CROSS to get the lat / lng of the two points to calculate distance using our UDF.

    near = FILTER calc_dist BY dist_in_miles < 100;

    shapes = FOREACH (JOIN ufo_sightings_with_geo BY (state, city), near BY (state, city) USING 'replicated')
        GENERATE ufo_sightings_with_geo::shape AS shape;

    count_shapes = FOREACH (GROUP shapes BY shape)
        GENERATE $0 AS shape, COUNT($1) AS the_count;

    sorted_counts = ORDER count_shapes BY the_count DESC;

When joining, list the largest relation first and the smallest last, and optimize if possible, such as using replicated.

14. Visualize Results

In Pig:

    fs -getmerge sorted_counts sorted_counts.txt

In R: shapes