Who Wants to Be a Munger

Dana LSRC 2009 So Who Wants to Be a Munger? Friday, August 28, 2009

DESCRIPTION

Lonestar Ruby Conference presentation about the three step process to data munging.

TRANSCRIPT

Page 1: Who Wants To Be a Munger

Dana

LSRC 2009

So Who Wants to Be a Munger?

Friday, August 28, 2009

Page 2: Who Wants To Be a Munger

Who am I?

• Dana

• 8 years in corporate world

• Responsible for munging a massive amount of data every day

• Now develop Rails Applications for a living


Page 3: Who Wants To Be a Munger

Why is this important?

• We live in a data-driven society

• Companies feed on reports

• Clients have data and want ways to display it

• Important to know what data you have and what needs to happen with it

• The more you know about the final output, the easier you can manipulate the data


Page 4: Who Wants To Be a Munger

The Process


Page 5: Who Wants To Be a Munger

The Rule of 3: In - Munge - Out

• Read data into some construct

  • anything that understands each()

• Transform the data

• Output transformed data

  • some format that understands puts()


Page 6: Who Wants To Be a Munger

1 - Reading


Page 7: Who Wants To Be a Munger

A Basic Munging Script

open("new_numbers.txt", "w") do |f|     # the output file
  File.foreach("numbers.txt") do |n|    # the input file
    n.capitalize!                       # the transformation
    f.puts n
  end
end

Input (numbers.txt):

one
two
three
four
five

Output (new_numbers.txt):

One
Two
Three
Four
Five


Page 8: Who Wants To Be a Munger

Simplify

• Don’t confuse reading with munging

• May have to read various files for the same output

• Use Ruby’s each() and puts() methods to your advantage

# input:  pass some object to munge
# output: pass in another object to receive the output
def munge(input, output)
  input.each do |record|
    record.capitalize!
    output.puts record
  end
end


Page 9: Who Wants To Be a Munger

Why this is better

names = %w[dana james sarah storm gypsy]
stream = $stdout
munge(names, stream)

numbers = open("numbers.txt")
stream = open("new_numbers.txt", "w")
munge(numbers, stream)


Page 10: Who Wants To Be a Munger

each() and puts()

class Rubyist
  def each
    yield "i"
    yield "love"
    yield "ruby"
  end
end

class Speaker
  def puts(words)
    `say #{words}`
  end
end
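As a minimal sketch of why each() and puts() are the whole contract, the snippet below wires a hand-written source and a hand-written sink through a munge method like the one shown earlier. Collector is a hypothetical stand-in for Speaker so the example runs without the macOS say command, and capitalize is used instead of capitalize! so the string literals are not modified in place.

```ruby
# Source: any object that responds to each() and yields records.
class Rubyist
  def each
    yield "i"
    yield "love"
    yield "ruby"
  end
end

# Sink: any object that responds to puts(). Collector is a hypothetical
# stand-in for Speaker that stores records instead of speaking them.
class Collector
  attr_reader :lines

  def initialize
    @lines = []
  end

  def puts(words)
    @lines.push(words)
  end
end

# Same shape as the munge method from the earlier slide, but non-mutating.
def munge(input, output)
  input.each do |record|
    output.puts record.capitalize
  end
end

collector = Collector.new
munge(Rubyist.new, collector)
p collector.lines
```

Neither class inherits from anything in particular; munge only cares that the two method names exist.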


Page 11: Who Wants To Be a Munger

Reaching ultimate munging power

class Munger
  def initialize(input, output)
    @input = input
    @output = output
  end

  def munge
    @input.each do |record|
      munged = yield(record)
      @output.puts munged unless munged.nil?
    end
  end
end

m = Munger.new(open("numbers.txt"), open("new_numbers.txt", "w"))
m.munge do |n|
  n.strip!
  if n =~ /\At/i
    n.reverse
  elsif n == "four"
    nil
  else
    n.capitalize
  end
end
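To see what that block does without touching the file system, here is a small sketch that runs the same Munger against StringIO objects. The in-memory input and the use of strip instead of strip! are assumptions made for the example; the class itself is copied from the slide.

```ruby
require "stringio"

class Munger
  def initialize(input, output)
    @input = input
    @output = output
  end

  def munge
    @input.each do |record|
      munged = yield(record)
      # Returning nil from the block drops the record entirely.
      @output.puts munged unless munged.nil?
    end
  end
end

input = StringIO.new("one\ntwo\nthree\nfour\nfive\n")
output = StringIO.new

Munger.new(input, output).munge do |n|
  n = n.strip
  if n =~ /\At/i
    n.reverse       # words starting with "t" are reversed
  elsif n == "four"
    nil             # "four" is filtered out
  else
    n.capitalize    # everything else is capitalized
  end
end

print output.string
```

The block is the only part that changes from one munging job to the next; the reader and the writer can be swapped independently.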


Page 12: Who Wants To Be a Munger

Data

• Different kinds of data

  • Structured - record oriented data

  • Unstructured

    • Most difficult to work with

• Vast majority of data reading is pattern matching


Page 13: Who Wants To Be a Munger

SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6

Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
--------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
Salesperson 22 BILL PRICE
Customer 1014 KECK'S MEAT & FOODSERVICE
SA Sort Code 4.42 PORK RIBS
44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100
44-531-0 100/2.5 CU PRK RIB SOY 10 21 -52 14 31 -55
15-230 53/3 BB PUB BURGER 150 0 0 150k 0 0
3680 40/4 RB PRK WHLMSC HARKER 187 243 -23 412 405 2
3681 30/5.3 RB PRK WHLMS HARKR 207 162 28 378 243 56
3686 30/5.3 RB PRK WHLMS HARKR 27 45 -40 72 180 -60
3008 33/4.92 RB PRK HARKER 270 300 -10 580 600 -3
3010 25/6.4 RB PRK CNTRY HARKR 510 540 -6 1,000 1,080 -7
3402 40/4 RU PRK RIB PAT HARKR 0 0 0 0 40k -100
3403 51/3.14 RU PRK RIB HARKER 558 900 -38 1,008 1,170 -14
3404 40/4.14 RU PRK RIB HARKER 73k 1,052 -30 1,296 1,592 -19
---------- ---------- ------- ---------- ---------- -------
SA Sort Code subtotals 2,567 3,263 -21 6,260 5,703 9
SA Sort Code 19.1 WAFFLES
5018 36/5 KING B WAFFLES 10 10 0 10 14 -29
---------- ---------- ------- ---------- ---------- -------
SA Sort Code subtotals 10 10 0 10 14 -29
---------- ---------- ------- ---------- ---------- -------
SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7

Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
--------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
Customer subtotals 2,577 3,273 -21 6,270 5,717 9
---------- ---------- ------- ---------- ---------- -------
Salesperson subtotals 9,857 8,756 12 45,889 42,556 8
---------- ---------- ------- ---------- ---------- -------
Report Totals 15,008 13,225 13 75,896 72,359 4

Somewhere in between

hierarchical categories

headers

headers


Page 14: Who Wants To Be a Munger

require "munger"

class RossReader
  def initialize(file)
    @file = file
  end

  def each
    open(@file) do |report|
      report.each do |line|
        break if line =~ /\AReport Totals/
        next if line =~ /\A\s+\z/ or line =~ /\A\s+-/ or line =~ /\b(sub)?totals\b/i
        yield line
      end # report.each
    end # open
  end # def
end

report = Munger.new(RossReader.new("sample_report.txt"), open("ross_writer.txt", "w"))
report.munge do |n|
  n
end
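The three filters in each() can be tried on a handful of lines in isolation. This is only a sketch: data_lines is a hypothetical helper, and the sample lines are shortened, hand-made stand-ins for the report, with leading spaces added where the regular expressions expect indentation.

```ruby
# Hypothetical helper applying the same break/next rules as RossReader#each.
def data_lines(report)
  kept = []
  report.each do |line|
    break if line =~ /\AReport Totals/       # nothing useful after the grand total
    next if line =~ /\A\s+\z/ ||             # blank lines
            line =~ /\A\s+-/ ||              # indented dashed rules
            line =~ /\b(sub)?totals\b/i      # subtotal and total lines
    kept.push(line)
  end
  kept
end

sample = [
  "44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100\n",
  "   ---------- ----------\n",
  "   SA Sort Code subtotals 2,567 3,263\n",
  "\n",
  "5018 36/5 KING B WAFFLES 10 10 0 10 14 -29\n",
  "Report Totals 15,008 13,225\n",
  "this line is never reached\n"
]

kept = data_lines(sample)
puts kept
```

Because break is checked first, the "Report Totals" line ends the loop before the totals filter ever sees it.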


Page 15: Who Wants To Be a Munger

SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6
Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
--------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
Salesperson 22 BILL PRICE
Customer 1014 KECK'S MEAT & FOODSERVICE
SA Sort Code 4.42 PORK RIBS
44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100
44-531-0 100/2.5 CU PRK RIB SOY 10 21 -52 14 31 -55
15-230 53/3 BB PUB BURGER 150 0 0 150k 0 0
3680 40/4 RB PRK WHLMSC HARKER 187 243 -23 412 405 2
3681 30/5.3 RB PRK WHLMS HARKR 207 162 28 378 243 56
3686 30/5.3 RB PRK WHLMS HARKR 27 45 -40 72 180 -60
3008 33/4.92 RB PRK HARKER 270 300 -10 580 600 -3
3010 25/6.4 RB PRK CNTRY HARKR 510 540 -6 1,000 1,080 -7
3402 40/4 RU PRK RIB PAT HARKR 0 0 0 0 40k -100
3403 51/3.14 RU PRK RIB HARKER 558 900 -38 1,008 1,170 -14
3404 40/4.14 RU PRK RIB HARKER 73k 1,052 -30 1,296 1,592 -19
SA Sort Code 19.1 WAFFLES
5018 36/5 KING B WAFFLES 10 10 0 10 14 -29
SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7
Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
--------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------


Page 16: Who Wants To Be a Munger

Ugly Headers

SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6

Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
--------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
Salesperson 22 BILL PRICE
Customer 1014 KECK'S MEAT & FOODSERVICE
SA Sort Code 4.42 PORK RIBS
SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7

Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
--------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------


Page 17: Who Wants To Be a Munger

unpack()

• Designed for breaking up binary data

• Very handy for this kind of fixed-width work

• unpack() takes in a format string

• You describe what the data looks like

• “a” means ascii character

• “x” means skip

"cookies and cream".unpack("a7xa3xa5")
# => ["cookies", "and", "cream"]

"--- --- -----".split.map { |d| "a#{d.length}" }.join("x")
# => "a3xa3xa5"
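Putting the two unpack() snippets together, a rough sketch of the fixed-width trick looks like this: build the format string from a dashed ruler, then apply it to a data row. The column widths and the sample row are made up for the example, and Kernel#format is used only to guarantee the row lines up with the ruler.

```ruby
# Made-up column widths; in the report they come from the dashed header rule.
widths = [10, 5, 3]
ruler = widths.map { |w| "-" * w }.join(" ")

# A data row padded to the same widths, with one space between columns.
row = format("%-10s %5s %3s", "PORK RIBS", "187", "-23")

# Each run of dashes becomes "aN" (take N characters); "x" skips the gap.
layout = ruler.split.map { |dashes| "a#{dashes.length}" }.join("x")

# unpack keeps the padding, so strip each field afterwards.
fields = row.unpack(layout).map(&:strip)

p layout
p fields
```

Deriving the format from the ruler means the reader keeps working if a column width changes in a later version of the report.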

Page 18: Who Wants To Be a Munger

def initialize(file)
  @file = file
  @headers = nil
  @format = nil
end

def each
  open(@file) do |report|
    parse_header(Array.new(4) { report.gets })
    report.each do |line|
      ...
    end # report.each
  end # open
end # def

def parse_header(headers)
  @format = headers[3].split.map { |col| "a#{col.size}" }.join("x")
  @headers = headers[2].unpack(@format).map { |f| f.strip }
end


Page 19: Who Wants To Be a Munger

def initialize(file)
  @file = file
  @in_header = false
  @headers = nil
  @format = nil
end

def each
  open(@file) do |report|
    parse_header(Array.new(4) { report.gets })
    report.each do |line|
      if line =~ /\ASAA_R/
        @in_header = true
      elsif @in_header
        @in_header = false if line =~ /\A-/
      else
        ...
      end
    end # report.each
  end # open
end # def
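The @in_header flag is a tiny state machine: a line starting with SAA_R opens a page header, and the dashed rule closes it. A stand-alone sketch of that logic, using a hypothetical skip_page_headers helper and shortened sample lines, could look like this:

```ruby
# Hypothetical helper: returns only the lines outside repeated page headers.
def skip_page_headers(lines)
  in_header = false
  kept = []
  lines.each do |line|
    if line =~ /\ASAA_R/                  # first line of a page header
      in_header = true
    elsif in_header
      in_header = false if line =~ /\A-/  # the dashed rule ends the header
    else
      kept.push(line)
    end
  end
  kept
end

lines = [
  "3404 40/4.14 RU PRK RIB HARKER\n",
  "SAA_R_009 26-Mar-2009 15:26 Page 7\n",
  "\n",
  "Part Code Description\n",
  "--------------- ----------\n",
  "5018 36/5 KING B WAFFLES\n"
]

kept = skip_page_headers(lines)
puts kept
```

The dashed rule itself is consumed by the elsif branch, so it never reaches the data-handling code.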


Page 20: Who Wants To Be a Munger

Salesperson 22 BILL PRICE
Customer 1014 KECK'S MEAT & FOODSERVICE
SA Sort Code 4.42 PORK RIBS
44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100
44-531-0 100/2.5 CU PRK RIB SOY 10 21 -52 14 31 -55
15-230 53/3 BB PUB BURGER 150 0 0 150k 0 0
3680 40/4 RB PRK WHLMSC HARKER 187 243 -23 412 405 2
3681 30/5.3 RB PRK WHLMS HARKR 207 162 28 378 243 56
3686 30/5.3 RB PRK WHLMS HARKR 27 45 -40 72 180 -60
3008 33/4.92 RB PRK HARKER 270 300 -10 580 600 -3
3010 25/6.4 RB PRK CNTRY HARKR 510 540 -6 1,000 1,080 -7
3402 40/4 RU PRK RIB PAT HARKR 0 0 0 0 40k -100
3403 51/3.14 RU PRK RIB HARKER 558 900 -38 1,008 1,170 -14
3404 40/4.14 RU PRK RIB HARKER 73k 1,052 -30 1,296 1,592 -19
SA Sort Code 19.1 WAFFLES
5018 36/5 KING B WAFFLES 10 10 0 10 14 -29


Page 21: Who Wants To Be a Munger

assoc()

• lookup method

• call it on an array of arrays

• pass in the data you want to look up

• walks through the outer array and returns the first inner array that starts with the argument

• slower than a hash - don’t use on LARGE amounts of data

• assoc() becomes a poor man’s ordered hash

names = [["James", "Gray"], ["Dana", "Gray"]]
p names.assoc("James")
# => ["James", "Gray"]


Page 22: Who Wants To Be a Munger

def initialize(file)
  ...
  @categories = []
end

def each
  open(@file) do |report|
    ...
        if line =~ /\A\s+(\w[\w\s]+?)\s+(\d.+?)\s+\z/
          if cat = @categories.assoc($1)
            cat[-1] = $2
          else
            @categories << [$1, $2]
          end
        else
          yield @headers.zip(line.unpack(@format).map { |f| f.strip }) + @categories
        end
      end
    end # report.each
  end # open
end # def
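The category branch is an "update or append" on an array of pairs, and the record is the zipped header/value pairs plus the current categories. A minimal sketch of that idea follows; remember_category is a hypothetical helper, the headers are truncated to three columns, and the dup calls are an addition so that earlier records do not change when a category is later updated in place.

```ruby
# Hypothetical helper mirroring the category branch: update the pair if the
# name is already known, otherwise append a new pair.
def remember_category(categories, name, value)
  if (cat = categories.assoc(name))
    cat[-1] = value
  else
    categories.push([name, value])
  end
end

headers = ["Part Code", "Description", "Qty Period"]
categories = []

remember_category(categories, "Salesperson", "22 BILL PRICE")
remember_category(categories, "SA Sort Code", "4.42 PORK RIBS")
first = headers.zip(["44-531", "53/3 CU PRK RIB SOY", "0"]) + categories.map(&:dup)

# A new SA Sort Code replaces the old one instead of adding a third category.
remember_category(categories, "SA Sort Code", "19.1 WAFFLES")
second = headers.zip(["5018", "36/5 KING B WAFFLES", "10"]) + categories.map(&:dup)

p first
p second
```

Without the copies, both records would share the same inner category arrays, which is harmless only as long as each record is written out before the next category line is read.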


Page 23: Who Wants To Be a Munger

[["Part Code", "44-531"], ["Description", "53/3 CU PRK RIB SOY"], ["Qty Period", "0"], ["QTY LastYr", "0"], ["Var", "0"], ["Lbs Period", "0"], ["Lbs LastYr", "2"], ["Var", "-100"], ["Salesperson", "22 BILL PRICE"], ["Customer", "1014 KECK'S MEAT & FOODSERVICE"], ["SA Sort Code", "4.42 PORK RIBS"]]

[["Part Code", "44-531-0"], ["Description", "100/2.5 CU PRK RIB SOY"], ["Qty Period", "10"], ["QTY LastYr", "21"], ["Var", "-52"], ["Lbs Period", "14"], ["Lbs LastYr", "31"], ["Var", "-55"], ["Salesperson", "22 BILL PRICE"], ["Customer", "1014 KECK'S MEAT & FOODSERVICE"], ["SA Sort Code", "4.42 PORK RIBS"]]

...

[["Part Code", "5018"], ["Description", "36/5 KING B WAFFLES"], ["Qty Period", "10"], ["QTY LastYr", "10"], ["Var", "0"], ["Lbs Period", "10"], ["Lbs LastYr", "14"], ["Var", "-29"], ["Salesperson", "22 BILL PRICE"], ["Customer", "1014 KECK'S MEAT & FOODSERVICE"], ["SA Sort Code", "19.1 WAFFLES"]]


Page 24: Who Wants To Be a Munger

class RossReader
  def initialize(file)
    @file = file
    @in_header = false
    @headers = nil
    @format = nil
    @categories = []
  end

  def parse_header(headers)
    @format = headers[3].split.map { |col| "a#{col.size}" }.join("x")
    @headers = headers[2].unpack(@format).map { |f| f.strip }
  end

  def each
    open(@file) do |report|
      parse_header(Array.new(4) { report.gets })
      report.each do |line|
        if line =~ /\ASAA_R/
          @in_header = true
        elsif @in_header
          @in_header = false if line =~ /\A-/
        else
          break if line =~ /\AReport Totals/
          next if line =~ /\A\s+\z/ or line =~ /\A\s+-/ or line =~ /\b(sub)?totals\b/i
          if line =~ /\A\s+(\w[\w\s]+?)\s+(\d.+?)\s+\z/
            if cat = @categories.assoc($1)
              cat[-1] = $2
            else
              @categories << [$1, $2]
            end
          else
            yield @headers.zip(line.unpack(@format).map { |f| f.strip }) + @categories
          end
        end
      end # report.each
    end # open
  end # def
end


Page 25: Who Wants To Be a Munger

2 - Writing


Page 26: Who Wants To Be a Munger

require "rubygems"
require "faster_csv"

class CSVWriter
  def initialize
    @headers = nil
  end

  def puts(record)
    if @headers.nil?
      @headers = record.map { |field| field.first }
      FCSV { |csv| csv << @headers }
    end
    FCSV { |csv| csv << record.map { |field| field.last } }
  end
end
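FasterCSV became Ruby's standard csv library in Ruby 1.9, so on a current Ruby the same writer can be sketched without the gem. This is an adaptation rather than the code from the talk: the io parameter is an assumption added so the output can be captured (the original always wrote to standard output), and CSV.generate_line stands in for the FCSV block.

```ruby
require "csv"
require "stringio"

# Sketch of the slide's CSVWriter on top of the standard csv library.
class CSVWriter
  def initialize(io = $stdout)
    @io = io          # assumption: configurable destination
    @headers = nil
  end

  # Each record is an array of [header, value] pairs.
  def puts(record)
    if @headers.nil?
      # The first record also supplies the header row.
      @headers = record.map(&:first)
      @io.write(CSV.generate_line(@headers))
    end
    @io.write(CSV.generate_line(record.map(&:last)))
  end
end

out = StringIO.new
writer = CSVWriter.new(out)
writer.puts([["Part Code", "3010"], ["Description", "25/6.4 RB PRK CNTRY HARKR"], ["Lbs LastYr", "1,080"]])
writer.puts([["Part Code", "5018"], ["Description", "36/5 KING B WAFFLES"], ["Lbs LastYr", "14"]])

print out.string
```

Note that the library quotes "1,080" because it contains a comma, which is one reason the munging step strips thousands separators before writing.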


Page 27: Who Wants To Be a Munger

3 - Munging


Page 28: Who Wants To Be a Munger

require "munger"
require "ross_reader"
require "csv_writer"

report = Munger.new(RossReader.new(ARGV.shift), CSVWriter.new)
report.munge do |record|
  record.each do |field|
    if field.last =~ /\A(?:\d+,)+\d+k?\z/
      field.last.delete!(",")
    end
    field.last.sub!(/\A\d+k\z/) { |num| num.to_i * 1000 }
  end
  record
end
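The two clean-up rules in that block (drop thousands separators, expand a trailing "k") are easier to check when pulled into a function. normalize_number below is a hypothetical helper that returns a new string instead of mutating the field, but it uses the same regular expressions.

```ruby
# Hypothetical helper: same rules as the munge block, without mutation.
def normalize_number(value)
  # "1,080" and "1,234k" lose their commas; part codes like "44-531" do not match.
  cleaned = value =~ /\A(?:\d+,)+\d+k?\z/ ? value.delete(",") : value
  # "150k" becomes "150000"; anything else passes through unchanged.
  cleaned.sub(/\A(\d+)k\z/) { ($1.to_i * 1000).to_s }
end

samples = ["1,080", "150k", "1,234k", "-100", "44-531", "53/3 CU PRK RIB SOY"]
results = samples.map { |s| normalize_number(s) }
p results
```

Anchoring both patterns with \A and \z is what keeps descriptions and part codes that merely contain digits from being rewritten.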


Page 29: Who Wants To Be a Munger

So what can I do with all this?

• Output your data into a spreadsheet such as Excel

• Open the data in your text editor

• Import the data into a database

• Let’s see it in action


Page 30: Who Wants To Be a Munger

Examples


Page 31: Who Wants To Be a Munger

require "munger"
require "rubygems"
require "faster_csv"
require "active_record"

class DBWriter
  def initialize(model, path = "db.sqlite")
    ActiveRecord::Base.establish_connection(
      :adapter => "sqlite3",
      :database => path
    )
    @model = model
  end

  def puts(record)
    @model.create!(record)
  end
end

class PartCode < ActiveRecord::Base
end

unless File.exist? "db.sqlite"
  class CreatePartCodes < ActiveRecord::Migration
    def self.up
      create_table :part_codes do |t|
        t.string :part_code
        t.string :description
        t.integer :qty_period
        t.integer :qty_lastyr
        t.integer :qty_var
        t.integer :lbs_period
        t.integer :lbs_lastyr
        t.integer :lbs_var
        t.string :salesperson
        t.string :customer
        t.string :sa_sort_code
      end
    end

    def self.down
      drop_table :part_codes
    end
  end
end


Page 32: Who Wants To Be a Munger

reader = FCSV($stdin, :headers => true, :header_converters => :symbol)
writer = DBWriter.new(PartCode)
CreatePartCodes.up if defined? CreatePartCodes
m = Munger.new(reader, writer)
m.munge do |row|
  row.to_hash
end
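The reason row.to_hash can be handed straight to create! is the :symbol header converter, which turns the CSV headers into column-style keys. A small sketch with the standard csv library (the successor to FasterCSV) and a made-up three-column input shows the conversion; no database is involved here.

```ruby
require "csv"
require "stringio"

# Made-up CSV in the shape the CSV writer produces, trimmed to three columns.
text = "Part Code,QTY LastYr,Lbs Var\n44-531-0,21,-55\n"

# CSV.new wraps any IO-like object and responds to each(), so it can serve
# as a reader for the munger just like a file or an array.
reader = CSV.new(StringIO.new(text), headers: true, header_converters: :symbol)

rows = reader.map(&:to_h)
p rows
```

Headers such as "QTY LastYr" come out as :qty_lastyr, which lines up with the column names in the migration.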


Page 33: Who Wants To Be a Munger

Congratulations!

You, too, are now a Munger!
