who wants to be a munger
DESCRIPTION
Lonestar Ruby Conference presentation about the three-step process of data munging.
TRANSCRIPT
Dana
LSRC 2009
So Who Wants to Be a Munger?
Friday, August 28, 2009
Who am I?
• Dana
• 8 years in corporate world
• Responsible for munging a massive amount of data every day
• Now develop Rails Applications for a living
Why is this important?
• We live in a data-driven society
• Companies feed on reports
• Clients have data and want ways to display it
• Important to know what data you have and what needs to happen with it
• The more you know about the final output, the easier you can manipulate the data
The Process
The Rule of 3: In - Munge - Out
• Read data into some construct
• anything that understands each()
• Transform the data
• Output transformed data
• some format that understands puts()
1 - Reading
A Basic Munging Script
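
    open("new_numbers.txt", "w") do |f|     # the output file
      File.foreach("numbers.txt") do |n|    # the input file
        n.capitalize!                       # the transformation
        f.puts n
      end
    end

numbers.txt (input): one, two, three, four, five (one word per line)
new_numbers.txt (output): One, Two, Three, Four, Five (one word per line)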
Simplify
• Don’t confuse reading with munging
• May have to read various files for the same output
• Use Ruby’s each() and puts() methods to your advantage

    # input:  pass some object to munge
    # output: pass out another object as output
    def munge(input, output)
      input.each do |record|
        record.capitalize!
        output.puts record
      end
    end
Why this is better

    names  = %w[dana james sarah storm gypsy]
    stream = $stdout
    munge(names, stream)

    numbers = open("numbers.txt")
    stream  = open("new_numbers.txt", "w")
    munge(numbers, stream)
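
Because munge() only asks for each() and puts(), in-memory objects work as well as files. The sketch below is an illustration rather than part of the talk: it assumes StringIO from Ruby's standard library as a stand-in for both files, and it uses capitalize instead of capitalize! so the input records are left untouched.

```ruby
require "stringio"

# Same shape as the munge() method on the slide, but non-destructive:
# capitalize returns a new string instead of changing the record in place.
def munge(input, output)
  input.each do |record|
    output.puts record.capitalize
  end
end

input  = StringIO.new("one\ntwo\nthree\n")  # responds to each()
output = StringIO.new                       # responds to puts()
munge(input, output)

result = output.string  # everything that was "written"
```

This is handy for trying out a transformation without touching the file system.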
each() and puts()

    class Rubyist
      def each
        yield "i"
        yield "love"
        yield "ruby"
      end
    end

    class Speaker
      def puts(words)
        `say #{words}`
      end
    end
Reaching ultimate munging power

    class Munger
      def initialize(input, output)
        @input  = input
        @output = output
      end

      def munge
        @input.each do |record|
          munged = yield(record)
          @output.puts munged unless munged.nil?
        end
      end
    end

    m = Munger.new(open("numbers.txt"), open("new_numbers.txt", "w"))
    m.munge do |n|
      n.strip!
      if n =~ /\At/i
        n.reverse
      elsif n == "four"
        nil
      else
        n.capitalize
      end
    end
Data
• Different kinds of data
• Structured - record oriented data
• Unstructured
• Most difficult to work with
• Vast majority of data reading is pattern matching
Somewhere in between

    SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6

    Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
    --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
      Salesperson 22 BILL PRICE
      Customer 1014 KECK'S MEAT & FOODSERVICE
      SA Sort Code 4.42 PORK RIBS
    44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100
    44-531-0 100/2.5 CU PRK RIB SOY 10 21 -52 14 31 -55
    15-230 53/3 BB PUB BURGER 150 0 0 150k 0 0
    3680 40/4 RB PRK WHLMSC HARKER 187 243 -23 412 405 2
    3681 30/5.3 RB PRK WHLMS HARKR 207 162 28 378 243 56
    3686 30/5.3 RB PRK WHLMS HARKR 27 45 -40 72 180 -60
    3008 33/4.92 RB PRK HARKER 270 300 -10 580 600 -3
    3010 25/6.4 RB PRK CNTRY HARKR 510 540 -6 1,000 1,080 -7
    3402 40/4 RU PRK RIB PAT HARKR 0 0 0 0 40k -100
    3403 51/3.14 RU PRK RIB HARKER 558 900 -38 1,008 1,170 -14
    3404 40/4.14 RU PRK RIB HARKER 73k 1,052 -30 1,296 1,592 -19
      ---------- ---------- ------- ---------- ---------- -------
      SA Sort Code subtotals 2,567 3,263 -21 6,260 5,703 9
      SA Sort Code 19.1 WAFFLES
    5018 36/5 KING B WAFFLES 10 10 0 10 14 -29
      ---------- ---------- ------- ---------- ---------- -------
      SA Sort Code subtotals 10 10 0 10 14 -29
      ---------- ---------- ------- ---------- ---------- -------
    SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7

    Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
    --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
      Customer subtotals 2,577 3,273 -21 6,270 5,717 9
      ---------- ---------- ------- ---------- ---------- -------
      Salesperson subtotals 9,857 8,756 12 45,889 42,556 8
      ---------- ---------- ------- ---------- ---------- -------
    Report Totals 15,008 13,225 13 75,896 72,359 4

Slide callouts: headers (one per page), hierarchical categories

    require "munger"

    class RossReader
      def initialize(file)
        @file = file
      end

      def each
        open(@file) do |report|
          report.each do |line|
            break if line =~ /\AReport Totals/
            next if line =~ /\A\s+\z/ or line =~ /\A\s+-/ or line =~ /\b(sub)?totals\b/i
            yield line
          end # report.each
        end # open
      end # def
    end

    report = Munger.new(RossReader.new("sample_report.txt"),
                        open("ross_writer.txt", "w"))
    report.munge do |n|
      n
    end
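
One way to see what each pattern in RossReader#each is for is to run them over a few made-up lines. This is only a sketch: the classify helper and the sample lines are hypothetical, and only the regular expressions come from the slide.

```ruby
# Hypothetical helper: names the reason a line would be dropped, or :keep.
# The patterns are the ones used in RossReader#each, checked in the same order.
def classify(line)
  return :stop      if line =~ /\AReport Totals/       # end of useful data
  return :blank     if line =~ /\A\s+\z/               # whitespace-only line
  return :separator if line =~ /\A\s+-/                # indented row of dashes
  return :subtotal  if line =~ /\b(sub)?totals\b/i     # any subtotal/total row
  :keep
end

samples = {
  "Report Totals 15,008 13,225\n"       => :stop,
  "   \n"                               => :blank,
  "   ---------- ---------- -------\n"  => :separator,
  "  SA Sort Code subtotals 10 10 0\n"  => :subtotal,
  "5018 36/5 KING B WAFFLES 10 10 0\n"  => :keep,
  "--------------- ----------\n"        => :keep   # header rule starts in column 0
}

results = samples.keys.map { |line| classify(line) }
```

The last sample shows why the page headers survive this first pass: their row of dashes has no leading whitespace, so none of the patterns match it.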

    SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6
    Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
    --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
      Salesperson 22 BILL PRICE
      Customer 1014 KECK'S MEAT & FOODSERVICE
      SA Sort Code 4.42 PORK RIBS
    44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100
    44-531-0 100/2.5 CU PRK RIB SOY 10 21 -52 14 31 -55
    15-230 53/3 BB PUB BURGER 150 0 0 150k 0 0
    3680 40/4 RB PRK WHLMSC HARKER 187 243 -23 412 405 2
    3681 30/5.3 RB PRK WHLMS HARKR 207 162 28 378 243 56
    3686 30/5.3 RB PRK WHLMS HARKR 27 45 -40 72 180 -60
    3008 33/4.92 RB PRK HARKER 270 300 -10 580 600 -3
    3010 25/6.4 RB PRK CNTRY HARKR 510 540 -6 1,000 1,080 -7
    3402 40/4 RU PRK RIB PAT HARKR 0 0 0 0 40k -100
    3403 51/3.14 RU PRK RIB HARKER 558 900 -38 1,008 1,170 -14
    3404 40/4.14 RU PRK RIB HARKER 73k 1,052 -30 1,296 1,592 -19
      SA Sort Code 19.1 WAFFLES
    5018 36/5 KING B WAFFLES 10 10 0 10 14 -29
    SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7
    Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
    --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
Ugly Headers

    SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6

    Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
    --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
      Salesperson 22 BILL PRICE
      Customer 1014 KECK'S MEAT & FOODSERVICE
      SA Sort Code 4.42 PORK RIBS
    SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7

    Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
    --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
unpack()
• Designed for breaking up binary data
• Very handy for this kind of fixed-width work
• unpack() takes in a format string
• You describe what the data looks like
• “a” means an ASCII character (“a7” reads seven of them)
• “x” means skip one character

    "cookies and cream".unpack("a7xa3xa5")
    # => ["cookies", "and", "cream"]

    "--- --- -----".split.map { |d| "a#{d.length}" }.join("x")
    # => "a3xa3xa5"
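
Putting the two unpack() examples together: the row of dashes under a header can describe the column widths for every data row. The sketch below uses a made-up three-column ruler and row (not the real report) and builds the row with ljust/rjust so the widths are explicit.

```ruby
# A made-up ruler line: three columns of width 10, 5 and 3.
ruler = "---------- ----- ---"

# Build a matching fixed-width row: text left-aligned, numbers right-aligned,
# one space between columns.
row = "PORK RIBS".ljust(10) + " " + "270".rjust(5) + " " + "-3".rjust(3)

# Each run of dashes becomes "a<width>"; the single spaces become "x" (skip).
format = ruler.split.map { |dashes| "a#{dashes.length}" }.join("x")

# unpack keeps the padding, so strip each field afterwards.
fields = row.unpack(format).map { |field| field.strip }
```

Here format comes out as "a10xa5xa3" and fields as ["PORK RIBS", "270", "-3"].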

    def initialize(file)
      @file    = file
      @headers = nil
      @format  = nil
    end

    def each
      open(@file) do |report|
        parse_header(Array.new(4) { report.gets })
        report.each do |line|
          ...
        end # report.each
      end # open
    end # def

    def parse_header(headers)
      @format  = headers[3].split.map { |col| "a#{col.size}" }.join("x")
      @headers = headers[2].unpack(@format).map { |f| f.strip }
    end

    def initialize(file)
      @file      = file
      @in_header = false
      @headers   = nil
      @format    = nil
    end

    def each
      open(@file) do |report|
        parse_header(Array.new(4) { report.gets })
        report.each do |line|
          if line =~ /\ASAA_R/
            @in_header = true
          elsif @in_header
            @in_header = false if line =~ /\A-/
          else
            ...
          end
        end # report.each
      end # open
    end # def
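
The @in_header flag is a small state machine: a line starting with SAA_R switches it on, and the row of dashes that ends the header switches it off again. A minimal sketch of the same idea on a tiny made-up report (StringIO stands in for the file; the two-column layout is an assumption for illustration):

```ruby
require "stringio"

# A tiny, made-up two-page report with the same 4-line header shape.
report = StringIO.new(<<~REPORT)
  SAA_R_009 Page 6

  Part Code Qty
  --------- ---
  44-531     10
  SAA_R_009 Page 7

  Part Code Qty
  --------- ---
  5018        3
REPORT

4.times { report.gets }  # consume the first page header up front

in_header = false
data = []
report.each do |line|
  if line =~ /\ASAA_R/
    in_header = true                        # a new page header begins
  elsif in_header
    in_header = false if line =~ /\A-/      # the dashes end the header
  else
    data << line.chomp                      # everything else is data
  end
end
```

Only the two data rows end up in data; the repeated header on page 7 is skipped.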

      Salesperson 22 BILL PRICE
      Customer 1014 KECK'S MEAT & FOODSERVICE
      SA Sort Code 4.42 PORK RIBS
    44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100
    44-531-0 100/2.5 CU PRK RIB SOY 10 21 -52 14 31 -55
    15-230 53/3 BB PUB BURGER 150 0 0 150k 0 0
    3680 40/4 RB PRK WHLMSC HARKER 187 243 -23 412 405 2
    3681 30/5.3 RB PRK WHLMS HARKR 207 162 28 378 243 56
    3686 30/5.3 RB PRK WHLMS HARKR 27 45 -40 72 180 -60
    3008 33/4.92 RB PRK HARKER 270 300 -10 580 600 -3
    3010 25/6.4 RB PRK CNTRY HARKR 510 540 -6 1,000 1,080 -7
    3402 40/4 RU PRK RIB PAT HARKR 0 0 0 0 40k -100
    3403 51/3.14 RU PRK RIB HARKER 558 900 -38 1,008 1,170 -14
    3404 40/4.14 RU PRK RIB HARKER 73k 1,052 -30 1,296 1,592 -19
      SA Sort Code 19.1 WAFFLES
    5018 36/5 KING B WAFFLES 10 10 0 10 14 -29
assoc()
• lookup method
• call it on an array of arrays
• pass in the data you want to look up
• walks through the outer array and returns the first inner array that starts with the argument
• slower than a hash - don’t use on LARGE amounts of data
• assoc() becomes a poor man’s ordered hash

    names = [["James", "Gray"], ["Dana", "Gray"]]
    p names.assoc("James")
    # => ["James", "Gray"]
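
The "poor man's ordered hash" idea can be sketched as follows: update the pair in place if the key is already there, otherwise append it, so the keys keep the order in which they were first seen. The remember helper is a hypothetical name used only for this illustration.

```ruby
# Hypothetical helper: store name => value in an array of pairs,
# keeping the order in which names were first seen.
def remember(categories, name, value)
  if (pair = categories.assoc(name))
    pair[-1] = value            # known key: overwrite the value in place
  else
    categories << [name, value] # new key: append at the end
  end
end

categories = []
remember(categories, "Salesperson",  "22 BILL PRICE")
remember(categories, "SA Sort Code", "4.42 PORK RIBS")
remember(categories, "SA Sort Code", "19.1 WAFFLES")   # replaces the old value
```

After these calls, categories holds two pairs, with "SA Sort Code" now pointing at "19.1 WAFFLES".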

    def initialize(file)
      ...
      @categories = []
    end

    def each
      open(@file) do |report|
        ...
            if line =~ /\A\s+(\w[\w\s]+?)\s+(\d.+?)\s+\z/
              if cat = @categories.assoc($1)
                cat[-1] = $2
              else
                @categories << [$1, $2]
              end
            else
              yield @headers.zip(line.unpack(@format).map { |f| f.strip }) +
                    @categories
            end
          end
        end # report.each
      end # open
    end # def

    [["Part Code", "44-531"], ["Description", "53/3 CU PRK RIB SOY"],
     ["Qty Period", "0"], ["QTY LastYr", "0"], ["Var", "0"],
     ["Lbs Period", "0"], ["Lbs LastYr", "2"], ["Var", "-100"],
     ["Salesperson", "22 BILL PRICE"],
     ["Customer", "1014 KECK'S MEAT & FOODSERVICE"],
     ["SA Sort Code", "4.42 PORK RIBS"]]
    [["Part Code", "44-531-0"], ["Description", "100/2.5 CU PRK RIB SOY"],
     ["Qty Period", "10"], ["QTY LastYr", "21"], ["Var", "-52"],
     ["Lbs Period", "14"], ["Lbs LastYr", "31"], ["Var", "-55"],
     ["Salesperson", "22 BILL PRICE"],
     ["Customer", "1014 KECK'S MEAT & FOODSERVICE"],
     ["SA Sort Code", "4.42 PORK RIBS"]]
    ...
    [["Part Code", "5018"], ["Description", "36/5 KING B WAFFLES"],
     ["Qty Period", "10"], ["QTY LastYr", "10"], ["Var", "0"],
     ["Lbs Period", "10"], ["Lbs LastYr", "14"], ["Var", "-29"],
     ["Salesperson", "22 BILL PRICE"],
     ["Customer", "1014 KECK'S MEAT & FOODSERVICE"],
     ["SA Sort Code", "19.1 WAFFLES"]]

    class RossReader
      def initialize(file)
        @file       = file
        @in_header  = false
        @headers    = nil
        @format     = nil
        @categories = []
      end

      def parse_header(headers)
        @format  = headers[3].split.map { |col| "a#{col.size}" }.join("x")
        @headers = headers[2].unpack(@format).map { |f| f.strip }
      end

      def each
        open(@file) do |report|
          parse_header(Array.new(4) { report.gets })
          report.each do |line|
            if line =~ /\ASAA_R/
              @in_header = true
            elsif @in_header
              @in_header = false if line =~ /\A-/
            else
              break if line =~ /\AReport Totals/
              next if line =~ /\A\s+\z/ or line =~ /\A\s+-/ or line =~ /\b(sub)?totals\b/i
              if line =~ /\A\s+(\w[\w\s]+?)\s+(\d.+?)\s+\z/
                if cat = @categories.assoc($1)
                  cat[-1] = $2
                else
                  @categories << [$1, $2]
                end
              else
                yield @headers.zip(line.unpack(@format).map { |f| f.strip }) +
                      @categories
              end
            end
          end # report.each
        end # open
      end # def
    end
2 - Writing

    require "rubygems"
    require "faster_csv"

    class CSVWriter
      def initialize
        @headers = nil
      end

      def puts(record)
        if @headers.nil?
          @headers = record.map { |field| field.first }
          FCSV { |csv| csv << @headers }
        end
        FCSV { |csv| csv << record.map { |field| field.last } }
      end
    end
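
FasterCSV became Ruby's standard csv library in Ruby 1.9, so on a current Ruby the same writer can be sketched without the gem. This is an adaptation, not the code from the talk: the class name StdlibCSVWriter and the optional io argument are assumptions added here so the output can be captured for inspection.

```ruby
require "csv"
require "stringio"

# Hypothetical variant of the slide's CSVWriter built on the standard csv library.
class StdlibCSVWriter
  def initialize(io = $stdout)
    @csv     = CSV.new(io)   # wraps any writable IO-like object
    @headers = nil
  end

  # record is an array of [header, value] pairs, as yielded by the reader.
  def puts(record)
    if @headers.nil?
      @headers = record.map { |field| field.first }
      @csv << @headers       # write the header row once, before the first record
    end
    @csv << record.map { |field| field.last }
  end
end

out    = StringIO.new
writer = StdlibCSVWriter.new(out)
writer.puts([["Part Code", "5018"], ["Description", "36/5 KING B WAFFLES"]])
writer.puts([["Part Code", "3680"], ["Description", "40/4, RB PRK"]])
csv_text = out.string
```

Fields that contain a comma, like the second description, are quoted by the library, which is the main reason to use a CSV writer instead of join(",").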
3 - Munging

    require "munger"
    require "ross_reader"
    require "csv_writer"

    report = Munger.new(RossReader.new(ARGV.shift), CSVWriter.new)
    report.munge do |record|
      record.each do |field|
        if field.last =~ /\A(?:\d+,)+\d+k?\z/
          field.last.delete!(",")
        end
        field.last.sub!(/\A\d+k\z/) { |num| num.to_i * 1000 }
      end
      record
    end
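
The two number fixes in the block above (dropping thousands separators and expanding a trailing "k") can be tried in isolation. The sketch below wraps them in a hypothetical normalize_number helper that works on a copy instead of modifying the field in place; the patterns follow the ones on the slide.

```ruby
# Hypothetical helper: returns a cleaned-up copy of one field value.
def normalize_number(text)
  value = text.dup
  # "1,080" -> "1080": only touch values that look like comma-grouped digits.
  value.delete!(",") if value =~ /\A(?:\d+,)+\d+k?\z/
  # "73k" -> "73000": expand a trailing k into thousands.
  value.sub(/\A(\d+)k\z/) { ($1.to_i * 1000).to_s }
end

cleaned = ["1,080", "73k", "-100", "53/3 CU PRK RIB SOY"].map do |text|
  normalize_number(text)
end
```

Values that are not numbers, such as the description, pass through unchanged.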
So what can I do with all this?
• Output your data into a spreadsheet such as Excel
• Open the data in your text editor
• Import the data into a database
• Let’s see it in action
Examples

    require "munger"
    require "rubygems"
    require "faster_csv"
    require "active_record"

    class DBWriter
      def initialize(model, path = "db.sqlite")
        ActiveRecord::Base.establish_connection(
          :adapter  => "sqlite3",
          :database => path
        )
        @model = model
      end

      def puts(record)
        @model.create!(record)
      end
    end

    class PartCode < ActiveRecord::Base
    end

    unless File.exist? "db.sqlite"
      class CreatePartCodes < ActiveRecord::Migration
        def self.up
          create_table :part_codes do |t|
            t.string  :part_code
            t.string  :description
            t.integer :qty_period
            t.integer :qty_lastyr
            t.integer :qty_var
            t.integer :lbs_period
            t.integer :lbs_lastyr
            t.integer :lbs_var
            t.string  :salesperson
            t.string  :customer
            t.string  :sa_sort_code
          end
        end

        def self.down
          drop_table :part_codes
        end
      end
    end

    reader = FCSV($stdin, :headers => true, :header_converters => :symbol)
    writer = DBWriter.new(PartCode)
    CreatePartCodes.up if defined? CreatePartCodes
    m = Munger.new(reader, writer)
    m.munge do |row|
      row.to_hash
    end
Congratulations! You, too, are now a Munger!