Web Scraping and Regex

Ruby Gems

• Add-on libraries
• Don’t reinvent the wheel
• Syntax:
  – gem install _______

Scraping

• Parsing HTML with Nokogiri

Strings

Strings are sequences of characters enclosed in single or double quotes.
• "a"
• "puts"
• "John's book"
• "12+100"
• 'To be or not to be, that is the question...'

String methods

• "abc".upcase #=> "ABC"
• "DEF".downcase #=> "def"
• "abcdef".reverse #=> "fedcba"
• "ABCdef".capitalize #=> "Abcdef"
• "dog park".length #=> 8

Mixing datatypes

• puts 2 + 2 #=> 4
• puts "2" + "2" #=> 22

Conversion Methods

• "42".to_i + 42 #=> 84
• "42" + 42.to_s #=> "4242"

What class is it?

• 42.class # Integer (Fixnum in Ruby versions before 2.4)
• 42.0.class # Float
• "42".class # String
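The conversion methods and the class check can be tried together. This is a minimal sketch; the helper name add_as_numbers is made up for illustration and is not part of Ruby.

```ruby
# Hypothetical helper: convert both arguments to integers before adding.
def add_as_numbers(a, b)
  a.to_i + b.to_i
end

puts add_as_numbers("42", 42)   # prints 84
puts "42" + 42.to_s             # prints 4242
puts "42".to_i.class            # Integer on Ruby 2.4+, Fixnum on older versions
puts "42".to_f.class            # Float
puts 42.to_s.class              # String
```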

Combining strings and numbers

• puts "The result of 7 + 7 is " + (7 + 7).to_s
  #=> The result of 7 + 7 is 14

Interpolation – another way

• puts "The result of 7 + 7 is #{7+7}"
  #=> The result of 7 + 7 is 14
• puts "#{10 * 10} is greater than #{9 * 11}"
  #=> 100 is greater than 99

Interpolation

• Notice how the expressions inside the #{ } are evaluated before being included in the string.
• Two requirements here:
  – The string must be enclosed in double quotes
  – Use a pound sign # followed by curly braces {} to enclose the Ruby code.
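Both requirements can be checked with a short sketch. The helper name sum_sentence is hypothetical and exists only for this illustration.

```ruby
# Hypothetical helper: build a sentence with interpolation.
def sum_sentence(a, b)
  "The result of #{a} + #{b} is #{a + b}"
end

puts sum_sentence(7, 7)          # prints: The result of 7 + 7 is 14

# Single quotes leave the #{...} text untouched.
puts 'Single quotes: #{7 + 7}'   # prints: Single quotes: #{7 + 7}
```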

Exercise

Write the following strings using interpolation:
• "1 + 1 is: " + (1+1).to_s
• "There were 12 cases of a dozen eggs each (" + (12 * 12).to_s + ")"
• "His name is " + "jon".capitalize

Solution

• "1 + 1 is: #{1+1}"
• "There were 12 cases of a dozen eggs each (#{12 * 12})"
• "His name is #{"jon".capitalize}"

Backslash is the “escape character”

• "He asked Scarlet, \"Frankly my dear, do I give a damn?\" To which Scarlet responded, \"No.\""

New Line

puts "Doe, a deer, a female deer.\nRay, a drop of golden\nsun."

#=> Doe, a deer, a female deer.
#=> Ray, a drop of golden
#=> sun.

Common uses of escaped characters

• \n– a newline

• \t– a tab space

• \"– a literal quotation mark

• \'– a literal apostrophe

• \\– a literal backslash
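The escapes listed above can be seen side by side in a small sketch. The helper name tab_row and the sample strings are illustrative assumptions.

```ruby
# Hypothetical helper: join fields into one tab-separated line.
def tab_row(*fields)
  fields.join("\t")
end

puts tab_row("speaker", "line")
puts tab_row("Scarlet", "\"No.\"")        # \" gives a literal quotation mark
puts "C:\\temp\\notes.txt"                # prints C:\temp\notes.txt
puts 'It\'s a single-quoted apostrophe'   # \' gives a literal apostrophe
puts "line one\nline two"                 # \n starts a new line
```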

What will display?

1. puts "He's a good doctor, and thorough."
2. puts '"I\'ve been at sea."'
3. puts 'Maude said to him: "He's a good doctor, and thorough"'
4. puts 'Out of order'.upcase
5. puts "We're going to #{'sea world'.upcase}"
6. puts "There were #{12*2/4} sheep and #{"three" + " sheepdogs"} out over at #{"Cherry".upcase} Creek."
7. puts '#{2*2}score and #{"7"} years ago'

What will display?

1. He's a good doctor, and thorough.
2. "I've been at sea."
3. There is an unterminated String here. The original String terminates at "He's because it is a single-quoted String, so the apostrophe closes it. A new, unterminated String begins at the ' after thorough".
4. OUT OF ORDER
5. We're going to SEA WORLD
6. There were 6 sheep and three sheepdogs out over at CHERRY Creek.
7. #{2*2}score and #{"7"} years ago [string interpolation isn't done in single-quoted strings]

String Substitution

puts "The cat and the hat".sub("hat", "rat")
#=> The cat and the rat
puts "Another brick in the wall".sub("brick in the", "")
#=> Another  wall (two spaces remain, because only the matched text is removed)

Global Substitution

puts "I own an iPad, iPhone and an iPod".gsub('i', 'my')
#=> I own an myPad, myPhone and an myPod

Note that character case matters.
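A short sketch of the difference between sub, gsub, and a case-insensitive match. The helper name replace_i is hypothetical; the /i flag is a standard regex option that ignores case.

```ruby
# Hypothetical helper: replace every "i" regardless of case.
def replace_i(text, replacement)
  text.gsub(/i/i, replacement)
end

sentence = "I own an iPad, iPhone and an iPod"

puts sentence.sub('i', 'my')    # only the first lowercase i changes
puts sentence.gsub('i', 'my')   # every lowercase i changes, the capital I does not
puts replace_i(sentence, 'my')  # the /i flag also catches the capital I
```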

Regular Expressions

• Programming Ruby 1.9 by Dave Thomas (more commonly known as the pickaxe book) sums up what you do with regular expressions in three words: test, extract, change.

AKA Regex

• You've probably used your word processor's find-and-replace to do substitutions, such as:
  – Replace all occurrences of "NYC" with "New York City".
• With a regular expression, you can do the same find-and-replace action but catch "N.Y.C", "N.Y.", "NY, NY", "nyc" and any other slight variations in spelling and capitalization, all in one go.

^\s*\n

• ^
  – The caret stands for the start of the line. It indicates that we are interested in a pattern from the very beginning of a given line. This is also referred to as an anchor.
• \s*
  – The \s stands for any whitespace character. The asterisk * indicates that we are looking for 0 or more of these whitespace characters. So the regex will match whether there is no whitespace or a lot of whitespace at the beginning of the line.
• \n
  – This is the special character for a newline.
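One plausible use of this pattern is deleting blank lines. The sketch below assumes that goal; the helper name remove_blank_lines is made up. In Ruby, ^ matches at the start of every line of a multi-line string, so no extra flag is needed.

```ruby
# Hypothetical helper: delete lines that are empty or contain only whitespace.
def remove_blank_lines(text)
  text.gsub(/^\s*\n/, "")
end

text = "first\n\n   \nsecond\n"
puts remove_blank_lines(text)
# prints:
# first
# second
```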

^ +

• ^
  – Again, this is the beginning-of-the-line anchor.
• (a single space)
  – The empty space is just a literal space. We could've also used \s
• +
  – The plus +, known as the greedy operator, looks for one or more of the previous token, which in our example is a space.
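A minimal sketch of this pattern in Ruby, assuming the goal is to strip indentation. The helper name strip_leading_spaces is hypothetical.

```ruby
# Hypothetical helper: remove the spaces at the start of every line.
def strip_leading_spaces(text)
  text.gsub(/^ +/, "")
end

puts strip_leading_spaces("  two spaces\n    four spaces\nnone")
# prints:
# two spaces
# four spaces
# none
```

Spaces in the middle of a line are left alone, because the ^ anchor only allows a match at the beginning of a line.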

\[\d+\]

• \[
  – Square brackets are special characters in regexes. But we don't want that special meaning. We just want a literal square bracket, so we escape it using a backslash \
• \d+
  – The \d represents any numerical digit. Thus, when followed by the greedy operator +, the \d+ matches one or more numerical digits.
• \]
  – This just matches the literal closing square bracket.
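A sketch of one possible use: removing bracketed reference numbers from copied text. The helper name and the sample sentence are assumptions for illustration.

```ruby
# Hypothetical helper: delete markers such as [1] or [23].
def strip_bracketed_numbers(text)
  text.gsub(/\[\d+\]/, "")
end

puts strip_bracketed_numbers("Ruby[1] is fun[23].")   # prints: Ruby is fun.
puts strip_bracketed_numbers("Kept: [] and [a]")      # prints: Kept: [] and [a]
```

The second call shows that the brackets must contain at least one digit, and only digits, for the pattern to match.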

Years as 4-digit

3-10-2010
11-7-06
1-6-2007
4-14-08
7-10-2011
1-11-09
12-9-11
6-1-10
5-6-2009

Regex

• Open up your editor's find-and-replace and in the Find box, type in:

-(\d{2})$

• In the Replace box (your text editor's flavor of regexes may use a backslash \ instead of a dollar sign), type:

-20$1

-(\d{2})$

• -
  – This is just a normal, i.e. literal, "non-special" hyphen.
• ()
  – Parentheses are special regex characters that capture the pattern within them for later use (in the Replace field). In our current example, we want to use whatever the current year value is (e.g. 07 or 11) and prepend a 20 to it.
• \d
  – A d would normally just match the letter "d". But with a backslash, this becomes a special regex character that matches any numerical digit.
• {2}
  – Curly braces allow you to specify the exact number of occurrences of the pattern preceding the braces. Therefore, the regex {2} will match whatever pattern precedes it exactly two times.
• $
  – The dollar sign $ will match the end of the line. We use it in our dates example because we want only to match the last digits of each line. Otherwise, the regex would match the day values because they also begin with a hyphen (e.g. 8-20-10).

20$1

• The only thing special here is the $1 (again, your text editor may use backslashes instead of dollar signs, e.g. \1).
• Remember those parentheses we used in the Find pattern? The characters matched by the pattern they encompassed are considered a captured group.
• They can be retrieved for use (in this case, in the Replace field) by using a dollar sign and the captured group's numerical order.
• We only had one set of parentheses, so $1 grabs the first (and only) set. If we had used two sets of parentheses, $2 would retrieve the value between the second set of ().
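The same find-and-replace can be sketched in Ruby with gsub. In a Ruby replacement string the captured group is written \1 rather than $1. The helper name expand_years is hypothetical, and, like the editor example, it assumes every two-digit year belongs to the 2000s.

```ruby
# Hypothetical helper: turn a trailing two-digit year into a four-digit year.
def expand_years(text)
  # '-20\1' is single-quoted so that \1 reaches gsub as a backreference.
  text.gsub(/-(\d{2})$/, '-20\1')
end

dates = "3-10-2010\n11-7-06\n1-6-2007\n4-14-08"
puts expand_years(dates)
# prints:
# 3-10-2010
# 11-7-2006
# 1-6-2007
# 4-14-2008
```

Dates that already have four-digit years are untouched, because no hyphen in them is followed by exactly two digits and then the end of the line.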

Regex in Ruby

puts "My cat eats catfood".sub("cat", "dog")
# => My dog eats catfood

• If you passed in /cat/, you'd get the same result as above, as the letters cat match their literal values:

puts "My cat eats catfood".sub(/cat/, "dog")
# => My dog eats catfood

gsub

puts "My cat eats catfood".gsub("cat", "dog")
# => My dog eats dogfood

We need regex

str = "My cat goes catatonic when I concatenate his food with Muscat grapes"

puts str.gsub("cat", "dog")

# => My dog goes dogatonic when I condogenate his food with Musdog grapes

With regex

str = "My cat gets catatonic when I attempt to concatenate his food with Muscat grapes"

puts str.gsub(/\bcat\b/, 'dog')

# => My dog gets catatonic when I attempt to concatenate his food with Muscat grapes
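The \b in that pattern is a word-boundary anchor: it matches the position between a word character and a non-word character, so \bcat\b only matches cat as a whole word. A small sketch that counts matches makes the difference visible; the helper name count_matches is made up for illustration.

```ruby
# Hypothetical helper: count how many times a pattern matches.
def count_matches(text, pattern)
  text.scan(pattern).length
end

str = "My cat gets catatonic when I attempt to concatenate his food with Muscat grapes"

puts count_matches(str, /cat/)       # prints 4 (cat, catatonic, concatenate, Muscat)
puts count_matches(str, /\bcat\b/)   # prints 1 (only the standalone word)
```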

String.match

contract = "Hughes Missile Systems Company, Tucson, Arizona, is being awarded a $7,311,983 modification to a firm fixed price contract for the FY94 TOW missile production buy, total 368 TOW 2Bs. Work will be performed in Tucson, Arizona, and is expected to be completed by April 30, 1996. Of the total contract funds, $7,311,983 will expire at the end of the current fiscal year. This is a sole source contract initiated on January 14, 1991. The contracting activity is the U.S. Army Missile Command, Redstone Arsenal, Alabama (DAAH01-92-C-0260)."

mtch = contract.match(/\$[\d,]+/)

puts mtch

#=> $7,311,983
#=> $6,952,821 (this value is not in the contract above, so it presumably comes from a second contract string that is not reproduced here; match returns only the first match in a string)

[\d,]+

• The \$ matches a literal dollar sign. The [\d,] matches any character that is either a numerical digit or a comma.
• The plus sign + is the greedy operator and it will match the pattern that precedes it one or more times. Therefore:
  [\d,]+
• ...will match any of the following strings:
  – 12,000
  – 42
  – 912,345,200
  – ,,342134,,3,4,5
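Because match stops at the first hit, scan is one way to collect every dollar amount in a string. The sketch below uses a shortened sample sentence and two hypothetical helpers, dollar_amounts and amount_to_i; neither is part of Ruby.

```ruby
# Hypothetical helper: return every dollar amount found in the text.
def dollar_amounts(text)
  text.scan(/\$[\d,]+/)
end

# Hypothetical helper: drop the $ and the commas, then convert to an integer.
def amount_to_i(amount)
  amount.delete("$,").to_i
end

sample = 'Awarded a $7,311,983 modification. Of the total funds, $7,311,983 will expire.'

amounts = dollar_amounts(sample)
puts amounts.length                # prints 2
puts amount_to_i(amounts.first)    # prints 7311983

# The character class is permissive: it does not check comma placement.
puts dollar_amounts('$,,342134,,3,4,5').first   # prints $,,342134,,3,4,5
```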

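Match dates

mtch = contract.match(/\w+ \d{1,2}, \d{4}/)

puts mtch

#=> April 30, 1996
#=> May 31, 1996 (not in the contract above; presumably from a second contract string that is not reproduced here)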

\w matches any word character: a letter, a digit, or an underscore.

Or if you want to be more precise in matching the month names, you can use a character set, such as [A-Za-z]
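A sketch comparing the two approaches. The helper name find_dates and the sample strings are assumptions; the letters-only pattern uses the [A-Za-z] character set suggested above.

```ruby
# Hypothetical helper: collect dates written like "April 30, 1996".
def find_dates(text)
  text.scan(/[A-Za-z]+ \d{1,2}, \d{4}/)
end

sample = "completed by April 30, 1996. This is a sole source contract initiated on January 14, 1991."
puts find_dates(sample).inspect   # prints ["April 30, 1996", "January 14, 1991"]

# \w also accepts digits, so it matches text that is not a month name.
odd = "item42 3, 2001"
puts odd.match(/\w+ \d{1,2}, \d{4}/).to_s   # prints item42 3, 2001
puts find_dates(odd).inspect                # prints []
```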

If Else

if my_bank_account_balance > 50.00
  puts "I'm eating steak!"
else
  puts "I'm eating ramen :("
end

Examples

if val > 10
  puts "Big"
end

if val > 0 && val <= 10
  puts "Small"
end
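A condition joined with && is true only when both comparisons are true, so the two sides must describe a range that a number can actually fall into. The sketch below combines the checks with elsif; the method name size_label and the third label are made up for illustration.

```ruby
# Hypothetical helper: classify a number using if / elsif / else.
def size_label(val)
  if val > 10
    "Big"
  elsif val > 0 && val <= 10
    "Small"
  else
    "Zero or negative"
  end
end

puts size_label(42)   # prints Big
puts size_label(5)    # prints Small
puts size_label(0)    # prints Zero or negative
```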

An Exercise

• Uses regex
• Ask the user for an email address and use regex to check that it is valid.
• Change the name of a group of files
  – http://www.rexegg.com/regex-uses.html
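One possible starting point for the email exercise is sketched below. The pattern is deliberately loose (something, an @, something, a dot, something, with no spaces or extra @ signs) and is an assumption for illustration, not a complete email validator. The names EMAIL_PATTERN and plausible_email? are hypothetical.

```ruby
# Loose, illustrative pattern: text@text.text with no whitespace or extra @.
EMAIL_PATTERN = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

# Hypothetical helper: true when the text looks roughly like an email address.
def plausible_email?(text)
  !!(text =~ EMAIL_PATTERN)
end

puts plausible_email?("jon@example.com")   # prints true
puts plausible_email?("jon@example")       # prints false
puts plausible_email?("jon example.com")   # prints false
```

In an interactive script, the address could be read with gets.chomp and passed to the helper.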
