Web Scraping and Regex

Upload: coby

Post on 24-Feb-2016


Page 1: Web Scraping and Regex

Web Scraping and Regex

Page 2: Web Scraping and Regex

Ruby Gems

• Add-on libraries
• Don’t reinvent the wheel
• Syntax:
  – gem install _______

Page 3: Web Scraping and Regex

Scraping

• Parsing HTML with Nokogiri

Page 4: Web Scraping and Regex
Page 5: Web Scraping and Regex

Strings

Strings are a sequence of characters denoted by single or double quotes.
• "a"
• "puts"
• "John's book"
• "12+100"
• 'To be or not to be, that is the question...'

Page 6: Web Scraping and Regex

String methods

• "abc".upcase #=> "ABC"
• "DEF".downcase #=> "def"
• "abcdef".reverse #=> "fedcba"
• "ABCdef".capitalize #=> "Abcdef"
• "dog park".length #=> 8

Page 7: Web Scraping and Regex

Mixing datatypes

• puts 2 + 2 #=> 4
• puts "2" + "2" #=> 22

Page 8: Web Scraping and Regex

Conversion Methods

• "42".to_i + 42 #=> 84
• "42" + 42.to_s #=> "4242"
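Leaving out the conversion is an easy slip to make. As a small sketch (the variable names are illustrative and not from the slides), adding an Integer to a String raises a TypeError, while the converted versions behave as shown above:

```ruby
# Adding an Integer to a String without converting raises TypeError.
begin
  "42" + 42
rescue TypeError => e
  error_name = e.class.name   # the name of the exception class
end

sum    = "42".to_i + 42   # Integer addition
joined = "42" + 42.to_s   # String concatenation

puts error_name
puts sum
puts joined
```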

Page 9: Web Scraping and Regex

What class is it?

• 42.class # Fixnum (Integer in Ruby 2.4 and later)
• 42.0.class # Float
• "42".class # String

Page 10: Web Scraping and Regex

Combining strings and numbers

• puts "The result of 7 + 7 is " + (7 + 7).to_s
  #=> The result of 7 + 7 is 14

Page 11: Web Scraping and Regex

Interpolation – another way

• puts "The result of 7 + 7 is #{7+7}"
  #=> The result of 7 + 7 is 14

• puts "#{10 * 10} is greater than #{9 * 11}"
  #=> 100 is greater than 99

Page 12: Web Scraping and Regex

Interpolation

• Notice how the expressions inside the #{ } are evaluated before being included in the string.

• Two requirements here:
  – The string must be enclosed in double-quotes
  – Use a pound sign # followed by curly braces {} to enclose the Ruby code.
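The double-quote requirement can be checked directly. In this minimal sketch (the variable names are made up for illustration), the same text is written once with double quotes and once with single quotes:

```ruby
total = 7 + 7

double_quoted = "The result of 7 + 7 is #{total}"   # the expression is evaluated
single_quoted = 'The result of 7 + 7 is #{total}'   # the text is kept literally

puts double_quoted
puts single_quoted
```

Only the double-quoted version contains 14; the single-quoted one still contains the characters #{total}.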

Page 13: Web Scraping and Regex

Exercise

Write the following strings using interpolation:
• "1 + 1 is: " + (1+1).to_s
• "There were 12 cases of a dozen eggs each (" + (12 * 12).to_s + ")"
• "His name is " + "jon".capitalize

Page 14: Web Scraping and Regex

Solution

• "1 + 1 is: #{1+1}"
• "There were 12 cases of a dozen eggs each (#{12 * 12})"
• "His name is #{"jon".capitalize}"

Page 15: Web Scraping and Regex

Backslash is the “escape character”

• "He asked Scarlet, \"Frankly my dear, do I give a damn?\" To which Scarlet responded, \"No.\""

Page 16: Web Scraping and Regex

New Line

puts "Doe, a deer, a female deer.\nRay, a drop of golden\nsun."

#=> Doe, a deer, a female deer.
#=> Ray, a drop of golden
#=> sun.

Page 17: Web Scraping and Regex

Common uses of escaped characters

• \n– a newline

• \t– a tab space

• \"– a literal quotation mark

• \'– a literal apostrophe

• \\– a literal backslash
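One way to see what an escape does is to count characters. The sketch below (variable names are illustrative) relies on the fact that single-quoted Ruby strings only recognize \' and \\, so \t and \n stay as two characters each:

```ruby
double_quoted = "a\tb\n"   # \t and \n become a tab and a newline
single_quoted = 'a\tb\n'   # the backslashes stay as ordinary characters

puts double_quoted.length  # a, tab, b, newline
puts single_quoted.length  # a, \, t, b, \, n
puts 'It\'s here'          # \' is still recognized inside single quotes
```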

Page 18: Web Scraping and Regex

What will display?

1. puts "He's a good doctor, and thorough."
2. puts '"I\'ve been at sea."'
3. puts 'Maude said to him: "He's a good doctor, and thorough"'
4. puts 'Out of order'.upcase
5. puts "We're going to #{'sea world'.upcase}"
6. puts "There were #{12*2/4} sheep and #{"three" + " sheepdogs"} out over at #{"Cherry".upcase} Creek."
7. puts '#{2*2}score and #{"7"} years ago'

Page 19: Web Scraping and Regex

What will display?

1. He's a good doctor, and thorough.
2. "I've been at sea."
3. There is an unterminated String here. Because it is a single-quoted String, the original String terminates at the apostrophe in "He's. A new, unterminated String then begins at the " after thorough.
4. OUT OF ORDER
5. We're going to SEA WORLD
6. There were 6 sheep and three sheepdogs out over at CHERRY Creek.
7. #{2*2}score and #{"7"} years ago [string interpolation isn't done in single-quoted strings]

Page 20: Web Scraping and Regex

String Substitution

puts "The cat and the hat".sub("hat", "rat")
#=> The cat and the rat

puts "Another brick in the wall".sub("brick in the", "")
#=> Another  wall (the spaces on both sides of the removed text remain)

Page 21: Web Scraping and Regex

Global Substitution

puts "I own an iPad, iPhone and an iPod".gsub('i', 'my')
#=> I own an myPad, myPhone and an myPod

Note that character case matters.
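To make the difference between sub and gsub, and the effect of case, concrete, here is a short sketch using the same sentence (the variable names are illustrative):

```ruby
gadgets = "I own an iPad, iPhone and an iPod"

first_only = gadgets.sub("i", "my")    # replaces only the first lowercase i
all_lower  = gadgets.gsub("i", "my")   # replaces every lowercase i
upper_only = gadgets.gsub("I", "my")   # only the capital I at the start matches

puts first_only
puts all_lower
puts upper_only
```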

Page 22: Web Scraping and Regex

Regular Expressions

• Programming Ruby 1.9 by Dave Thomas (more commonly known as the pickaxe book) sums up what you do with regular expressions in three words: test, extract, change.

Page 23: Web Scraping and Regex

AKA Regex

• You've probably used your word processor's find-and-replace to do substitutions, such as:
  – Replace all occurrences of "NYC" with "New York City".

• With a regular expression, you can do the same find-and-replace action but catch "N.Y.C", "N.Y.", "NY, NY", "nyc" and any other slight variations in spelling and capitalizations, all in one go.

Page 24: Web Scraping and Regex
Page 25: Web Scraping and Regex

^\s*\n

• ^
  – The caret stands for the start of the line. It indicates that we are interested in a pattern from the very beginning of a given line. This is also referred to as an anchor.
• \s*
  – The \s stands for any whitespace character. The asterisk * indicates that we are looking for 0 or more of these whitespace characters. So the regex will work whether there is no whitespace or a lot of whitespace at the beginning of the line.
• \n
  – This is a special character for a newline.
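A pattern like this is typically used to delete blank lines. The following is a minimal Ruby sketch of that idea; the method name and sample text are hypothetical, and it uses gsub to replace every match with an empty string:

```ruby
def remove_blank_lines(text)
  # ^ anchors at the start of a line, \s* allows optional whitespace,
  # and \n is the newline that ends the otherwise empty line
  text.gsub(/^\s*\n/, "")
end

notes = "first line\n\n   \nsecond line\n\t\nthird line\n"
puts remove_blank_lines(notes)
```

Lines that contain only spaces or tabs are removed along with the truly empty ones, because \s matches any whitespace.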

Page 26: Web Scraping and Regex
Page 27: Web Scraping and Regex

^ +

• ^
  – Again, this is the beginning-of-the-line anchor.
• (a single space)
  – The empty space is just a literal empty space. We could've also used \s
• +
  – The plus +, known as the greedy operator, looks for one or more of the previous token, which in our example, is a whitespace.
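In Ruby, the same pattern can strip leading spaces from every line of a string. This is a sketch with a hypothetical method name and sample text:

```ruby
def strip_leading_spaces(text)
  # ^ matches at the start of each line; " +" matches one or more spaces there
  text.gsub(/^ +/, "")
end

indented = "  one\n      two\nthree"
puts strip_leading_spaces(indented)
```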

Page 28: Web Scraping and Regex
Page 29: Web Scraping and Regex

\[\d+\]

• \[
  – Square brackets are a special character in regexes. But we don't want that special meaning. We just want a literal square bracket, so we escape it using a backslash \
• \d+
  – The \d represents any numerical digit. Thus, when followed by the greedy operator +, the \d+ matches one or more numerical digits.
• \]
  – This just matches the literal closing square bracket
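As an illustration, this pattern can remove bracketed number markers such as [1] from a piece of text. The method name and the sample sentence below are made up for the example:

```ruby
def strip_number_markers(text)
  # \[ and \] are literal brackets; \d+ is one or more digits between them
  text.gsub(/\[\d+\]/, "")
end

sample = "Regexes are powerful[1] and widely supported[23]. Brackets like [note] stay."
puts strip_number_markers(sample)
```

The [note] marker survives because \d+ requires at least one digit and nothing else between the brackets.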

Page 30: Web Scraping and Regex

Years as 4-digit

3-10-2010
11-7-06
1-6-2007
4-14-08
7-10-2011
1-11-09
12-9-11
6-1-10
5-6-2009

Page 31: Web Scraping and Regex

Regex

• Open up your editor's find-and-replace and in the Find box, type in:

-(\d{2})$

• In the Replace box (your text editor's flavor of regexes may use a backslash \ instead of a dollar sign), type:

-20$1

Page 32: Web Scraping and Regex

-(\d{2})$

• -
  – This is just a normal, i.e. literal, i.e. "non-special" hyphen.
• ()
  – Parentheses are special regex characters that capture the pattern within them for later use (in the Replace field). In our current example, we want to use whatever the current year value is (e.g. 07 or 11) and prepend a 20 to it.
• \d
  – A d would normally just match the letter "d". But with a backslash, this becomes a special regex character that matches any numerical digit.
• {2}
  – Curly braces allow you to specify the exact number of occurrences of the pattern preceding the braces. Therefore, the regex {2} will match whatever pattern precedes it exactly two times
• $
  – The dollar sign $ will match the end of the line. We use it in our dates example because we want only to match the last digits of each line. Otherwise, the regex would match the day values because they also begin with a hyphen (ex. 8-20-10).

Page 33: Web Scraping and Regex

20$1

• The only thing special here is the $1 (again, your text editor may use backslashes instead of dollar signs, e.g. \1).
• Remember those parentheses we used in the Find pattern? The characters matched by the pattern they encompassed are considered a captured group.
• They can be retrieved for use – in this case, the Replace field – by using a dollar sign and the captured group's numerical order.
• We only had one set of parentheses, so $1 grabs the first (and only) set. If we had used two sets of parentheses, $2 would retrieve the value between the second set of ()
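The same find-and-replace can be sketched in Ruby. Ruby's sub and gsub use the backslash form, so the captured group is written \1 inside a single-quoted replacement string. The method name here is hypothetical:

```ruby
def expand_two_digit_year(date)
  # -(\d{2})$ captures a two-digit year at the end of the line;
  # '\1' in the replacement refers to that captured group
  date.sub(/-(\d{2})$/, '-20\1')
end

dates = ["3-10-2010", "11-7-06", "1-6-2007", "4-14-08"]
fixed = dates.map { |date| expand_two_digit_year(date) }
puts fixed
```

Dates that already have a four-digit year are left alone, because the pattern requires exactly two digits between the last hyphen and the end of the line.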

Page 34: Web Scraping and Regex

Regex in Ruby

puts "My cat eats catfood".sub("cat", "dog")
# => My dog eats catfood

• If you passed in /cat/, you'd get the same result as above, as the letters cat match their literal values:

puts "My cat eats catfood".sub(/cat/, "dog")
# => My dog eats catfood

Page 35: Web Scraping and Regex

gsub

puts "My cat eats catfood".gsub("cat", "dog")
# => My dog eats dogfood

Page 36: Web Scraping and Regex

We need regex

str = "My cat goes catatonic when I concatenate his food with Muscat grapes"

puts str.gsub("cat", "dog")

# => My dog goes dogatonic when I condogenate his food with Musdog grapes

Page 37: Web Scraping and Regex

With regex

str = "My cat gets catatonic when I attempt to concatenate his food with Muscat grapes"

puts str.gsub(/\bcat\b/, 'dog')

# => My dog gets catatonic when I attempt to concatenate his food with Muscat grapes
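The \b in that pattern is a word boundary: it matches the position between a word character and a non-word character, so /\bcat\b/ only matches cat as a standalone word. As a rough check, this sketch counts matches with String#scan (a method not covered in the slides):

```ruby
str = "My cat gets catatonic when I attempt to concatenate his food with Muscat grapes"

anywhere   = str.scan(/cat/).length       # every occurrence of the letters c-a-t
whole_word = str.scan(/\bcat\b/).length   # only the standalone word

puts anywhere
puts whole_word
```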

Page 38: Web Scraping and Regex

String.match

contract = "Hughes Missile Systems Company, Tucson, Arizona, is being awarded a $7,311,983 modification to a firm fixed price contract for the FY94 TOW missile production buy, total 368 TOW 2Bs. Work will be performed in Tucson, Arizona, and is expected to be completed by April 30, 1996. Of the total contract funds, $7,311,983 will expire at the end of the current fiscal year. This is a sole source contract initiated on January 14, 1991. The contracting activity is the U.S. Army Missile Command, Redstone Arsenal, Alabama (DAAH01-92-C-0260)."

Page 39: Web Scraping and Regex

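mtch = contract.match(/\$[\d,]+/)

puts mtch

#=> $7,311,983
#=> $6,952,821

Note that match returns only the first match, so the contract string above yields just $7,311,983; the second value does not appear in that string.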

Page 40: Web Scraping and Regex

\$[\d,]+

• The \$ matches a literal dollar sign. The [\d,] matches any character that is either a numerical digit or a comma.

• The plus sign + is the greedy operator and it will match the pattern that precedes it one or more times. Therefore, [\d,]+ will match any of the following strings:
  – 12,000
  – 42
  – 912,345,200
  – ,,342134,,3,4,5
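Because match stops at the first hit, String#scan is one way to collect every amount in a text. The sketch below uses made-up sample sentences, and the stricter pattern is only one possible tightening, not something from the slides:

```ruby
# Hypothetical sample text, shortened for illustration
notice = "Awarded a $7,311,983 modification; a separate $250 fee applies."

first_amount = notice.match(/\$[\d,]+/)[0]   # match returns only the first hit
all_amounts  = notice.scan(/\$[\d,]+/)       # scan returns every hit as an array

# [\d,]+ also swallows a trailing comma; a stricter pattern avoids that
price_note = "It costs $42, plus tax."
loose  = price_note.scan(/\$[\d,]+/)
strict = price_note.scan(/\$\d{1,3}(?:,\d{3})*/)

puts first_amount
puts all_amounts.inspect
puts loose.inspect
puts strict.inspect
```

The stricter version accepts one to three digits followed by any number of comma-plus-three-digit groups, so odd inputs like ,,342134,,3,4,5 no longer match as a whole.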

Page 41: Web Scraping and Regex

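Match dates

mtch = contract.match(/\w+ \d{1,2}, \d{4}/)

puts mtch

#=> April 30, 1996
#=> May 31, 1996

Note that match returns only the first match, so the contract string above yields just April 30, 1996; May 31, 1996 does not appear in that string.

\w matches any word character: a letter, a digit, or an underscore.

Or if you want to be more precise in matching the month names, you can use a character set, such as [A-Za-z]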
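Combining the character set with the parentheses introduced earlier lets you pull the date apart. In this sketch the sentence is taken from the contract text, while the variable names are illustrative:

```ruby
sentence = "Work is expected to be completed by April 30, 1996."

# Three captured groups: month name, day, year
m = sentence.match(/([A-Za-z]+) (\d{1,2}), (\d{4})/)

whole = m[0]        # the entire matched text
month = m[1]        # first captured group
day   = m[2].to_i   # captures are strings, so convert before doing arithmetic
year  = m[3].to_i

puts whole
puts month
puts day + 1
puts year
```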

Page 42: Web Scraping and Regex

If Else

if my_bank_account_balance > 50.00
  puts "I'm eating steak!"
else
  puts "I'm eating ramen :("
end

Page 43: Web Scraping and Regex

Examples

if val > 10
  puts "Big"
end

if val <= 10 && val >= 0
  puts "Small"
end
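When several ranges need different labels, Ruby's elsif keeps the checks in one statement. This is a sketch only: the method name, the cut-off values, and the "Negative" label are assumptions for illustration:

```ruby
def size_label(val)
  if val > 10
    "Big"
  elsif val >= 0      # reached only when val is 10 or less
    "Small"
  else
    "Negative"
  end
end

puts size_label(42)
puts size_label(7)
puts size_label(-3)
```

Each branch is tested in order, and only the first one whose condition is true runs.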

Page 44: Web Scraping and Regex
Page 45: Web Scraping and Regex

An Exercise

• Uses regex
• Ask the user for an email address and use a regex to check that it is valid.
• Change the name of a group of files
  – http://www.rexegg.com/regex-uses.html
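One possible starting point for the exercise is sketched below. The pattern is deliberately loose (something@something.something with no spaces) and is not a full email validator. The method names, the IMG_ file naming scheme, and the vacation_ prefix are all assumptions; String#match? requires Ruby 2.4 or later.

```ruby
# Loose check: text, an @, more text, a dot, more text, and no whitespace
EMAIL_PATTERN = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

def plausible_email?(input)
  EMAIL_PATTERN.match?(input.strip)
end

def new_names(filenames)
  # IMG_0042.JPG becomes vacation_0042.jpg; other names are left unchanged
  filenames.map { |name| name.sub(/\AIMG_(\d+)\.JPG\z/, 'vacation_\1.jpg') }
end

puts plausible_email?("ada@example.com")
puts plausible_email?("not an email")
puts new_names(["IMG_0042.JPG", "notes.txt"]).inspect
```

In a real script, the input could come from gets, and each old and new name could be passed to File.rename to apply the change on disk.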