data-mining the web using perl


Upload: tommy96

Post on 10-Nov-2014

Page 1: Data-Mining the Web Using Perl

Data-Mining the Web Using Perl

Burt L. Monroe
Director, Quantitative Social Science Initiative
Department of Political Science
The Pennsylvania State University

Page 2: Data-Mining the Web Using Perl

Data-Mining the Web

Examples
• Election Returns in Luxembourg
    Luxembourg Official Election Results, 2004
    http://qssi.psu.edu/files/luxembourg.pl
• Parliamentary Speech
    The Congressional Record

Page 3: Data-Mining the Web Using Perl

How’d You Do That?

There are several programming languages with “straightforward” facilities for doing this. Most notably,
• Perl
• Python
• Java

I’m going to talk about Perl, because
• it’s the most established
• it’s the one I know

It appears that Python may be preferable, but that’s for someone else to say.

Page 4: Data-Mining the Web Using Perl

What’s Perl?

Open source (free / flexible / extensible / a little wild and woolly – like Linux, R) programming language.

It is very very good at processing text.
• note, webpages are just texts.
• note, datasets (like a flat spreadsheet or Stata file) are just texts.
• Social scientists might have some use for turning one into the other, no?

It has very useful facilities for building
• Spiders
• Scrapers
• (and “agents”, “robots”, “crawlers”, etc.)

Page 5: Data-Mining the Web Using Perl

What’s a Spider?

A spider is a program designed to automatically gather webpages.

If, for example, you want to automatically download all of the speeches delivered in Congress today – without manually clicking on every one, cutting and pasting, etc. – you might want to build a spider.

Page 6: Data-Mining the Web Using Perl

What’s a scraper?

A scraper (or “screen-scraper”) extracts the information you want – whatever you consider to be data – from a given webpage.

If you want to know who said “health” and how many times, you might want to build a scraper.

Page 7: Data-Mining the Web Using Perl

BEWARE!

Spiders (and other similar types of programs – “robots”, “crawlers”) can be put to nefarious use:
• appropriating copyrighted materials
• extracting email addresses for spammers
• overwhelming servers to create “denial of service”
• generally violating a site’s “terms of service” or “acceptable use policy”

If you are not careful to use legal and ethical good practices, you can
• be denied access to a website altogether
• get yourself or the university sued or even subjected to criminal penalties

Page 8: Data-Mining the Web Using Perl

Perl

Open-source
Cross-platform
• (Windows – I recommend “ActivePerl” from http://www.activestate.com)

There are many websites with resources:
• http://www.cpan.org (Comprehensive Perl Archive Network)
• http://www.perlmonks.org (PerlMonks)
• http://www.perl.org
• http://perl.oreilly.com (O’Reilly Publishing)

Lots of mailing lists, etc.

Page 9: Data-Mining the Web Using Perl

Books

Basics of Perl
• The best books are put out by O’Reilly Publishing and are generally known by the animal on the cover.
• Learning Perl (the Llama)
    or, Learning Perl on Win32 Systems (the Gecko)
• Programming Perl (the Camel)

Web-mining
• Perl & LWP (the Blesbok, apparently)
• Spidering Hacks

These books, and some others, are or will be available in the “QuaSSI Library” (in Pond 216).

Page 10: Data-Mining the Web Using Perl

Running Perl

For machines with approved ActivePerl installations in Pond ...
• Perl is located in c:/Perl/

For today,
• we will operate entirely in the directory c:/Perl/eg/
• To get there,
    open Programs -> Accessories -> Command Prompt
    At the prompt, type c:
    Type cd Perl/eg

(In your particular installation, or on a Mac, or on something like Unix on high performance computing, these details will be different.)

Page 11: Data-Mining the Web Using Perl

The First Perl Program

Go to the QuaSSI Website for the example scripts for today’s workshop:
• http://qssi.psu.edu/files/howdy.pl

Right-click on the first script, “howdy.pl”, and save it to c:\Perl\eg\

Open up the text-editor WinEdt (you could use almost anything) and then open howdy.pl

That’s a complete Perl program.
Note: that’s all a program is – a text file.

Page 12: Data-Mining the Web Using Perl

Running a Perl Program

Go back to your command prompt.
Type perl -w howdy.pl
(The -w tells perl to give you warnings about what might be wrong if the program is broken. It goes before the script name; anything typed after the script name is handed to the script as an argument instead.)

Page 13: Data-Mining the Web Using Perl

Modifying a program

Go back to WinEdt
Edit the text between the quotation marks to say something new
Click File -> Save
Go back to the command prompt
Hit the up arrow (to get the last command, perl -w howdy.pl) and press Enter
Look at that – you’re a programmer!

Page 14: Data-Mining the Web Using Perl

Break the program

Go back to WinEdt
Delete the semicolon at the end of the line
Save the file
Go back to the command prompt and run the program, with -w, again
What happened?

Page 15: Data-Mining the Web Using Perl

Perl at 30,000 feet

Much of the next set of slides is stolen shamelessly from Andy Lester’s “Perl at 10,000 Feet” at www.petdance.com

(I’m skipping even more than he did.)

Page 16: Data-Mining the Web Using Perl

Some generalities about Perl

Statements in Perl are, or usually can be, constructed in a fairly natural English-like way.
There are many ways to do any one thing.
The syntax can be offputting and hard to read, especially at first. It is easy to “obfuscate” Perl code and this is sometimes done intentionally.
Main syntax rule: end every statement with ;

Page 17: Data-Mining the Web Using Perl

Data Types

• Scalars
• Arrays and Lists
• Hashes
• References
• Filehandles
• Objects

Page 18: Data-Mining the Web Using Perl

Scalars

Numbers
• Generally decimal floating point
• (Can be made integer, octal, hexadecimal)

Strings
• Can contain any character
• Can be empty: ""
• Can be arbitrarily large

Page 19: Data-Mining the Web Using Perl

Strings

Single-quoted
• characters are as shown with only two exceptions.
    single-quote in a single-quoted string requires \'
    backslash in a single-quoted string requires \\

Double-quoted
• it will interpolate – substitute the values of variables and expand control sequences.

For example
• $foo = "myfile";
• $datafile = "$foo.txt";
• will result in the variable $datafile holding the string "myfile.txt"

Another example
• print 'Howdy\n'; will print:
    Howdy\n
• print "Howdy\n"; will print
    Howdy
• (\n is a control sequence, standing for “new line”).
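The quoting rules on this slide can be tried out directly. The following is a minimal sketch; the variable $literal is added here for contrast and is not part of the original slides.

```perl
use strict;
use warnings;

my $foo      = "myfile";
my $datafile = "$foo.txt";    # double quotes interpolate $foo
my $literal  = '$foo.txt';    # single quotes leave the text exactly as typed

print "$datafile\n";          # myfile.txt
print "$literal\n";           # $foo.txt
print 'Howdy\n', "\n";        # backslash and n are printed as-is, then a real newline
print 'It\'s a backslash: \\', "\n";   # the two single-quote exceptions
```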

Page 20: Data-Mining the Web Using Perl

Scalar operators

Math
• *, /, % (for modulo), ** (for exponentiation), etc.

Strings
• x to repeat the thing on the left
    "b" x 10 gives "bbbbbbbbbb"
• . concatenates strings
    ("na" x 16)." Batman!" gives ...

Perl knows to convert when mixing these two types:
• "3"*4 gives 12
• "3".4 gives "34"
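A short sketch of these operators in a runnable script; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $repeated = "b" x 10;                    # repetition
my $theme    = ("na" x 16) . " Batman!";    # repetition, then concatenation
my $product  = "3" * 4;                     # the string "3" is treated as a number
my $joined   = "3" . 4;                     # the number 4 is treated as a string
my $power    = 2 ** 10;                     # exponentiation
my $leftover = 17 % 5;                      # modulo

print "$repeated\n$theme\n";
print "$product $joined $power $leftover\n";
```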

Page 21: Data-Mining the Web Using Perl

Comparing Scalars

Comparison        Numeric   String
Equal             ==        eq
Not equal         !=        ne
Less than         <         lt
Greater than      >         gt
Less / equal      <=        le
Greater / equal   >=        ge

8 < 25          TRUE!
"8" lt "25"     FALSE!
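The difference between the two columns matters most when sorting. The sketch below uses the sort comparators <=> and cmp, which are not on the slide but are the numeric and string counterparts of the operators in the table.

```perl
use strict;
use warnings;

my $numeric_less = (8 < 25)       ? "true" : "false";   # compares as numbers
my $string_less  = ("8" lt "25")  ? "true" : "false";   # compares character by character

my @as_numbers = sort { $a <=> $b } (25, 8, 100);
my @as_strings = sort { $a cmp $b } (25, 8, 100);

print "8 < 25 is $numeric_less; \"8\" lt \"25\" is $string_less\n";
print "numeric sort: @as_numbers\n";
print "string sort:  @as_strings\n";
```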

Page 22: Data-Mining the Web Using Perl

Variables

A sign, followed by a letter (or underscore), followed by any mix of letters, digits, and underscores.

Sign determines the type:
• $foo is a scalar
• @foo is an array (it holds a list)
• %foo is a hash

Variables default to global (they apply in all parts of your program). This can be problematic.
• local $var gives a global variable a temporary value for the current “block” of code (and anything called from it).
• my $var creates a variable that exists only inside the current block, and is the more usual construction.
• the very common use strict; at the beginning of code forces good practice in declaring variables (creates more compile-time errors, but prevents more whoppers that could blow everything up.)
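A small sketch of how block scoping behaves in practice. The names $count, $mode, current_mode and with_local are made up for illustration.

```perl
use strict;
use warnings;

my $count = 1;                    # visible for the rest of this file
{
    my $count = 99;               # a separate variable, alive only inside this block
    print "inside:  $count\n";
}
print "outside: $count\n";        # the outer variable is untouched

our $mode = "global";             # a package (global) variable
sub current_mode { return $mode }
sub with_local {
    local $mode = "temporary";    # temporary value, also seen by subs called from here
    return current_mode();
}

my $during = with_local();
my $after  = current_mode();      # the old value is restored automatically
print "during: $during, after: $after\n";
```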

Page 23: Data-Mining the Web Using Perl

Lists and Arrays

A list is an ordered set of (usually) scalars.
An array is a variable holding a list.
    my @foo = (1,2,3);
    my @bar = ("elephant", 3.14);
Can be constructed as lists of scalar variables:
• my @data = ($name, $address, $SSN);

Page 24: Data-Mining the Web Using Perl

Using Arrays

Elements are indexed, from 0.
• my @animals = ("frog", "bear", "elephant");
• print $animals[2]; # prints elephant
• Note: element is a scalar, so $ rather than @

Subsections are “slices”.
• my @mammals = @animals[1,2];

Lots of functions for
• using as a stack (moving things on and off the right or left side of the array).
• sorting
• joining two arrays
• splitting a scalar string into an array
    my $sentence = "This is my sentence.";
    my @words = split(" ", $sentence);
    # now @words contains ("This", "is", "my", "sentence.")
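The array operations listed above can be sketched together as follows. The built-ins push, shift, sort, split and join are standard Perl; the extra element "newt" is just an example value.

```perl
use strict;
use warnings;

my @animals = ("frog", "bear", "elephant");
my @mammals = @animals[1, 2];           # a slice: bear, elephant

push @animals, "newt";                  # add to the right-hand end
my $first  = shift @animals;            # remove from the left-hand end
my @sorted = sort @animals;             # alphabetical order
my @all    = (@sorted, @mammals);       # joining two arrays into one list

my $sentence = "This is my sentence.";
my @words    = split(" ", $sentence);   # split on whitespace
my $rejoined = join("-", @words);       # and glue back together

print "first: $first\nsorted: @sorted\nall: @all\nrejoined: $rejoined\n";
```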

Page 25: Data-Mining the Web Using Perl

Programming Controls

Control structures
• if / elsif / else (there is no “then”; the block in braces plays that role)
• while
• do {} while
• do {} until
• for ()
• foreach() # loops over a list

Errors / warnings
• die "message" kills program and prints “message”.
• warn "message" prints message and keeps going.
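A brief sketch showing several of these control structures, plus die and warn, in one script. The scores and grade labels are invented for the example.

```perl
use strict;
use warnings;

my @scores = (55, 72, 91);
die "No scores to grade\n" unless @scores;    # die would stop the program here

my @grades;
foreach my $score (@scores) {                 # loops over a list
    if    ($score >= 90) { push @grades, "high" }
    elsif ($score >= 60) { push @grades, "middle" }
    else                 { push @grades, "low" }
}

my $total = 0;
my $i     = 0;
while ($i < @scores) {                        # runs while the condition holds
    $total += $scores[$i];
    $i++;
}

warn "Average is below 60\n" if $total / @scores < 60;   # warn prints and keeps going
print "@grades (total $total)\n";
```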

Page 26: Data-Mining the Web Using Perl

Hashes

“Associative arrays”
A set of
• values (any scalar), indexed by
• keys (strings)

Example
• my %info;
• $info{ "name" } = "Burt Monroe";
• $info{ "age" } = 39;

With hashes and arrays you can create almost any arbitrary data structure (even arrays of arrays, arrays of hashes, hashes of arrays, etc.)
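A sketch of the hash from the slide, followed by one nested structure (a hash of arrays). The %topics data is made up to illustrate the idea.

```perl
use strict;
use warnings;

my %info;
$info{"name"} = "Burt Monroe";
$info{"age"}  = 39;

for my $key (sort keys %info) {              # keys come back unordered, so sort them
    print "$key: $info{$key}\n";
}

# A hash of arrays: each speaker maps to a list of topics (hypothetical data).
my %topics = (
    smith => ["health", "taxes"],
    jones => ["health"],
);
push @{ $topics{jones} }, "defense";         # add to the list stored under "jones"
my $jones_count = scalar @{ $topics{jones} };

print "jones spoke on $jones_count topics: @{ $topics{jones} }\n";
```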

Page 27: Data-Mining the Web Using Perl

File Handling

open() function opens a file for processing.
Prefix the filename to define how
• "<" for input from existing file (read)
• ">" to create for output (write)
• ">>" to append to a file (that may not yet exist)

    open (IN, "<myfile.txt") or die "Can't open myfile.txt";

Can then read from the file with <>, one line per use. For the file opened above, that would be <IN>.
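The sketch below writes a small file and reads it back line by line. It uses the three-argument form of open with a lexical filehandle, a common alternative to the two-argument form on the slide; the file name myfile_demo.txt is hypothetical.

```perl
use strict;
use warnings;

my $file = "myfile_demo.txt";    # hypothetical file name, created and removed here

open(my $out, ">", $file) or die "Can't write $file: $!";
print $out "first line\nsecond line\n";
close($out);

open(my $in, "<", $file) or die "Can't open $file: $!";
my $lines = 0;
while (my $line = <$in>) {       # <$in> returns the next line, or undef at end of file
    chomp $line;                 # strip the trailing newline
    $lines++;
    print "$lines: $line\n";
}
close($in);

unlink $file;                    # tidy up the demo file
```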

Page 28: Data-Mining the Web Using Perl

Matching string patterns using regular expressions

This is where much of the power of Perl lies.
m/pattern/ will check the default variable ($_) for pattern.
$var =~ m/pattern/; will check $var for pattern.
If the pattern is in $var, then
• $var =~ m/pattern/ is TRUE.
If you “group” part of the pattern and it is present,
• $var =~ m/(pattern)/ is true, AND, now a variable named $1 contains the text that the group matched.
• Group more pieces of the pattern and the matches are stored in $2, $3, etc.
This only grabs the *first* match. To grab all, say
• my @matches = ($var =~ m/(pattern)/g);
• This will store every match in the array @matches.
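A minimal sketch of grouping and the /g modifier. The sample string and its phone numbers are invented for the example.

```perl
use strict;
use warnings;

my $var = "Call 555-1234 or 555-9876 today";

my ($prefix, $number);
if ($var =~ m/(\d\d\d)-(\d\d\d\d)/) {    # two groups, first match only
    ($prefix, $number) = ($1, $2);
    print "first match: prefix $prefix, number $number\n";
}

my @matches = ($var =~ m/(\d\d\d-\d\d\d\d)/g);   # /g in list context returns every match
print "all matches: @matches\n";
```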

Page 29: Data-Mining the Web Using Perl

What’s a “regular expression”?

Combination of
                any literal character, number, etc.
    .           any single character
    *           zero or more of the previous
    +           one or more of the previous
    ?           zero or one of the previous
    [aeiou]     character class – this is the vowels
    ^           beginning of the line
    $           end of the line
    \b          word boundary
    \d \D       digit / non-digit
    \s \S       space / non-space
    \w \W       word character / non-word character
    |           or – match this or that
    ()          grouping

See handout for more.

Page 30: Data-Mining the Web Using Perl

Examples

    Romeo|Juliet                                  “Romeo” or “Juliet”
    \d\d\d-\d\d\d\d                               a phone number
    (\d\d\d-)?\d\d\d-\d\d\d\d                     phone #, maybe w/ area code
    \b[aeiou]\w+                                  a word starting w/ a vowel
    \b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b     email address (add the /i modifier to ignore case)
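The patterns above can be exercised on a few sample strings, as in this sketch. All sample text, numbers and addresses are invented. Two small changes are made so the captures come out whole: the area-code group is written as non-capturing, (?: ... ), and the @ is escaped so Perl does not try to interpolate an array.

```perl
use strict;
use warnings;

my $text   = 'Romeo met Juliet under an old elephant.';
my @lovers = ($text =~ /(Romeo|Juliet)/g);
my @vowels = ($text =~ /\b([aeiou]\w+)/g);       # lowercase vowels only, as written

my $contact = 'Office: 555-0100, mobile: 814-555-0199';
my @phones  = ($contact =~ /((?:\d\d\d-)?\d\d\d-\d\d\d\d)/g);

my $note    = 'Write to juliet@example.org soon';
my ($email) = $note =~ /\b([A-Z0-9._%-]+\@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i;

print "lovers: @lovers\nvowel words: @vowels\nphones: @phones\nemail: $email\n";
```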

Page 31: Data-Mining the Web Using Perl

Modules

Hundreds of modules / packages available through CPAN.

ActivePerl gives a GUI for installing them in its “Perl Package Manager”.

Page 32: Data-Mining the Web Using Perl

A basic Perl example

Counting words.
• counter1.pl
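The workshop script counter1.pl is not reproduced in this transcript. As a rough sketch of the idea only, a word counter can be built from split and a hash; the sample text and variable names below are assumptions, not the contents of counter1.pl.

```perl
use strict;
use warnings;

my $text = "the cat and the hat and the bat";   # stand-in for text read from a file

my %count;
$count{ lc $_ }++ for split(" ", $text);        # one hash entry per distinct word

# Most frequent first; ties broken alphabetically.
my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

for my $word (@ranked) {
    print "$word\t$count{$word}\n";
}
```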

Page 33: Data-Mining the Web Using Perl

Grabbing from the web

The basic idea is simply to have Perl act as an “agent”, in the way a browser like Explorer or Firefox does -- requesting and interpreting webpages.

There are a few basic modules that can do this.

Page 34: Data-Mining the Web Using Perl

LWP::Simple

• lwpsimpleget.pl
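The script lwpsimpleget.pl is not included in this transcript. The sketch below shows the general shape of a fetch with LWP::Simple, whose get function returns the page content or undef on failure. The helper name fetch_page is made up, and the module is loaded on demand so that the script still compiles on a machine where LWP has not been installed yet.

```perl
use strict;
use warnings;

# Fetch one page and return its contents, or die if the request fails.
sub fetch_page {
    my ($url) = @_;
    require LWP::Simple;                  # loaded on demand; install from CPAN if missing
    my $html = LWP::Simple::get($url);    # undef means the request failed
    die "Couldn't fetch $url\n" unless defined $html;
    return $html;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $html = fetch_page($ARGV[0]);
    print length($html), " characters retrieved\n";
}
```

Run it as perl -w script.pl followed by a URL you are permitted to request.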

Page 35: Data-Mining the Web Using Perl

LWP::UserAgent

More elaborate than LWP::Simple.
I’m going to skip that one today, but it’s covered in detail in the main books
• Perl & LWP
• Spidering Hacks

Pretty much all of the functionality has been wrapped more intuitively into ...

Page 36: Data-Mining the Web Using Perl

WWW::Mechanize

• mechanizeget.pl
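The script mechanizeget.pl is not reproduced here. As a hedged sketch of typical WWW::Mechanize usage, the code below fetches a page and lists the absolute URLs it links to. The helper name list_links and the agent string are hypothetical, and the module is loaded on demand so the script compiles even where it is not yet installed.

```perl
use strict;
use warnings;

# Return the absolute URL of every link on the page at $url.
sub list_links {
    my ($url) = @_;
    require WWW::Mechanize;                             # from CPAN
    my $mech = WWW::Mechanize->new(autocheck => 1);     # die on HTTP errors
    $mech->agent('MyResearchBot/0.1 (you@example.edu)'); # hypothetical: identify yourself
    $mech->get($url);
    return map { $_->url_abs->as_string } $mech->links;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    print "$_\n" for list_links($ARGV[0]);
}
```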

Page 37: Data-Mining the Web Using Perl

Scraping

At its base, this is just extracting information from the page(s) you download.

Simple example:
• freshair.pl
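The script freshair.pl is not shown in this transcript. To illustrate regex-based scraping in general, the sketch below pulls a URL, title and date out of each row of a small HTML fragment. The markup is invented for the example; a real page will need a pattern fitted to its own structure.

```perl
use strict;
use warnings;

# Stand-in for a downloaded page (hypothetical markup).
my $html = <<'HTML';
<ul>
  <li><a href="/shows/101">Interview with a Senator</a> <span class="date">2006-03-01</span></li>
  <li><a href="/shows/102">A Novelist on Writing</a> <span class="date">2006-03-02</span></li>
</ul>
HTML

my @rows;
while ($html =~ m{<a href="([^"]+)">([^<]+)</a>\s*<span class="date">([^<]+)</span>}g) {
    push @rows, { url => $1, title => $2, date => $3 };   # one record per match
}

# Tab-separated output is easy to load into a spreadsheet or statistics package.
print join("\t", @{$_}{qw(date title url)}), "\n" for @rows;
```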

Page 38: Data-Mining the Web Using Perl

Your agent can interact ...

For example, what if the webpage involves a form ...

Example
• abstracts.pl

You can authenticate with username and password, run through proxy servers, and so on.
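The script abstracts.pl is not reproduced here. One way to fill in and submit a form is WWW::Mechanize's submit_form method, sketched below. The helper name search_form and the field name "q" are hypothetical; inspect the real form's HTML to find its actual field names.

```perl
use strict;
use warnings;

# Load the page at $url, fill in its first form, submit it, and return the result page.
sub search_form {
    my ($url, $query) = @_;
    require WWW::Mechanize;                            # from CPAN
    my $mech = WWW::Mechanize->new(autocheck => 1);    # die on HTTP errors
    $mech->get($url);
    $mech->submit_form(
        form_number => 1,                  # the first form on the page
        fields      => { q => $query },    # "q" is an assumed field name
    );
    return $mech->content;
}

# Only touch the network when a URL and a query are given on the command line.
print search_form(@ARGV[0, 1]) if @ARGV == 2;
```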

Page 39: Data-Mining the Web Using Perl

Spiders

Type 1 Requester
• Requests a few items with known urls from a website.

Type 2 Requester
• Requests a few items, then requests (some set of) pages to which those items link.

Type 3 Requester
• Starts at a given url, and then requests everything linked, everything linked by that, etc. at the same host server. The idea here is usually to download an entire website.

Type 4 Requester
• Starts at a given url, requests everything linked anywhere, everything linked by that, etc. until it, perhaps, visits the entire web.

YOU – I am talking to YOU – in all likelihood have no business writing Type 3 or Type 4 spiders. These can easily go seriously awry causing mayhem of many sorts. Write only spiders with known finite scope.
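One way to keep a spider's scope finite is to cap the number of pages and pause between requests. The sketch below is an illustration under stated assumptions, not a production crawler: the sub crawl and its parameters (start, max_pages, delay, fetch) are invented here, and the "site" is a pretend in-memory table so the example runs without touching the network.

```perl
use strict;
use warnings;

# Visit pages starting from $arg{start}, never more than $arg{max_pages} of them.
sub crawl {
    my (%arg) = @_;
    my @queue = ($arg{start});
    my (%seen, @visited);
    while (@queue and @visited < $arg{max_pages}) {
        my $url = shift @queue;
        next if $seen{$url}++;                    # never request the same page twice
        my @links = $arg{fetch}->($url);          # caller supplies the fetching code
        push @visited, $url;
        push @queue, grep { !$seen{$_} } @links;  # queue only pages not yet visited
        sleep $arg{delay} if $arg{delay};         # be polite: pause between requests
    }
    return @visited;
}

# A small pretend site: each page maps to the pages it links to.
my %site = (
    "/index" => ["/a", "/b"],
    "/a"     => ["/index", "/b"],
    "/b"     => ["/a", "/c"],
    "/c"     => [],
);

my @visited = crawl(
    start     => "/index",
    max_pages => 3,
    delay     => 0,                               # use a second or more against a real server
    fetch     => sub { @{ $site{ $_[0] } || [] } },
);
print "@visited\n";
```

In real use the fetch code reference would download the page and return its links, ideally restricted to a list of URLs you already know you want.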

Page 40: Data-Mining the Web Using Perl

Back to the Luxembourg Miner

Commune-level election results from Luxembourg.
• luxembourg.pl

Page 41: Data-Mining the Web Using Perl

More on Scraping

All of the examples so far scraped / parsed using regular expressions.

More structured data like HTML is often better (or only) addressed with more specialized tools:
• HTML::TokeParser
• HTML::TreeBuilder

There are modules for scraping from XML, spreadsheets, databases, Word docs, PDFs.
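As a sketch of the parser-based approach, the code below uses HTML::TokeParser to collect each link's address and text from an HTML string. The helper name extract_links and the sample markup are invented. The module comes from CPAN, so the call is wrapped in eval and the script reports a warning rather than failing where the module is not installed.

```perl
use strict;
use warnings;

# Return [href, link text] pairs for every <a href="..."> in an HTML string.
sub extract_links {
    my ($html) = @_;
    require HTML::TokeParser;                    # part of the HTML-Parser distribution
    my $parser = HTML::TokeParser->new(\$html)   # a scalar reference means "parse this string"
        or die "Couldn't create parser\n";
    my @links;
    while (my $tag = $parser->get_tag("a")) {    # skip ahead to the next <a> start tag
        my $href = $tag->[1]{href};              # element 1 holds the tag's attributes
        next unless defined $href;
        my $text = $parser->get_trimmed_text("/a");   # text up to the closing </a>
        push @links, [$href, $text];
    }
    return @links;
}

my $sample = '<p><a href="/a.html">First page</a> and <a href="/b.html">Second page</a></p>';
my @found  = eval { extract_links($sample) };
warn "HTML::TokeParser not available: $@" if $@;
print "$_->[0] => $_->[1]\n" for @found;
```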