1 spidering the web in python csc 161: the art of programming prof. henry kautz 11/23/2009
TRANSCRIPT
![Page 1: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/1.jpg)
1
Spidering the Web in Python
CSC 161: The Art of ProgrammingProf. Henry Kautz
11/23/2009
![Page 2: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/2.jpg)
2
Parsing HTMLParsing: making the structure implicit in a piece
of text explicit
Recall: Structure in HTML is indicated by matching pairs of opening / closing tags
Key parsing task: find opening / closing tags
Separate out:Kind of tagAttribute / value pairs in opening tag (if any)Text between matching opening and closing tags
![Page 3: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/3.jpg)
3
HTMLParserHTMLParser: a module for parsing HTML
HTMLParser object: parses HTML, but doesn't actually do anything with the parse (doesn't even print it out)
So, how to do something useful?
We define a new kind of object based on HTMLParserNew object class inherits ability to parse HTMLDefines new methods that do interesting things
during the parse
![Page 4: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/4.jpg)
4
![Page 5: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/5.jpg)
5
![Page 6: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/6.jpg)
6
Finding LinksMany applications involve finding the hyperlinks
in a documentCrawling the web
Instead of printing out all tags, let's only print the ones that are links:Tag name is "a"Have an "href" attribute
![Page 7: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/7.jpg)
7
![Page 8: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/8.jpg)
8
![Page 9: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/9.jpg)
9
Relative LinksNot all of the links printed out by LinkParser are complete
URLs research/index.html Missing method (http://), domain name
(www.cs.rochester.edu), and root directory (/u/kautz) Example of a relative URL
Given that Current page is
http://www.cs.rochester.edu/u/kautz/ Relative link is
research/index.html Full URL for link is
http://www.cs.rochester.edu/u/kautz/research/index.html
![Page 10: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/10.jpg)
10
Converting Relative URLs to Full URLs
There's a Python Library for that!
>>> import urlparse
>>> current = http://www.cs.rochester.edu/u/kautz/
>>> link = research/index.html
>>> link_url = urlparse.urljoin(current, link)
>>> print link_url
http://www.cs.rochester.edu/u/kautz/research/index.html
![Page 11: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/11.jpg)
11
Making LinkParser Print Full URLs
We'd like LinkParser to convert relative URLs to full URLs before printing them
Problem: LinkParser doesn't know the URL of the current page! The feed method passes it the contents of the current
page, but not the URL of the current page
Solution: Add a new method to LinkParser that: Is passed the URL of the current page Remembers that URL in a local variable Opens that page and reads it in Feeds itself the data from the page
![Page 12: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/12.jpg)
12
![Page 13: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/13.jpg)
13
![Page 14: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/14.jpg)
14
TopicsHTML, the language of the Web
Accessing web resources in Python
Parsing HTML filesDefining new kinds of objectsHandling error conditions
Building a web spider
Writing CGI scripts in Python
![Page 15: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/15.jpg)
15
Finding Information on the WWW
There is no central index of everything that is on the World Wide WebPart of the design of the
internet: no central point of failure
In the beginning of the WWW, to find information, you had to know where to look, or follow links that were manually created, and hope you find something useful!
![Page 16: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/16.jpg)
16
SpidersTo fill the need for an easier way to find
information on the WWW, spiders were inventedStart at some arbitrary URLRead in that page, and index the contents
For each word on the page, dictionary[word] = URLGet the links from the pageFor each link, apply this algorithm
![Page 17: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/17.jpg)
17
Designing a Web SpiderWe need to get a list of the links from a page,
not just print the list
How can we modify LinkParser2 to do this?
Solution:Create a new local variable in the object to store
the list of linksWhenever handle_starttag finds a link, add the link
to the listThe getLinks method returns this list
![Page 18: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/18.jpg)
18
![Page 19: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/19.jpg)
19
![Page 20: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/20.jpg)
20
Handling Non-HTML PagesSuppose the URL passed to getLinks happens
not to be an HTML page?E.g.: image, video file, pdf file, ...Not always possible to tell from URL alone
Current solution not robust
![Page 21: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/21.jpg)
21
![Page 22: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/22.jpg)
22
Checking Page TypeWe would like to know for sure if a page contains
HTML before trying to parse it
We can't possibly have this information for sure until we open the file
We want to check the type before we read in the data
How to do this?
![Page 23: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/23.jpg)
23
Getting Information about a Web Resource
Fortunately, there's a (a pair of) methods for this!url_object.info().gettype() = string describing the type
of thing to which url_object refers
>>> url_object = urlopen('http://www.cs.rochester.edu/u/kautz/')
>>> url_object.info()
<httplib.HTTPMessage instance at 0x149a058>
>>> url_object.info().gettype()
'text/html'
>>> url_object = urlopen('http://www.cs.rochester.edu/u/kautz/cv.pdf')
>>> url_object.info().gettype()
'application/pdf'
![Page 24: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/24.jpg)
24
One other addition: we'll return the entire original page text as well as its links. This
will be useful soon.
![Page 25: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/25.jpg)
25
![Page 26: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/26.jpg)
26
Handling Other Kinds of Errors
We can now handle pages that are not HTML, but other kinds of errors can still cause the parser to fail: Page does not exist Page requires a password to access
Python has a general way of writing programs that "keep on going" even if something unexpected happens:
try:
some code
except:
run this if code above fails
![Page 27: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/27.jpg)
27
Calling getLinksParser Robustly
parser = getLinksParser()
try:
links = parser.getLinks("http://www.henrykautz.com/")
print links
except:
print "Page not found"
![Page 28: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/28.jpg)
28
SpidersTo fill the need for an easier way to find
information on the WWW, spiders were inventedStart at some arbitrary URLRead in that page, and index the contents
For each word on the page, dictionary[word] = URLGet the links from the pageFor each link, apply this algorithm
![Page 29: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/29.jpg)
29
Simple SpiderTo fill the need for an easier way to find
information on the WWW, spiders were inventedGiven: a word to find Start at some arbitrary URLWhile the word has not been found:
Read in that page and see if the word is on the pageGet the links from the pageFor each link, apply this algorithm
What is dangerous about this algorithm?
![Page 30: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/30.jpg)
30
Safer Simple SpiderTo fill the need for an easier way to find
information on the WWW, spiders were inventedGiven: a word to find and a maximum number of
pages to downloadStart at some arbitrary URLWhile the word has not been found and the
maximum number of pages has not been reached:Read in the page and see if the word is on the pageGet the links from the pageFor each link, apply this algorithm
![Page 31: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/31.jpg)
31
pages_to_visitOur spider needs to keep track of the URLs that
it has found, but has not yet visited
We can use a list for this
Add new links to the end of the list
When we are ready to visit a new page, select (and remove) the URL from the front of the list
In CS 162, you'll learn that this use of a list is called a queue, and this type of search is called breadth-first
![Page 32: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/32.jpg)
32
![Page 33: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/33.jpg)
33
DemonstrationStarting at my personal home page,
www.henrykautz.orgfind an HTML page containing the word "skeleton"
Limit search to 200 pages
![Page 34: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/34.jpg)
34
![Page 35: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/35.jpg)
35
![Page 36: 1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009](https://reader036.vdocuments.net/reader036/viewer/2022062321/56649dda5503460f94ad028f/html5/thumbnails/36.jpg)
36
Coming UpTuesday: written exercises for preparing for final
Will not be handed in If you do them, you'll do well on the exam If you ignore them, you won'tWork on your own or with friends
Wednesday: Next steps in computer science
Monday: Solutions to exercises & course review for final