Web Scraping with Python
DESCRIPTION
An introduction to crawling sites and extracting content from the unstructured data on the web, using the Python programming language and some existing Python modules.
TRANSCRIPT
Web Scraping with Python
Miguel Miranda de Mattos:@mmmattos - mmmattos.net
Porto Alegre, Brazil.
2012
Web Scraping with Python
● Tools:
○ BeautifulSoup
○ Mechanize
BeautifulSoup
An HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
● In Summary:
○ Navigate the "soup" of HTML/XML tags, programmatically
○ Access tags' properties and values
○ Search for tags and their attributes.
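The slides use BeautifulSoup 3, which is Python 2 only; the same three ideas carry over to its successor, the bs4 package. A minimal sketch, assuming bs4 is installed (the markup is made up for illustration):

```python
# Same ideas in the modern bs4 package (BeautifulSoup 4);
# BeautifulSoup 3, used on these slides, is Python 2 only.
from bs4 import BeautifulSoup

doc = "<html><h1>Heading</h1><p>Text"   # invalid markup: unclosed <p>
soup = BeautifulSoup(doc, "html.parser")

# Navigate the soup programmatically
print(soup.h1.text)              # Heading

# Access a tag's properties and values
print(soup.p.name)               # p

# Search for tags
print(len(soup.find_all("p")))   # 1
```

Note that even the unclosed `<p>` is repaired into a proper parse tree, which is exactly the "invalid markup" tolerance the description above promises.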
BeautifulSoup
● Example:

from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
print soup.prettify()

# <html>
#  <h1>
#   Heading
#  </h1>
#  <p>
#   Text
#  </p>
# </html>
BeautifulSoup
○ Searching / Looking for things
  ■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings'
  ■ findAll
    ● findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
    ● Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.
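The signature above can be exercised parameter by parameter. A sketch using the modern bs4 package (an assumption; there findAll survives as find_all and text= is spelled string=), with made-up markup:

```python
# Exercising find_all's main parameters in bs4, where BeautifulSoup 3's
# findAll is spelled find_all and the text= argument is now string=.
from bs4 import BeautifulSoup

doc = "<ul><li>a</li><li class='x'>b</li><li>c</li></ul>"
soup = BeautifulSoup(doc, "html.parser")

all_li    = soup.find_all("li")                       # by tag name
by_attr   = soup.find_all("li", attrs={"class": "x"}) # by attribute
by_text   = soup.find_all(string="c")                 # by contained text
first_two = soup.find_all("li", limit=2)              # stop after 2 matches

print(len(all_li), len(by_attr), len(first_two))  # 3 1 2
```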
BeautifulSoup
● Example:
>>> from BeautifulSoup import BeautifulSoup
>>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
>>> docSoup = BeautifulSoup(doc)
>>> print docSoup.findAll('tr')
[<tr><td>one</td><td>two</td></tr>]
>>> print docSoup.findAll('td')
[<td>one</td>, <td>two</td>]
BeautifulSoup
● findAll (cont'd.):

>>> for t in docSoup.findAll('td'):
...     print t
<td>one</td>
<td>two</td>

>>> for t in docSoup.findAll('td'):
...     print t.getText()
one
two
BeautifulSoup
● findAll using attributes to qualify:

>>> soup.findAll('div', attrs={'class': 'Menus'})
[<div>musicMenu</div>, <div>videoMenu</div>]
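A runnable version of that attribute-qualified search, assuming the modern bs4 package; the markup and the "Menus" class name are made up to match the output shown above:

```python
# Filtering tags by attribute values with bs4's find_all
# (BeautifulSoup 3's findAll with attrs= works the same way).
from bs4 import BeautifulSoup

html = ('<div class="Menus">musicMenu</div>'
        '<div class="Menus">videoMenu</div>'
        '<div class="Other">helpMenu</div>')
soup = BeautifulSoup(html, "html.parser")

# Only the two divs whose class is "Menus" match
menus = soup.find_all("div", attrs={"class": "Menus"})
print([d.text for d in menus])  # ['musicMenu', 'videoMenu']
```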
● For more options:
  ○ dir(BeautifulSoup)
  ○ help(yourSoup.<command>)
● Use BeautifulSoup rather than regexp patterns. Replace:

patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
re.findall(patFinderTitle, html)

  ○ with:

soup = BeautifulSoup(html)
for tag in soup.findAll('a'):
    print tag['title']
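The two approaches can be compared side by side. A sketch assuming the modern bs4 package, with made-up sample markup; the regex is the one from the slide:

```python
# Regex vs. parser for extracting title attributes from <a> tags.
import re
from bs4 import BeautifulSoup

html = '<a href="/a" title="First">A</a> <a href="/b" title="Second">B</a>'

# Fragile: the regex breaks if attributes are reordered, quoted
# differently, or split across lines.
pat = re.compile(r'<a[^>]*\stitle="(.*?)"')
print(pat.findall(html))  # ['First', 'Second']

# Robust: the parser understands the markup, so attribute order
# and minor malformations do not matter.
soup = BeautifulSoup(html, "html.parser")
print([tag["title"] for tag in soup.find_all("a")])  # ['First', 'Second']
```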
Mechanize
● Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize.
● mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
  ○ any URL can be opened, not just http:
  ○ mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
● Easy HTML form filling.
● Convenient link parsing and following.
● Browser history (.back() and .reload() methods).
● The Referer HTTP header is added properly (optional).
● Automatic observance of robots.txt.
● Automatic handling of HTTP-Equiv and Refresh.
Mechanize
● Navigation commands:
○ open(url)
○ follow_link(link)
○ back()
○ submit()
○ reload()
● Examples
br = mechanize.Browser()
br.open("http://www.python.org/")
gothtml = br.response().read()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
Mechanize
● Example:
import re
import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")

# follow second link with element text matching
# regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop")

assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info()  # headers
print response1.read()  # body
Mechanize
● Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup

url = "http://www.hp.com"
br = mechanize.Browser()

br.open(url)
assert br.viewing_html()
html = br.response().read()
soup = BeautifulSoup(html)

found_divs = soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    print d
Mechanize
● Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup

url = "http://www.hp.com"
br = mechanize.Browser()

br.open(url)
assert br.viewing_html()
html = br.response().read()
soup = BeautifulSoup(html)

found_divs = soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    if d.has_key('class'):
        print d['class']